Abstract

The present series of experiments explores several issues related to gesture–speech integration and synchrony during sentence processing. To be able to more precisely manipulate gesture–speech synchrony, we used gesture fragments instead of complete gestures, thereby avoiding the usual long temporal overlap of gestures with their coexpressive speech. In a pretest, the minimal duration of an iconic gesture fragment needed to disambiguate a homonym (i.e., disambiguation point) was therefore identified. In three subsequent ERP experiments, we then investigated whether the gesture information available at the disambiguation point has immediate as well as delayed consequences on the processing of a temporarily ambiguous spoken sentence, and whether these gesture–speech integration processes are susceptible to temporal synchrony. Experiment 1, which used asynchronous stimuli as well as an explicit task, showed clear N400 effects at the homonym as well as at the target word presented further downstream, suggesting that asynchrony does not prevent integration under explicit task conditions. No such effects were found when asynchronous stimuli were presented using a more shallow task (Experiment 2). Finally, when gesture fragment and homonym were synchronous, similar results as in Experiment 1 were found, even under shallow task conditions (Experiment 3). We conclude that when iconic gesture fragments and speech are in synchrony, their interaction is more or less automatic. When they are not, more controlled, active memory processes are necessary to be able to combine the gesture fragment and speech context in such a way that the homonym is disambiguated correctly.

INTRODUCTION

In everyday face-to-face conversation, speakers not only use speech to transfer information but also rely on facial expressions, body posture, and gestures. In particular, co-speech gestures play an important role in daily communication. This category of spontaneous hand movements consists of different subtypes such as beats, emblems, and deictic, metaphoric, and iconic gestures. Iconic gestures are distinguished by their “close formal relationship to the semantic content of speech” (McNeill, 1992, p. 12) and are the most thoroughly studied gesture type. For instance, a speaker might perform a typing movement with his fingers while saying: “Yesterday I wrote the letter.” In this example, both gesture and speech refer to the same underlying idea, namely, writing something. A steadily increasing number of studies have shown that such iconic gestures are not only closely linked to the content of the accompanying speech but that they also have an effect on speech comprehension (ERP studies: Holle & Gunter, 2007; Kelly, Ward, Creigh, & Bartolotti, 2007; Özyürek, Willems, Kita, & Hagoort, 2007; Wu & Coulson, 2005; Kelly, Kravitz, & Hopkins, 2004; fMRI studies: Holle, Obleser, Rueschemeyer, & Gunter, 2010; Green et al., 2009; Holle, Gunter, Rueschemeyer, Hennenlotter, & Iacoboni, 2008; Willems, Özyürek, & Hagoort, 2007). Although such experiments suggest that a listener can extract additional information from iconic gestures (e.g., in the example above, we know that the letter was written on a keyboard and not with a pen) and use that information linguistically, little is known so far about the timing issues involved in this process and how the interplay between gesture and speech is actually functioning. Before discussing how the present study investigated speech–gesture timing, we will review how timing has been discussed in the gesture literature so far.

Based on physical properties, a gesture phrase can usually be divided into three consecutive time phases: preparation, stroke, and retraction (McNeill, 1992). The preparation phase is the period from the beginning of the gesture movement up to the stroke. Typically, the hands start to rise from a resting position and travel to the gesture space where the stroke is to be performed. Very often, some “phonological” features of the stroke are already present during the preparation phase. This is why it has been claimed that a key characteristic of the preparation phase is anticipation (McNeill, 1992).

The most salient part of a gesture is the stroke phase. The peak effort of the hand movement marks the onset of the stroke phase. This phase lasts until the hands start to return to the resting position. The form of the stroke does not anticipate some upcoming event (like the preparation does), but “the effort is focused on the form of the movement itself—its trajectory, shape, posture” (McNeill, 1992, p. 376). Most researchers consider the stroke as the essential part of a gesture phrase (but see Kita, van Gijn, & van der Hulst, 1998), whereas the preparation and the retraction can be omitted. The retraction phase begins where the hands start to return to the resting position and ends when they have reached this position. This final gesture phase mirrors the anticipation phase in that the effort is focused on reaching an end position.1

Although these time phases are defined on a formal level, previous work suggests that meaning is conveyed to a different degree throughout the three gesture phases. In particular, the stroke phase has been described as the segment of iconic gesture in which the meaning is expressed (McNeill, 1992). Speakers have a high tendency to produce the onset of the stroke in combination with that part of speech which the gesture describes (cf. the lexical affiliate; Levelt, Richardson, & la Heij, 1985). This synchronization between gesture and speech at the stroke is probably one important cause for the observation that the stroke phase is the most meaningful segment of an iconic gesture.

In essence, it seems that gesture timing is discussed in rather broad terms of three consecutive phases. The present series of experiments set out to detail the timing issues related to gesture–speech integration during sentence processing. To do so, we were faced with an interesting challenge, namely, the large overlap in time between gesture and speech. It is clear from the foregoing paragraphs that before the most meaningful part of a gesture (i.e., the stroke) is present, some preparatory motor activity is going on, whereas after the stroke, gesture tends to linger on during the retraction phase. Such large overlaps with speech make a precise measurement of speech–gesture integration (including synchrony issues) very difficult in a sentence context. As an illustration, the timing parameters of the stimulus material as used by Holle and Gunter (2007) are given in Figure 1A. As can be seen clearly, the full-length gesture completely overlapped with the first part of the second sentence, which makes it impossible to manipulate gesture–speech synchrony using full gestures without simultaneously changing the amount of gesture information that is available at a given point in time. To avoid such an undesirable confound between synchrony and gesture information, we decided to find a way to reduce the length of gesture information without changing its impact on speech substantially. Determining such gesture fragments would then enable us to investigate timing issues with much greater precision.

Figure 1. 

Timing of gesture and speech. (A) The temporal alignment of the original full-length gesture material used by Holle and Gunter (2007). As can be seen, the full-length gestures overlap with most of the main clause, including verb, homonym, and target word. Thus, a precise investigation of the temporal aspects of gesture–speech integration as well as a distinction between integration and disambiguation is virtually impossible with full-length gestures. (B) The timing of the gesture fragments with respect to their accompanying speech, as used in the present study. In Experiments 1 and 2, gesture fragment (disambiguation point [DP]) and speech (IP) were asynchronous, that is, on average, gesture did not temporally overlap with the homonym. In contrast, gesture fragments and speech were synchronous in Experiment 3.

One possibility to find gesture fragments of minimal length that still have a communicative impact is the use of a gating paradigm. Although usually implemented in studies on spoken word recognition, gating can be used with virtually any kind of sequential linguistic stimuli. In gating, stimuli are presented in sequences of increasing duration. After each segment, participants are asked to classify the segment on a given parameter. For example, participants can judge the sequence with respect to prosodic (Grosjean, 1983) or syntactic parameters (Friederici, Gunter, Hahne, & Mauth, 2004) or try to identify the complete stimulus based on the perceived segment (Grosjean, 1996). An important difference between these examples is the number of possible response alternatives, ranging from very few in the case of word category identification to thousands of possibilities in the case of spoken word recognition. With regard to iconic gestures, it is known that participants have difficulties identifying the correct corresponding speech unit when these gestures are presented in isolation (Hadar & Pinchas-Zamir, 2004; Krauss, Morrel-Samuels, & Colasante, 1991). Thus, iconic gestures more clearly convey their meaning when viewed in combination with a semantic context. Therefore, in contrast to, for instance, words or signs, which are lexicalized and can be identified without any additional contextual information, the only appropriate way to identify when an iconic gesture becomes meaningful is to investigate it in the presence of a semantic context.

Homonyms (words with multiple unrelated meanings, e.g., ball) may be especially suited to provide such a context. In the situation of an unbalanced homonym such as ball, there are only two response alternatives to choose from: a more frequent dominant meaning (e.g., game) and a less frequent subordinate meaning (e.g., dance). Previously conducted experiments indicate that a complete iconic gesture can disambiguate a homonym (Holle & Gunter, 2007). In these ERP experiments, participants watched videos of an actress producing German sentences that contained an unbalanced homonym in the initial clause.2 The homonym was disambiguated at a target word in the following subclause. Coincident with the initial part of the sentence, the actress made a gesture that supported either the dominant or the subordinate meaning. Holle and Gunter (2007) found that the amplitude of the N400 at the target words varied reliably as a function of the congruency of the preceding gesture–homonym combination. The N400 was found to be smaller when the targets were preceded by a congruent gesture–homonym combination and larger when preceded by a gesture–homonym combination incongruent with their meaning.

As mentioned above, the present series of experiments tried to detail the role of timing/synchrony in gesture–speech integration during sentence processing. To do so, we first use a gating procedure as a pretest in order to solve the challenge of the large overlap in time between gesture and speech normally seen for complete gestures. In this context-guided gating paradigm, we determine a so-called disambiguation point (DP) of the original gestures of Holle and Gunter (2007). In the next step, these DPs are used to shorten the original gestures. These gesture fragments are then used in three ERP experiments which investigate whether the gesture information available at the DP has immediate as well as delayed consequences on the processing of a temporarily ambiguous spoken sentence, and whether these gesture–speech integration processes are susceptible to temporal synchrony (see Figure 1B). Because timing is a critical issue here, we use ERPs as the dependent measure because they have excellent temporal resolution and offer the opportunity to measure gesture–speech interaction under different task conditions. Additionally, by using ERPs, we are able to investigate the disambiguating effect of gesture at different points in time. We conclude that when gesture and speech are in synchrony, their interaction is more or less automatic. In contrast, when both domains are not in synchrony, active gesture-related memory processes are necessary to be able to combine the gesture fragment and speech context in such a way that the homonym is successfully disambiguated.

EXPERIMENT 1: DISAMBIGUATION USING GESTURE FRAGMENTS

Experiment 1 serves as a basis for the other two experiments. It tests whether gesture fragments presented up to their DP are able to disambiguate speech. The DPs were assessed in a pretest using context-guided gating of the original material of Holle and Gunter (2007). The major advantage of using gesture fragments made out of this particular material is that we can measure the brain activity of our participants with greater precision at two positions in time. At the homonym position, speech–gesture integration can be measured directly, whereas the target word position, a few words downstream, gives us direct on-line evidence of whether this integration indeed led to a successful disambiguation. As can be seen in Figure 1B, the gesture fragment was presented earlier than the homonym in Experiment 1 (i.e., there was a clear gesture–speech asynchrony).

Methods

The Original Stimulus Material

The original stimulus material of Holle and Gunter (2007) made use of 48 homonyms having a clear dominant and subordinate meaning (for a description of how dominant and subordinate meanings were determined, see Gunter, Wagner, & Friederici, 2003). For each of these homonyms, two 2-sentence utterances were constructed including either a dominant or a subordinate target word. The utterances consisted of a short introductory sentence introducing a person followed by a longer complex sentence describing an action of that person. The complex sentence was composed of a main clause containing the homonym and a successive subclause containing the target word. Prior to the target word, the sentences for the dominant and subordinate versions were identical. A professional actress was videotaped while uttering the sentences. She was asked to simultaneously perform a gesture that supported the sentence. To minimize the influence of facial expression, the face of the actress was masked with a nylon stocking. All gestures directly related to the target word or resembling emblems were excluded. To improve the sound quality, the speech of the sentences was re-recorded in a separate session. Cross-splicing was applied to minimize the possibility that participants might use prosodic cues for lexical ambiguity resolution. Afterwards, the re-recorded speech was recombined with the original video material in such a way that the temporal alignment of a gesture and the corresponding homonym was identical to the alignment in the original video recordings (for more details about the recording scenario and preparation of the original stimuli, please see Holle & Gunter, 2007).

Rating of the Gesture Phases

In order to get a more detailed understanding of the stimulus material and as a preparation for the present set of experiments, the onset of the gesture preparation, as well as the on- and offset of the gesture stroke of the original gesture material, was independently assessed by two raters. The phases of the gestures were determined according to their kinetic features described in the guidelines on gesture transcription by McNeill (1992, p. 375f). To avoid a confounding influence of speech, the gesture videos were presented without sound for the rating procedure, as has been suggested in the Neuropsychological Gesture Coding System (Lausberg & Slöetjes, 2008; see also Lausberg & Kita, 2003). First, the onset and the offset of the complete hand movement were determined; then, the on- and offset of the stroke were identified based on the change of effort in the movement, that is, changes in movement trajectory, shape, posture, and movement dynamics (for details, see McNeill, 1992). The phase prior to the stroke onset was determined as the preparation phase, the phase after the stroke offset as the retraction. The movements did not include any holds. The two raters showed high agreement on the classification of the different gesture phases [e.g., interrater reliability (time of stroke onset) > .90]. In cases of disagreement about the exact time of the preparation onset, stroke onset, or stroke offset, the raters subsequently discussed the results and chose a time point they both considered appropriate. The values for the on- and offsets did not differ significantly across gesture conditions [all F(1, 94) < 1; see Table 1].

Table 1. 

Stimulus Properties

Gesture | Speech | Gesture Preparation Onset | Disambiguation Point (DP) | Gesture Stroke Onset | Gesture Stroke Offset | Homonym Onset | Homonym Identification Point (IP) | Target Word Onset | Target Word Offset
D | D | 1.72 (0.41) | 2.10 (0.44) | 2.07 (0.46) | 2.91 (0.48) | 2.84 (0.40) | 3.09 (0.41) | 3.78 (0.38) | 4.16 (0.38)
D | S | 1.72 (0.41) | 2.10 (0.44) | 2.07 (0.46) | 2.91 (0.48) | 2.84 (0.40) | 3.09 (0.41) | 3.80 (0.38) | 4.17 (0.38)
S | D | 1.68 (0.50) | 2.10 (0.51) | 2.17 (0.52) | 3.01 (0.51) | 2.84 (0.40) | 3.09 (0.41) | 3.78 (0.38) | 4.16 (0.38)
S | S | 1.68 (0.50) | 2.10 (0.51) | 2.17 (0.53) | 3.01 (0.51) | 2.84 (0.40) | 3.09 (0.41) | 3.80 (0.38) | 4.17 (0.38)
Mean |  | 1.70 (0.45) | 2.10 (0.47) | 2.12 (0.49) | 2.96 (0.50) | 2.84 (0.40) | 3.09 (0.41) | 3.79 (0.38) | 4.17 (0.38)

Mean onset and offset values are in seconds relative to the onset of the introductory sentence (SD in parentheses). The Gesture and Speech columns indicate the conveyed meaning of the gesture fragment and of the target word: dominant (D) or subordinate (S).

Pretest: Gating

To determine the point in time at which a gesture can reliably disambiguate a homonym, a context-guided gating was applied. Gating is a very popular paradigm in spoken word recognition (Grosjean, 1996). Its rationale is based on the assumption that spoken word recognition is a discriminative process, that is, with increasing auditory information, the number of potential candidate words is reduced until only the correct word remains (e.g., cohort model; see Gaskell & Marslen-Wilson, 1997; Marslen-Wilson, 1987). In this context, the identification point (IP) is defined as the amount of information needed to identify a word without change in response thereafter. Although gating is most common in spoken word recognition, it can be used with virtually any kind of sequential material [e.g., ASL (Emmorey & Corina, 1990); music sequences (Jansen & Povel, 2004)]. Because iconic gestures convey their meaning more clearly when produced with co-occurring speech, we employed a context-guided gating task to identify the isolation points of the gestures. Homonyms were used as context. This procedure has the advantage that the number of possible gesture interpretations is restricted to two, namely, the dominant meaning and the subordinate meaning of the homonym.

Forty native German-speaking participants took part in the gating pretest. A gating trial started with the visual presentation of the homonym for 500 msec (e.g., ball), followed by the gated gesture video. At 500 msec after the video offset, the participants had to determine whether the homonym referred to the dominant or the subordinate meaning based on gesture information. Three response alternatives were possible and simultaneously presented on the screen: (1) dominant meaning (e.g., the word game was displayed on the screen), (2) subordinate meaning (e.g., dance), and (3) “next frame.” Participants were instructed to choose the third response alternative until they felt they had some indication of which meaning was targeted by the gesture. The increment size was one video frame which corresponded to 40 msec, that is, each gate was 40 msec longer than the previous one. Gating started at the onset of the preparation phase and ended either when the offset of the stroke phase was reached or when the subject gave a correct response for 10 consecutive segments. Because very short video sequences are difficult to display and recognize, each segment also contained the 500 msec directly before the onset of the preparation. Thus, the shortest segment of each gesture had a length of 540 msec (500 + 40 msec for the first frame of the preparation phase). The gesture items were pseudorandomly distributed across two experimental lists. Each of the lists contained 24 of the original dominant and 24 of the original subordinate gestures, resulting in a total of 48 gestures per experimental list. For each homonym, either the dominant or the subordinate gesture was presented within one list.
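
For illustration, the following minimal sketch shows how the gate boundaries could be derived from the rated preparation onset and stroke offset of a gesture, assuming 25-fps video and the timing values described above; the function and parameter names are illustrative and not part of the original procedure (the 10-consecutive-correct stopping rule was applied at run time and is not shown).

```python
# Illustrative sketch of the gated segments (times in seconds).
# prep_onset and stroke_offset come from the gesture-phase rating.

FRAME = 0.040      # one video frame = 40 msec (25 fps)
LEAD_IN = 0.500    # each gate starts 500 msec before preparation onset

def gate_windows(prep_onset, stroke_offset):
    """Yield (start, end) times of successive gates, each one frame longer than the last."""
    n_frames = 1
    while prep_onset + n_frames * FRAME <= stroke_offset:
        yield (round(prep_onset - LEAD_IN, 3), round(prep_onset + n_frames * FRAME, 3))
        n_frames += 1

# Example: a gesture whose preparation starts at 1.70 s and whose stroke ends at 2.96 s
gates = list(gate_windows(1.70, 2.96))
print(len(gates), gates[0])   # first gate is 540 msec long (500 + 40 msec)
```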

The relevant information was the DP, which corresponds to the amount of gesture information needed to identify a gesture as either being related to the dominant or the subordinate meaning of a homonym without any changes in response thereafter. The mean DPs for the single items ranged from 2.22 to 19.63 frames (M = 9.88, SD = 3.6), calculated relative to preparation onset. Thus, on average, the participants needed to see about 400 msec of gesture to disambiguate a homonym. An ANOVA with the factors Word meaning frequency (2) and List (2) revealed that dominant gestures (M = 9.33, SD = 3.6) were identified earlier than subordinate gestures (M = 10.42, SD = 3.58), as indicated by the significant main effect of Word meaning frequency [F(1, 94) = 4.2, p < .05]. This result indicates that more gesture information is needed to select the subordinate meaning.
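
For clarity, the sketch below illustrates how a DP could be derived from a participant's gate-by-gate responses and converted from frames to milliseconds (e.g., the mean of 9.88 frames corresponds to roughly 9.88 × 40 ≈ 395 msec, i.e., about 400 msec of gesture); the Boolean coding of responses is an assumption made for illustration only.

```python
# Illustrative sketch: derive the disambiguation point (DP) from a sequence of
# gate-by-gate responses, coded as True (correct meaning chosen) or False
# ("next frame" or wrong meaning). The DP is the first gate from which the
# response never changes back.

def disambiguation_point(responses, frame_ms=40):
    """Return the DP in frames and in msec, or None if it was never reached."""
    for gate, resp in enumerate(responses, start=1):
        if resp and all(responses[gate - 1:]):   # correct from this gate onward
            return gate, gate * frame_ms
    return None

# Example: responses stabilise at the 10th gate -> DP = 10 frames = 400 msec
responses = [False] * 9 + [True] * 22
print(disambiguation_point(responses))   # (10, 400)
```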

When investigating the distribution of the DPs relative to the stroke onset, we found a surprising result. DPs ranged from almost 20 frames before the stroke onset to 9 frames past the stroke onset, with the DPs of 60 gestures being prior to the stroke onset. This means that almost two thirds of all gestures enabled a meaning selection before the participants had actually seen the stroke (see Supplementary Figure 1). The difference between DP and stroke onset was found to be significantly smaller than zero across participants [t1(1, 39) = −4.7, p < .001] and items [t2(1, 95) = −2.3, p < .05]. The corresponding minF′ statistic (Clark, 1973) was significant [minF′(1, 128) = 4.26, p < .05], indicating that gestures reliably enabled a meaning selection before stroke onset.
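
For reference, the reported minF′ value can be approximately reproduced from the by-participants and by-items statistics using the standard formula of Clark (1973), converting the t values to F values via F = t²:

$$F_1 = t_1^2 \approx 22.1\ (1, 39), \qquad F_2 = t_2^2 \approx 5.3\ (1, 95),$$

$$\mathrm{min}F' = \frac{F_1 F_2}{F_1 + F_2} \approx \frac{22.1 \times 5.3}{22.1 + 5.3} \approx 4.3, \qquad df_2 = \frac{(F_1 + F_2)^2}{F_1^2/95 + F_2^2/39} \approx 128.$$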

Stimuli: Gesture Fragments

The stimuli for Experiment 1 were constructed as follows. First, the original gesture and speech streams were separated. Full-length gesture streams were then replaced with gesture streams cut at the DP. The duration of the gesture streams was adjusted to the duration of the speech streams by adding a recording of the corresponding empty video background. This manipulation created the illusion of a speaker disappearing from the screen while the speech was still continuing for a short amount of time. Speech streams were recombined with both the clipped dominant as well as clipped subordinate gesture streams, resulting in a 2 × 2 design, with Gesture (Dominant and Subordinate) and Speech (Dominant and Subordinate) as within-subject factors (see Table 2; see also supplementary on-line materials for examples of the videos). Each of the four conditions (DD, DS, SD, and SS) contained 48 items, resulting in an experimental set of 192 items. The items were pseudorandomly distributed to four blocks of 48 items, ensuring that (i) each block contained 12 items of all four conditions and (ii) each block contained only one of the four possible gesture–speech combinations for each homonym.
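
One way to satisfy both constraints is a Latin-square rotation of the four conditions across blocks; the sketch below is illustrative only (it is not the original randomization code) and shows such an assignment.

```python
import random
from collections import Counter

# Illustrative sketch of a block assignment for the 192 items (48 homonyms x 4
# gesture-speech combinations). A Latin-square rotation guarantees that (i) each
# block contains exactly 12 items per condition and (ii) each block contains only
# one gesture-speech combination per homonym.

CONDITIONS = ["DD", "DS", "SD", "SS"]   # gesture x speech (D = dominant, S = subordinate)

def assign_blocks(n_homonyms=48, seed=1):
    rng = random.Random(seed)
    offsets = [i % 4 for i in range(n_homonyms)]   # 12 homonyms per rotation offset
    rng.shuffle(offsets)                           # pseudorandom, but still balanced
    blocks = {b: [] for b in range(4)}
    for homonym, offset in enumerate(offsets):
        for b in range(4):
            blocks[b].append((homonym, CONDITIONS[(b + offset) % 4]))
    return blocks

blocks = assign_blocks()
print(Counter(condition for _, condition in blocks[0]))   # 12 items per condition
```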

Table 2. 

Stimulus Examples

Introduction (identical for all four conditions):
Alle waren von Sandra beeindruckt.
Everybody was impressed by Sandra.

Main clause and subclause with dominant target word (Spiel/game):
Sie beherrschte den Ball(amb), was sich im Spiel beim Aufschlag deutlich zeigte.
She controlled the ball(amb), which during the game at the serve clearly showed.

Main clause and subclause with subordinate target word (Tanz/dance):
Sie beherrschte den Ball(amb), was sich im Tanz mit dem Bräutigam deutlich zeigte.
She controlled the ball(amb), which during the dance with the bridegroom clearly showed.

Each sentence version was combined with both a dominant and a subordinate gesture fragment (video stills not reproduced here), yielding the four conditions DD, DS, SD, and SS. The marker (amb) indicates the ambiguous homonym; literal English translations are given below the German sentences. Cross-splicing was performed at the end of the main clause (i.e., in this case after the word "Ball").

In order to investigate the on-line integration of the gesture fragments with the homonym with as much temporal precision as possible, we also determined the earliest point in time at which the homonym is identified. In a gating paradigm, spoken word fragments of increasing duration (increment size: 20 msec) were presented to 20 participants who did not participate in any of the experiments reported here. The IP was determined as the gate where the participants started to give the correct response without any change in response thereafter. On average, participants were able to identify the homonyms after 260 msec. These homonym IPs were used as triggers for the ERPs that dealt with the direct integration of the gesture fragments with the homonyms (see also Figure 1B).
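
Because the epochs at the homonym are time-locked to the IP rather than to the word onset, the trigger latencies follow directly from the homonym onset within the video plus the IP latency obtained in the gating pretest; a minimal sketch is given below (the sampling rate conversion and variable names are assumptions made for illustration).

```python
SRATE = 500   # Hz, sampling rate of the EEG recordings

def ip_trigger_sample(homonym_onset_s, ip_latency_s, srate=SRATE):
    """Sample index (relative to video onset) to which the homonym epoch is time-locked."""
    return round((homonym_onset_s + ip_latency_s) * srate)

# Example with the mean values of Table 1: homonym onset 2.84 s, IP roughly 0.25 s later
print(ip_trigger_sample(2.84, 0.25))   # 1545
```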

Participants

Thirty-nine native German-speaking participants were paid for their participation and signed a written informed consent. Seven of them were excluded because of excessive artifacts. The remaining 32 participants (16 women; 20–28 years, mean = 23.8 years) were right-handed (mean laterality coefficient = 94.3; Oldfield, 1971), had normal or corrected-to-normal vision, had no known hearing deficits, and had not taken part in the pretest of the stimulus material.

Procedure

Participants were seated in a dimly lit, sound-proof booth facing a computer screen. They were instructed to attend to both the movements in the video and the accompanying speech. After each item, participants judged whether gesture and speech were compatible. Note that in order to perform this task, participants had to compare the meaning indicated by the homonym–gesture combination with the meaning expressed by the target word (see Table 2). A trial started with a fixation cross, which was presented for 2000 msec, followed by the video presentation. The videos were centered on a black background and subtended a visual angle of 10° horizontally and 8° vertically. Subsequently, a question mark prompted the participants to respond within 2000 msec, after which feedback was given for 1000 msec.

The experiment was divided into four blocks of approximately 9 min each. For all blocks, the presentation order of the items was varied in a pseudorandomized fashion. Block order and key assignment were counterbalanced across participants, resulting in a total of eight different experimental lists with 192 items each. One of the eight lists was randomly assigned to each participant. Thus, each experimental list was presented to four participants. An experimental session lasted approximately 45 min.

ERP Recording

The EEG was recorded from 59 Ag/AgCl electrodes (Electro-Cap International, Eaton, OH). It was amplified using a PORTI-32/MREFA amplifier (DC to 135 Hz) and digitized at 500 Hz. Electrode impedance was kept below 5 kΩ. The left mastoid served as a reference. Vertical and horizontal electrooculograms (EOG) were recorded for artifact rejection purposes.

Data Analysis

Participants' response accuracy was assessed by a repeated measures ANOVA with Gesture (D, S) and Target word (D, S) as within-subject factors. Artifact-contaminated EEG data were rejected off-line using an automatic procedure based on a 200-msec sliding window applied to the electrooculogram (±30 μV) and the EEG channels (±40 μV). All trials followed by incorrect responses were also rejected. On the basis of these criteria, approximately 33% of the data were excluded from further analysis. Single-subject averages were calculated for every condition at both the homonym and the target word position.
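
The sliding-window criterion can be read as a peak-to-peak test within every 200-msec window; the sketch below follows that reading (whether the ±30/±40 μV thresholds refer to peak-to-peak ranges or to absolute amplitudes is an assumption on our part, and the function names are illustrative).

```python
import numpy as np

# Illustrative sketch of the sliding-window artifact check described above.

SRATE = 500                   # Hz
WIN = int(0.200 * SRATE)      # 200-msec window = 100 samples

def has_artifact(trial, threshold_uv):
    """trial: (n_channels, n_samples) array in microvolts."""
    n_samples = trial.shape[1]
    for start in range(0, n_samples - WIN + 1):
        window = trial[:, start:start + WIN]
        # peak-to-peak range per channel within the 200-msec window
        if np.any(window.max(axis=1) - window.min(axis=1) > threshold_uv):
            return True
    return False

def reject_trial(eeg_trial, eog_trial):
    """Reject the trial if the EOG exceeds 30 uV or any EEG channel exceeds 40 uV."""
    return has_artifact(eog_trial, 30.0) or has_artifact(eeg_trial, 40.0)
```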

In the analyses at the homonym position, epochs were time-locked to the IP of the homonyms and lasted from 200 msec prior to the IP to 1000 msec afterward. A 200-msec prestimulus baseline was applied. Ten regions of interest (ROIs) were defined, namely, anterior left [AL] (AF7, F5, FC5), anterior center–left [ACL] (AF3, F3, FC3), anterior center [AC] (AFz, Fz, FCz), anterior center–right [ACR] (AF4, F4, FC4), anterior right [AR] (AF8, F6, FC6), posterior left [PL] (CP5, P5, PO7), posterior center–left [PCL] (CP3, P3, PO3), posterior center [PC] (CPz, Pz, POz), posterior center–right [PCR] (CP4, P4, PO4), and posterior right [PR] (CP6, P6, PO8). Based on visual inspection, a time window ranging from 100 to 400 msec was used to analyze the integration of gesture and homonym. A repeated measures ANOVA using Gesture (D, S), ROI (1, 2, 3, 4, 5), and Region (anterior, posterior) as within-subject factors was calculated. Only effects which involve the crucial factor gesture will be reported.
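
For concreteness, the dependent measures entering this ANOVA can be thought of as mean amplitudes per ROI in the 100–400 msec window of the IP-locked average; the following sketch computes such values (the electrode-to-ROI mapping is taken from the text, everything else is illustrative).

```python
import numpy as np

# Illustrative sketch: mean amplitude per ROI in the 100-400 msec window of an
# IP-locked epoch (epoch runs from -200 to +1000 msec, 500 Hz sampling).

ROIS = {
    "AL":  ["AF7", "F5", "FC5"],  "ACL": ["AF3", "F3", "FC3"],
    "AC":  ["AFz", "Fz", "FCz"],  "ACR": ["AF4", "F4", "FC4"],
    "AR":  ["AF8", "F6", "FC6"],  "PL":  ["CP5", "P5", "PO7"],
    "PCL": ["CP3", "P3", "PO3"],  "PC":  ["CPz", "Pz", "POz"],
    "PCR": ["CP4", "P4", "PO4"],  "PR":  ["CP6", "P6", "PO8"],
}
SRATE, EPOCH_START = 500, -0.200   # epoch starts 200 msec before the trigger

def roi_means(epoch, channel_names, t_min=0.100, t_max=0.400):
    """epoch: (n_channels, n_samples) baseline-corrected average in microvolts."""
    i0 = int((t_min - EPOCH_START) * SRATE)
    i1 = int((t_max - EPOCH_START) * SRATE)
    means = {}
    for roi, chans in ROIS.items():
        idx = [channel_names.index(ch) for ch in chans]
        means[roi] = epoch[idx, i0:i1].mean()   # average over electrodes and samples
    return means
```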

In the target word analysis, epochs were time-locked to the target word and lasted from 200 msec prior to the target onset to 1000 msec post target. A 200-msec prestimulus baseline was applied. The same 10 ROIs as in the previous analysis were used. The standard N400 time window ranging from 300 to 500 msec after target word onset was selected to analyze N400 effects. A repeated measures ANOVA using Gesture (D, S), Target word (D, S), ROI (1, 2, 3, 4, 5), and Region (anterior, posterior) as within-subject factors was performed. Only effects involving the crucial factors Gesture or Target word will be reported. In all statistical analyses, the Greenhouse and Geisser (1959) correction was applied where necessary. In such cases, the uncorrected degrees of freedom (df), the corrected p values, and the correction factor ɛ are reported. Prior to all statistical analyses, the data were filtered with a high-pass filter of 0.2 Hz. Additionally, a 10-Hz low-pass filter was used for presentation purposes only.
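
For completeness, the Greenhouse–Geisser correction in its standard formulation (not spelled out in the original) estimates a sphericity index ε from the covariance matrix of the orthonormalized within-subject contrasts and scales the degrees of freedom accordingly:

$$\hat\varepsilon = \frac{\bigl(\sum_i \lambda_i\bigr)^2}{(k-1)\sum_i \lambda_i^2}, \qquad df_1 = \hat\varepsilon\,(k-1), \qquad df_2 = \hat\varepsilon\,(k-1)(n-1),$$

where the λ_i are the eigenvalues of that covariance matrix, k is the number of levels of the within-subject factor, and n is the number of participants.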

Behavioral Data

The response accuracy was adequate across the different congruency conditions (congruent gesture speech pairings: 77%; incongruent gesture speech pairings: 73%). A significant main effect of congruency [paired t(31) = 2.30, p < .05] indicated that response accuracy was better for the congruent than the incongruent pairings. Congruent pairings also showed a faster RT [congruent: 450 msec; incongruent: 474 msec; paired t(31) = −2.04, p = .05]. Note that because the response occurred with some delay, the RT data should be treated with caution. Overall, the behavioral data suggest that when gesture fragment and speech are congruent, comprehension is enhanced compared to when they are incongruent.

ERP Data: Homonym

In Figure 2A, an early enhanced negativity can be observed when the homonym is preceded by subordinate gesture fragments as compared to dominant gesture fragments. Although this effect seems very early, this negativity is likely to be a member of the N400 family when considering its scalp distribution. The early onset can be explained by the use of the IP as the onset trigger of the averages. A repeated measures ANOVA revealed a main effect of Gesture [F(1, 31) = 17.61, p < .0002], indicating that the integration of a subordinate gesture fragment with the corresponding homonym is more effortful than the integration of a dominant gesture fragment.

Figure 2. 

ERPs as found in Experiment 1. The left panel (A) shows the ERPs time-locked to the identification point of the homonyms. The solid line represents the ERP when the homonym was preceded by a dominant gesture fragment. The dotted line represents the ERP when the homonym was preceded by a subordinate gesture fragment. The middle (B) and right (C) panels represent the ERPs time-locked to the onset of the target word. The solid line represents the cases in which gesture cue and subsequent target word were congruent. The dotted line represents those instances where gesture cue and target word were incongruent.

ERP Data: Target Word

As can be seen in Figure 2B and C, the ERPs show an increased negativity starting at about 300 msec for incongruent gesture–target word relations (DS, SD) in comparison to the congruent ones (DD, SS). Based on its latency and scalp distribution, the negativity was identified as an N400. The analysis of the 300–500 msec time window showed a significant two-way interaction of Gesture and Target word [F(1, 31) = 16.33, p < .0005], as well as a significant two-way interaction of Target word and Region [F(1, 31) = 4.79, p < .05].

On the basis of the gesture and target word interaction, step-down analyses were computed to assess the main effect of Gesture for both target word conditions. At dominant target words, the N400 was larger after a subordinate gesture compared to a dominant gesture [F(1, 31) = 10.14, p < .01]. In contrast, the N400 at subordinate target words was larger when being preceded by a dominant gesture [F(1, 31) = 12.16, p < .01]. Thus, incongruent gesture context elicited a larger N400 at both target word conditions. Yet, the effect was slightly larger for subordinate (Cohen's f2 = 0.38) than dominant targets (Cohen's f2 = 0.32).
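
One common route to Cohen's f² for a single-df contrast goes through partial η²; applying it to the reported F values yields estimates close to the reported effect sizes, although the text does not state that the values were computed this way:

$$\eta_p^2 = \frac{F}{F + df_{\text{error}}}, \qquad f^2 = \frac{\eta_p^2}{1 - \eta_p^2} = \frac{F}{df_{\text{error}}},$$

giving f² ≈ 12.16/31 ≈ 0.39 for subordinate and 10.14/31 ≈ 0.33 for dominant targets.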

Discussion

Experiment 1 clearly shows that gesture fragments presented up to the DP are able to influence speech–gesture integration and are able to disambiguate speech. Before discussing the ERP results, the results observed during the gating pretest merit attention.

Context-guided Gating

The DPs found in the context-guided gating might be considered as surprisingly early, given what McNeill has written about the preparation phase (see Introduction). Relative to the gesture phases as determined by our rating, most of the DPs actually occurred before stroke onset within the preparation phase of the gestures. It therefore seems that the preparation phase already suffices to select the appropriate meaning of a homonym. Although potentially intriguing, we have to be cautious in interpreting this result because there are several methodologically related explanations that may account for such an early effect.

First, it is possible that the way we determined our stroke onset (with the sound turned off) may have resulted in later stroke onsets than a rating conforming entirely to the suggestions of McNeill (1992, p. 375f) would have. McNeill suggested determining the gesture phases with the sound turned on. This methodological difference makes it difficult to relate our finding to McNeill's claims about the preparation phase.

Second, an inherent feature of a gating procedure is the highly repetitive nature of the task. Such repetitions may have induced processing strategies different from those used in real-time speech comprehension. It is also well known from studies on spoken word recognition (e.g., Salasoo & Pisoni, 1985; Grosjean, 1980) that additional contextual information enables participants to identify words earlier than without context. Because iconic gestures are seldom, if ever, produced without context (i.e., accompanying speech), the meaning of a gesture may be accessible rather early. One could speculate that the earliness of meaning comprehension may depend on the degree of contextual constraint. For instance, the participants in our study might have been able to decide upon the correct meaning more easily and faster, because they only had to choose between the two different meanings of a homonym; that is, gestures related to Kamm, which means either comb or crest, can be easily discriminated by hand shape. The preparation of the comb video contains the beginning of a one-handed gripping movement, while there is an ascending and expanding two-handed movement in the crest video. This is in line with Kita et al. (1998), who argue that hand-internal information such as hand shape or wrist location tends to emerge toward the end of the preparation phase; in other words, the preparation anticipates features of the stroke (McNeill, 1992). Thus, it is not that surprising that the preparation phase informs the recipient about what type of stroke might follow. It is, however, a novel finding that a recipient can actively interpret and use such preparatory motor activity in a forced-choice situation. It is important to note that this meaning anticipation only seems to be possible within the context of speech. An additional behavioral study, in which nine participants had to guess the meaning of the gestures clipped at the DP without any context, showed that participants were not able to identify the correct meaning of the gesture fragments.4 This result confirms that without any context, the meaning of a gesture is rather imprecise (Hadar & Pinchas-Zamir, 2004; Krauss et al., 1991). When a context is given, however, most of our gesture fragments are able to disambiguate by means of displaying solely prestroke information.

ERPs at the Homonym Position

Experiment 1 addressed the question of whether gestures clipped at the DP suffice as disambiguation cues in on-line speech comprehension using a congruency judgment task. The observed ERP effects at the homonym and at the target word position indicate that, indeed, these short gesture fragments can be used for disambiguation, even though there was an asynchrony of 970 msec between the end of a gesture fragment and the corresponding homonym IP. The ERPs elicited at the IP position of the homonym showed a direct influence of gesture type. Subordinate gestures elicited a more negative ERP compared to dominant gestures. Although its onset was very early (probably due to the use of the IP as trigger point), we would like to suggest, on the basis of its scalp distribution, that this component belongs to the N400 family. The data therefore suggest that the integration of the homonym with the subordinate gesture fragment is probably more effortful than the integration with the dominant gesture fragment. A more extended discussion of this effect will be given in the general discussion. For the moment, it is enough to know that the gesture fragments had a direct and differential impact during the processing of the homonym. The next question relates to whether this impact leads to a disambiguation of the homonym, influencing sentence processing further downstream. Such an effect would indicate that the gesture fragments, indeed, contained disambiguating information.

ERPs at the Target Position

The ERP data on the target word showed clearly that the gesture fragments were used to disambiguate the homonym. When a target word was incongruent with how the gesture fragments disambiguated the homonym, a larger N400 was elicited compared to when targets were congruent with the preceding gesture-driven disambiguation. Interestingly, both types of target words showed this effect, suggesting that the activation of both meanings of a homonym varied reliably as a function of the preceding gesture context. Note, however, that the N400 effect was larger for the subordinate target words, suggesting that subordinate targets are more sensitive to gesture influence than dominant targets. Such a finding may indicate that gesture fragments are a relatively weak disambiguating context (see Martin, Vu, Kellas, & Metcalf, 1999; Simpson, 1981).

It is important to note that in Experiment 1, participants were explicitly asked to compare the semantic content of a gesture fragment–homonym combination with the subsequent target word in order to solve the task. Thus, the task forced them to actively combine and integrate both sources of information. Due to the large distance between the end of the gesture fragment and the homonym IP (about 970 msec, see Table 1), it is, on the one hand, realistic to assume that gesture–speech integration in this particular case is an effortful memory-related process, because the gestural information has to be actively kept in working memory until the homonym is encountered. On the other hand, there are many suggestions in the literature that speech–gesture integration should occur more or less automatically and, therefore, should be effortless (Özyürek et al., 2007; Kelly et al., 2004). Automatic processes are characterized as being very fast, occurring without awareness and intention, and not tapping into limited-capacity resources (Schneider & Shiffrin, 1977; Shiffrin & Schneider, 1977; Posner & Snyder, 1975). If the integration of the gesture fragments with the homonyms is an automatic process, as suggested for complete gesture–speech integration by McNeill, Cassell, and McCullough (1994), it should be independent from experimental context and task.

To explore the underlying nature of the integration of a gesture fragment with a homonym, we used a more shallow memory task in Experiment 2 and examined whether participants would still use the gesture fragments as disambiguation cues even when the task did not require them to do so. As in Experiment 1, there was an asynchrony between gesture and speech, in that the gesture fragments ended about 970 msec before the IP of the homonyms. The rationale of the task was as follows: After a few trials without a task prompt, participants were asked whether they had seen a certain movement or heard a certain word in the previous video. No reference was made to the potential relationship between gesture and speech in the task instructions. Thus, participants had to pay attention to both gesture and speech, but were not required to actively combine both streams to solve the task. Holle and Gunter (2007), who used the same shallow task to investigate whether the integration of full gestures is automatic, found an N400 effect for both target word conditions. Based on that study, we hypothesized that the shortened gestures used in the present study should also modulate the N400 at the position of the target word under shallow task conditions. Additionally, we also expected an enhanced negativity for the integration of the subordinate as compared to dominant gesture fragments at the position of the homonym, as it was observed in Experiment 1.

EXPERIMENT 2: ON THE AUTOMATICITY OF THE INTEGRATION OF GESTURE FRAGMENTS INTO SPEECH

Methods

Participants

Thirty-four native German-speaking participants were paid for their participation and signed a written informed consent. Two of them were excluded because of excessive artifacts. The remaining 32 participants (16 women, age range = 21–29 years, mean = 25.6 years) were right-handed (mean laterality coefficient = 93.8), had normal or corrected-to-normal vision, and had no known hearing deficits. None had taken part in any of the previous experiments.

Stimuli

The same stimuli as in Experiment 1 were used.

Procedure

Presentation of the stimuli was identical to Experiment 1. Participants were, however, performing a different, more shallow task, and received the following instructions: “In this experiment, you will be seeing a number of short videos with sound. During these videos the speaker moves her arms. After some videos, you will be asked whether you have seen a certain movement or heard a certain word in the previous video.”

A visual prompt cue was presented after the offset of each video. After 87.5% of all videos, the prompt cue indicated the upcoming trial, that is, no response was required in these trials. After 6.25% of all videos, the prompt cue indicated to prepare for the movement task. A short silent video clip was presented as a probe. The probes consisted of soundless full-length gesture videos. After the offset of each probe video, a question mark prompted the participants to respond whether the probe contained the movement of the previous experimental item. Feedback was given if participants answered incorrectly or if they failed to respond within 2000 msec after the response cue. After the remaining 6.25% of the videos, the prompt cue informed the participants that the word task had to be carried out. Participants had to indicate whether a visually presented probe word had been part of the previous sentence. The probe words were selected from sentence-initial, -middle, and -final positions of the experimental sentence. Response and feedback were identical to the movement task trials.

ERP Recording and Data Analysis

The parameters for the recording, artifact rejection, and analysis were the same as in Experiment 1. The amount of behavioral data obtained in the present experiment was quite small (24 responses overall), with half of them originating from the movement task and the other half from the word task. Therefore, we decided not to use the behavioral data as a rejection criterion for the ERP analyses. Approximately 22% of all trials were rejected for the final analysis at both the homonym and the target word position.

Results

Behavioral Data

Overall, participants gave 87% correct answers, indicating that although the task in Experiment 2 was rather shallow, participants nonetheless paid attention to the stimulus material. Performance was less accurate in the movement task (79% correct) than in the word task (96% correct; Wilcoxon signed-rank test; z = −4.72; p < .001).

ERP Data: Homonym

Figure 3A shows no visible difference between subordinate and dominant gesture fragments at the homonym position. The corresponding ANOVA indicated no statistically significant differences (all Fs < .53, p > .49).

Figure 3. 

ERPs as found in Experiment 2.

ERP Data: Target Word

As can be seen in Figure 3B and C, there is barely a visible difference between the congruent and incongruent gesture cues for both target word conditions. The repeated measures ANOVA confirmed this impression by yielding no significant four-way interaction of Gesture, Target word, ROI, and Region [F(4, 124) = 0.32, p > .69; ɛ = .42], nor any other significant interaction involving the crucial factors of Gesture or Target word (all Fs < 1.28, all ps > .28); that is, there was no significant disambiguating influence of gesture on speech in the data.

Discussion

Experiment 2 dealt with the question of whether gesture fragments are integrated with speech when a shallow task was used. Both at the homonym as well as on the target words, no significant ERP effects were found. Thus, fragmented gestures do not influence the processing of coexpressive ambiguous speech when the task does not explicitly require an integration of gesture and speech. One way to interpret this finding is to suggest that the integration of gesture fragments is not an automatic process. Such a conclusion would, however, contradict the literature that indicates that gesture–speech integration is more or less automatic in nature (McNeill et al., 1994). It is therefore sensible to look more carefully at Experiment 2 and see whether a more parsimonious hypothesis can be formulated. Using the identical experimental setup, Holle and Gunter (2007) found a disambiguating effect of the original full-length gestures under shallow task conditions. One crucial difference between the gestures used by Holle and Gunter and the gesture fragments used here is whether the gesture overlaps with its corresponding co-speech unit (i.e., the homonym). Whereas complete gestures span over a larger amount of time and have a significant temporal overlap with the homonym, no such temporal overlap is present between the gesture fragments and the homonyms (see Figure 1). Remember that due to the clipping procedure, the gesture fragments end, on average, 970 msec prior to the homonym IP. Thus, at the time the gesture fragment ends, there is no coexpressive speech unit with which it can be integrated. When effortful processing is induced by the task (Experiment 1), this time lag does not seem to be problematic. If, however, the task does not explicitly require participants to actively combine gesture and speech as in Experiment 2, the time lag between gesture and speech may be problematic, probably because the minimal amount of information present in the gesture fragments gets lost over time. Thus, an alternative explanation is that automatic integration of gesture fragments does not occur when a gesture and its corresponding speech unit do not have a sufficient amount of temporal overlap. It is important to note that such an alternative explanation is also in accordance with McNeill (1992), who suggested that it is the temporal overlap (i.e., simultaneity between gesture and speech) that enables a rather automatic and immediate integration of gesture and speech. In his semantic synchrony rule, McNeill states that the same “idea unit” (p. 27) must occur simultaneously in both gesture and speech in order to allow proper integration. In other words, he suggests that if gestures and speech are synchronous, they should be integrated in a rather automatic way. So far, however, there has been little empirical work on the effects of gesture–speech synchronization in comprehension (but see Treffner, Peter, & Kleidon, 2008).

In Experiment 3, we explored the temporal overlap hypothesis as formulated above. We synchronized the gesture fragments with the homonyms in such a way that the DPs of the gestures were aligned with the IPs of the homonym. The rest of the experiment remained exactly the same as in Experiment 2. Thus, again, the shallow task was used. If, as suggested by the temporal overlap hypothesis, synchronization is playing a crucial role during speech–gesture integration, one would predict ERP effects similar to those observed in Experiment 1, both in the immediate context of the homonym as well as further downstream at the target word. In contrast, if the integration of gesture fragments and speech is not automatic, gesture should not modulate the ERPs at either sentence position, as it was observed in Experiment 2.

EXPERIMENT 3: THE ROLE OF TEMPORAL SYNCHRONY FOR THE INTEGRATION OF GESTURE AND SPEECH

Methods

Participants

Thirty-eight native German-speaking participants were paid for their participation and signed a written informed consent. Six of them were excluded because of excessive artifacts. The remaining 32 participants (15 women, age range = 19–30 years, mean = 25.5 years) were right-handed (mean laterality coefficient = 93.9), had normal or corrected-to-normal vision, had no known hearing deficits, and did not participate in any of the previous experiments.

Stimuli

The 96 gesture videos used in Experiments 1 and 2 constituted the basis for the stimuli of Experiment 3. In order to establish temporal synchrony between a gesture fragment and the corresponding speech unit, the DP of the gesture was temporally aligned with the IP of the homonym, that is, the point in time at which the homonym was clearly recognized by listeners. The IPs had been determined previously using a gating paradigm (see Experiment 1). Interestingly, the onset of the preparation phase of the synchronized gesture fragments still precedes the onset of the homonym by an average of 160 msec. Thus, the gesture onset still precedes the onset of the coexpressive speech unit, as is usually observed in natural conversation (McNeill, 1992; Morrel-Samuels & Krauss, 1992).
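
This figure is roughly consistent with the mean timing values in Table 1: the DP lies about 0.40 s after preparation onset and the IP about 0.25 s after homonym onset, so aligning the DP with the IP places the preparation onset at approximately

$$t_{\text{prep}} = t_{\text{IP}} - 0.40\,\mathrm{s} \approx (t_{\text{homonym}} + 0.25\,\mathrm{s}) - 0.40\,\mathrm{s} = t_{\text{homonym}} - 0.15\,\mathrm{s},$$

that is, about 150 msec before homonym onset on the basis of the grand means; the reported 160 msec presumably reflects averaging of the item-level alignments rather than of the means.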

Procedure, ERP Recording, and Data Analysis

The procedure, as well as the parameters for ERP recording, artifact rejection, and analysis, was identical to Experiment 2. Behavioral data were, as in Experiment 2, not used as a rejection criterion. Overall, 25% of the trials were excluded from further analysis. Based on visual inspection, separate repeated measures ANOVAs with Gesture (D, S), Target word (D, S), ROI (1, 2, 3, 4, 5), and Region (anterior, posterior) as within-subject factors were performed for the time windows of the homonym (100 to 400 msec) and the target word (300 to 500 msec). These time windows were identical to those used in Experiments 1 and 2. Additionally, we performed an ANOVA for an earlier time window at the position of the homonym (50 to 150 msec).

Results

Behavioral Data

Behavioral results were similar to those of Experiment 2. Participants responded correctly in 82% of all test trials. Again, the movement task was carried out less accurately (74% correct) than the word task (90% correct; Wilcoxon signed-rank test; z = −3.80; p < .001).

ERP Data: Homonym

As can be seen in Figure 4A, an increased negativity is elicited when the homonym is preceded by a subordinate gesture fragment as compared to a dominant one.5 As in Experiment 1, the earliness of the effect can be explained by the use of the homonym IPs as a trigger for the averages. A repeated measures ANOVA yielded a significant main effect of gesture [F(1, 31) = 6.09, p < .05], a two-way interaction of gesture and ROI [F(4, 124) = 8.09, p < .001], as well as a significant three-way interaction of gesture, region, and ROI [F(4, 124) = 4.07, p < .05; ɛ = .46]. These results suggest that the integration of a subordinate gesture fragment with a homonym is more difficult than the integration of a dominant one. Further step-down analyses revealed that the main effect of gesture was strongest over fronto-central sites [F(1, 31) = 10.46, p < .001].

Figure 4. 

ERPs as found in Experiment 3.

ERP Data: Target Word

No significant Gesture × Target word interaction was found in the early time window (all Fs < 1.05, all ps > .34). For the N400 time window, however, the ANOVA revealed a significant interaction of gesture and target word [F(1, 31) = 7.72, p < .01]. Based on this interaction, the simple main effects of Gesture were tested separately for the two Target word conditions. At subordinate target words, the N400 was significantly larger after a dominant gesture compared to a subordinate one [F(1, 31) = 6.63, p < .05]. No such effect of gesture–target word congruency was found at dominant target words [F(1, 31) = 0.33, p = .57]. Thus, when gesture fragments and speech are synchronized, the integration of both sources of information seems to be more automatic/less effortful, at least for the subordinate word meaning.

GENERAL DISCUSSION

The present set of experiments explored gesture–speech integration and the degree to which integration depends on the temporal synchrony between both domains. In order to enhance the precision in measuring temporal aspects of gesture–speech integration, we presented our participants with gesture fragments. To do so, we first assessed, using a context-guided gating task, the minimal amount of iconic gesture information needed to reliably disambiguate a homonym. In Experiment 1, where gesture fragment and homonym were asynchronous and an explicit task was used, the ERPs triggered by homonym IPs revealed a direct influence of gesture during the processing of the ambiguous word. Subordinate gesture fragments elicited a more negative deflection than dominant gesture fragments, indicating that the integration of subordinate gesture fragments with the homonym is more effortful. The ERP data at the target words showed that the gesture fragments were not only integrated with speech but were also used to disambiguate the homonym. When a target word was incongruent with the meaning of the preceding gesture–homonym combination, a larger N400 was elicited than when it was congruent.

In order to explore the nature of the gesture–speech interaction, we used a more shallow task in Experiment 2. If gesture–speech integration is an automatic process, this task manipulation should have resulted in ERP patterns similar to those found in Experiment 1, where participants were explicitly asked to judge the compatibility between gesture and speech. This was, however, not the case, as no significant ERP effects were observed in Experiment 2. One possible interpretation of this negative finding is that gesture–speech integration is not an automatic process when gesture fragments are used. The data of Experiment 3, however, suggest a different reason for the null finding of Experiment 2.

In Experiment 3, we used the shallow task again but also synchronized the gesture fragments and homonyms. This synchrony manipulation led to a robust negativity for the subordinate gestures at the homonym as well as to significant N400 effects at subordinate target words. The combined ERP data therefore suggest that when gesture and speech are in synchrony, their interaction is more or less automatic. When both domains are not in synchrony, effortful gesture-related memory processes are necessary to be able to combine the gesture fragment and speech context in such a way that the homonym is disambiguated correctly.

Gesture–Speech Synchrony

The present series of ERP experiments is, at least to the authors' knowledge, the first to investigate the effect of synchrony, or temporal overlap, on gesture–speech integration. In particular, our design allowed us both to analyze the direct integration of a gesture fragment at the homonym and to explore the indirect consequence of this integration at a target word presented a few words downstream in the sentence. Although widely recognized as one of the crucial factors for gesture–speech comprehension, the temporal aspects of iconic gestures have been understudied so far. In 1992, McNeill stressed the significance of timing for gesture–speech integration by putting forward his semantic synchrony rule: the same semantic content has to occur simultaneously in gesture and speech in order to allow proper integration. This means that if both sources of information are synchronously present, the information should be integrated, and this integration is suggested to occur in an automatic fashion (McNeill et al., 1994). Our results point in a similar direction. When the iconic gesture fragments were synchronized with their coexpressive speech unit (i.e., the homonym), an effect of immediate iconic gesture–speech integration was found at the position of the homonym (Experiment 3): Subordinate gesture fragments elicited a larger negativity than dominant gesture fragments. This finding is in line with Simpson and Krueger (1991), who showed for ambiguous words presented in a neutral context that the dominant meaning was activated more strongly than the subordinate meaning. Because of the mutual influence of homonym and gesture fragments, we cannot tell exactly how the integration of iconic gestures and speech works, although a plausible scenario is given below. We suggest that the ERP results at the homonym reflect an effect of word meaning frequency: A gesture fragment related to the more frequent dominant meaning of a homonym can be integrated more easily with the homonym than a gesture fragment related to the less frequent subordinate meaning. In contrast, no significant ERP effect was found at the homonym when a gesture fragment and the corresponding homonym were asynchronous and thus lacked temporal overlap (Experiment 2). We assume that only if gesture fragment and homonym are synchronous, and thus share a certain amount of temporal overlap, can they be automatically integrated into a single idea unit. If, however, there is no overlap and immediate automatic integration is not feasible, the information within the gesture fragment is lost over time and cannot be integrated with the homonym. Thus, timing seems to be an important factor for gesture–speech integration. Most likely, it is not absolute temporal synchrony that is crucial, but rather the temporal overlap between gesture and speech. Presumably, there is some kind of time window for iconic gesture–speech integration, as has been found for other types of multimodal integration (e.g., the McGurk effect; Van Wassenhove, Grant, & Poeppel, 2007). More experiments are clearly needed to substantiate this conjecture.

Iconic Gestures and Memory

The ERP data of Experiment 1 showed that when the task explicitly requires the integration of gesture and speech, participants can use effortful memory processes to overcome the integration problems caused by nonoverlapping gesture and speech channels. We know from the behavioral pretest that the gesture fragments themselves are rather meaningless without context. Therefore, they have to be aligned with the homonym to become a single meaningful unit (i.e., related to the dominant or subordinate meaning of the homonym). One possible way to achieve this is to store the gestural content actively in working memory until the corresponding homonym has been encountered. At that point in time, the stored gestural information and the corresponding speech unit are synchronous again and can be integrated. The type of memory involved in this process is less likely to be semantic in nature, because the behavioral data showed that the gesture fragments are virtually meaningless when presented without context. We therefore speculate that the semantic interpretation of the gesture fragments is delayed until the IP of the homonym has been processed. Until then, the gesture fragments are thought to be stored in a nonsemantic (e.g., movement-based) format in working memory.6

Context Strength and Gesture–Speech Integration

As argued above, gesture and speech are integrated at the homonym position, leading to a disambiguation of the homonym that can be measured further downstream in the sentence at the target word. In Experiment 1, this disambiguation was independent of word meaning frequency. In Experiment 3, however, only the subordinate targets showed a clear N400 effect. These results are somewhat puzzling and need an explanation. Previous research on homonym disambiguation has shown that weak contexts affect only the subordinate meaning but not the dominant meaning of a homonym (e.g., Holle & Gunter, 2007; Martin et al., 1999; Simpson, 1981). This is exactly what seems to have happened in Experiment 3, where the shallower task was used: When participants are not pushed by the task to integrate gesture and speech, the meaning of the fragments is treated as a weak context. In Experiment 1, in contrast, the task required the participants to actively combine the information from the two domains. Because the task demands increased the perceived importance of the semantic relationship between gesture and speech, they effectively turned the weak gesture context into a strong one.

What Happens at the Homonym?—A Potential Scenario for Gesture–Speech Integration

It should be clear by now that the processes taking place at the position of the homonym are quite complex. Let us use Experiment 1 as an example to sketch a plausible scenario. First, the gesture fragment is stored in a nonsemantic memory buffer. At this point in time, the fragment does not yet have a meaning, because there is no suitable context to which it can be related. A meaningful interpretation becomes possible only after the homonym has been processed, that is, at the IP. At this point, two things appear to happen in parallel. On the one hand, the homonym activates both the dominant and the subordinate word meaning (cf. Swinney, 1979). On the other hand, the gesture fragment acquires meaning by interacting with the corresponding meaning of the homonym. The meaning of the homonym supported by gesture is kept active, while the unsupported meaning decays or is actively inhibited, resulting in a disambiguation of the homonym. Gesture–speech integration in our case is thus a complex process of mutual disambiguation between both channels: It is not simply that gesture influences speech or the other way around; rather, both information channels influence, and depend on, each other at the same time in order to disambiguate. This process of mutual influence between gesture and speech in generating a single semantic unit is the key component of any gesture–speech integration. Identifying the precise timing and structure of this process is, without doubt, one of the major goals of future gesture research. The present results may provide a starting point for this quest.
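
To make this scenario more concrete, the toy simulation below implements one possible reading of it: at the IP both meanings of the homonym are active, the gesture fragment adds support to one of them, and the unsupported meaning decays. The update rule and all parameter values are invented for illustration only and make no claim about the actual cognitive dynamics.

# Toy illustration of mutual disambiguation: activation of the two homonym
# meanings after the IP, with gestural support for one meaning.
def disambiguate(gesture_supports="dominant", steps=10, decay=0.7, boost=0.3):
    # Initial activation at the IP: the dominant meaning starts higher,
    # reflecting word meaning frequency (cf. Simpson & Krueger, 1991).
    act = {"dominant": 1.0, "subordinate": 0.6}
    for _ in range(steps):
        for meaning in act:
            act[meaning] *= decay              # passive decay
            if meaning == gesture_supports:
                act[meaning] += boost          # support from the gesture fragment
    return act

print(disambiguate("subordinate"))
# The gesture-supported meaning remains active while the other meaning fades,
# i.e., the homonym ends up disambiguated toward the gesture-compatible meaning.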

Conclusion

The present set of experiments showed that when iconic gesture fragments and speech are in synchrony, their interaction is task-independent and, in this sense, automatic. In contrast, when an iconic gesture is not in synchrony with its related speech unit, more controlled, active memory processes are necessary to be able to combine the gesture fragment with its related speech unit in such a way that the homonym is disambiguated. Thus, when synchrony fails, memory can help.

Acknowledgments

We thank Ina Koch and Kristiane Werrmann for data acquisition, Sven Gutekunst for technical assistance, and Shirley-Ann Rüschemeyer and Jim Parkinson for editing the article from a native speaker perspective. Arjan van Eerden did the stimulus preparation and data collection of Experiment 3 for his bachelor thesis. We thank five anonymous reviewers for their helpful comments.

Reprint requests should be sent to Christian Obermeier, Max-Planck-Institute for Human Cognitive and Brain Sciences, PO Box 500 355, 04303 Leipzig, Germany, or via e-mail: obermeier@cbs.mpg.de.

Notes

1. Note that there are also so-called holds (prestroke hold, poststroke hold), whose main purpose is to ensure synchrony between the gesture and the corresponding speech unit. Because there were no such holds in the stimulus material used here, we do not describe these gesture phases in more detail.

2. Because both studies are based on the same stimulus material, the stimulus example in Table 2 for the present study also illustrates the sentences used by Holle and Gunter.

3. The N400 is a negative deflection of the event-related potential (ERP) in the electroencephalogram (EEG), peaking around 400 msec, and is hypothesized to reflect semantic processing (Hinojosa, Martin-Loeches, & Rubia, 2001). In particular, the N400 has been shown to be sensitive to the difficulty of integrating a stimulus (e.g., a word, but also a gesture) into a preceding context: The easier this integration, the smaller the N400 amplitude.

4. The participants were presented with a silent clip of each gesture fragment and had to write down freely what they thought the fragment meant. Only 7% of all gestures were identified correctly (i.e., the description was semantically related to the meaning of the target word or homonym), and only 7% of these correct responses included the actual target word (i.e., 0.5% overall). The responses were assessed by two raters (interrater reliability = .80).

5. In contrast to Experiments 1 and 2, clear early ERP components can be seen in both gesture conditions in Experiment 3. These components are due to the offset of the gesture fragment and relate to the physical properties of the stimulus (cf. Donchin, Ritter, & McCallum, 1978). For the present purpose, only the negative modulation of the ERP is of importance.

6. An alternative explanation would be that the decontextualized gesture fragments are immediately interpreted semantically and that this information is then stored in a semantic format in working memory. The results of the behavioral pretest would then be explained by assuming that the semantic interpretations elicited by the decontextualized gesture fragments are simply too underspecified to reliably elicit a verbal description. However, we believe that this is a less likely explanation of our data.

REFERENCES

Clark, H. H. (1973). The language-as-fixed-effect fallacy: A critique of language statistics in psychological research. Journal of Verbal Learning and Verbal Behavior, 12, 335–359.
Donchin, E., Ritter, W., & McCallum, W. C. (1978). Cognitive psychophysiology: The endogenous components of the ERPs. In E. Callaway, P. Teuting, & S. Koslow (Eds.), Event-related brain potentials in man (pp. 349–441). New York: Academic Press.
Emmorey, K., & Corina, D. (1990). Lexical recognition in sign language: Effects of phonetic structure and morphology. Perceptual and Motor Skills, 71, 1227–1252.
Friederici, A., Gunter, T. C., Hahne, A., & Mauth, K. (2004). The relative timing of syntactic and semantic processes in sentence comprehension. NeuroReport: For Rapid Communication of Neuroscience Research, 15, 165–169.
Gaskell, M. G., & Marslen-Wilson, W. D. (1997). Integrating form and meaning: A distributed model of speech perception. Language and Cognitive Processes, 12, 613–656.
Green, A., Straube, B., Weiss, S., Jansen, A., Willmes, K., Konrad, K., et al. (2009). Neural integration of iconic and unrelated coverbal gestures: A functional MRI study. Human Brain Mapping, 30, 3309–3324.
Greenhouse, S. W., & Geisser, S. (1959). On methods in the analysis of profile data. Psychometrika, 24, 95–112.
Grosjean, F. (1980). Spoken word recognition processes and the gating paradigm. Perception & Psychophysics, 28, 267–283.
Grosjean, F. (1983). How long is the sentence? Prediction and prosody in the on-line processing of language. Linguistics, 21, 501–529.
Grosjean, F. (1996). Gating. Language and Cognitive Processes, 11, 597–604.
Gunter, T. C., Wagner, S., & Friederici, A. D. (2003). Working memory and lexical ambiguity resolution as revealed by ERPs: A difficult case for activation theories. Journal of Cognitive Neuroscience, 15, 643–657.
Hadar, U., & Pinchas-Zamir, L. (2004). The semantic specificity of gesture: Implications for gesture classification and function. Journal of Language and Social Psychology, 23, 204–214.
Hinojosa, J. A., Martin-Loeches, M., & Rubia, F. J. (2001). Event-related potentials and semantics: An overview and an integrative proposal. Brain and Language, 78, 128–139.
Holle, H., & Gunter, T. C. (2007). The role of iconic gestures in speech disambiguation: ERP evidence. Journal of Cognitive Neuroscience, 19, 1175–1192.
Holle, H., Gunter, T. C., Rueschemeyer, S. A., Hennenlotter, A., & Iacoboni, M. (2008). Neural correlates of the processing of co-speech gestures. Neuroimage, 39, 2010–2024.
Holle, H., Obleser, J., Rueschemeyer, S. A., & Gunter, T. C. (2010). Integration of iconic gestures and speech in left superior temporal areas boosts speech comprehension under adverse listening conditions. Neuroimage, 49, 875–884.
Jansen, E., & Povel, D. J. (2004). The processing of chords in tonal melodic sequences. Journal of New Music Research, 33, 31–48.
Kelly, S. D., Kravitz, C., & Hopkins, M. (2004). Neural correlates of bimodal speech and gesture comprehension. Brain and Language, 89, 253–260.
Kelly, S. D., Ward, S., Creigh, P., & Bartolotti, J. (2007). An intentional stance modulates the integration of gesture and speech during comprehension. Brain and Language, 101, 222–233.
Kita, S., van Gijn, I., & van der Hulst, H. (1998). Movement phase in signs and co-speech gestures, and their transcriptions by human coders. Lecture Notes in Computer Science, 1371, 23–35.
Krauss, R. M., Morrel-Samuels, P., & Colasante, C. (1991). Do conversational hand gestures communicate? Journal of Personality and Social Psychology, 61, 743–754.
Lausberg, H., & Kita, S. (2003). The content of the message influences the hand choice in co-speech gestures and in gesturing without speaking. Brain and Language, 86, 57–69.
Lausberg, H., & Slöetjes, H. (2008). Gesture coding with the NGCS–ELAN system. In A. J. Spink, M. R. Ballintijn, N. D. Rogers, F. Grieco, L. W. S. Loijens, L. P. J. J. Noldus, et al. (Eds.), Proceedings of Measuring Behavior 2008, 6th International Conference on Methods and Techniques in Behavioral Research (pp. 176–177). Maastricht: Noldus.
Levelt, W. J., Richardson, G., & la Heij, W. (1985). Pointing and voicing in deictic expressions. Journal of Memory and Language, 24, 133–164.
Marslen-Wilson, W. (1987). Functional parallelism in spoken word-recognition. Cognition, 25, 71–102.
Martin, C., Vu, H., Kellas, G., & Metcalf, K. (1999). Strength of discourse context as a determinant of the subordinate bias effect. Quarterly Journal of Experimental Psychology: Section A, Human Experimental Psychology, 52, 813–839.
McNeill, D. (1992). Hand and mind: What gestures reveal about thought. Chicago: University of Chicago Press.
McNeill, D., Cassell, J., & McCullough, K.-E. (1994). Communicative effects of speech-mismatched gestures. Research on Language and Social Interaction, 27, 223–237.
Morrel-Samuels, P., & Krauss, R. M. (1992). Word familiarity predicts temporal asynchrony of hand gestures and speech. Journal of Experimental Psychology: Learning, Memory, and Cognition, 18, 615–622.
Oldfield, R. C. (1971). The assessment and analysis of handedness: The Edinburgh inventory. Neuropsychologia, 9, 97–113.
Özyürek, A., Willems, R. M., Kita, S., & Hagoort, P. (2007). On-line integration of semantic information from speech and gesture: Insights from event-related brain potentials. Journal of Cognitive Neuroscience, 19, 605–616.
Posner, M. I., & Snyder, C. R. R. (1975). Attention and cognitive control. In R. L. Solso (Ed.), Information processing and cognition: The Loyola symposium. Hillsdale, NJ: Erlbaum.
Salasoo, A., & Pisoni, B. P. (1985). Interaction of knowledge sources in spoken word identification. Journal of Memory and Language, 24, 210–231.
Schneider, W., & Shiffrin, R. M. (1977). Controlled and automatic human information-processing: 1. Detection, search, and attention. Psychological Review, 84, 1–66.
Shiffrin, R. M., & Schneider, W. (1977). Controlled and automatic human information-processing: 2. Perceptual learning, automatic attending, and a general theory. Psychological Review, 84, 127–190.
Simpson, G. B. (1981). Meaning dominance and semantic context in the processing of lexical ambiguity. Journal of Verbal Learning and Verbal Behavior, 20, 120–136.
Simpson, G. B., & Krueger, M. (1991). Selective access of homograph meanings in sentence context. Journal of Memory and Language, 30, 627–643.
Swinney, D. A. (1979). Lexical access during sentence comprehension: (Re)consideration of context effects. Journal of Verbal Learning and Verbal Behavior, 18, 645–659.
Treffner, P., Peter, M., & Kleidon, M. (2008). Gestures and phases: The dynamics of speech–hand communication. Ecological Psychology, 20, 32–64.
Van Wassenhove, V., Grant, K. W., & Poeppel, D. (2007). Temporal window of integration in auditory–visual speech perception. Neuropsychologia, 45, 598–607.
Willems, R. M., Özyürek, A., & Hagoort, P. (2007). When language meets action: The neural integration of gesture and speech. Cerebral Cortex, 17, 2322–2333.
Wu, Y. C., & Coulson, S. (2005). Meaningful gestures: Electrophysiological indices of iconic gesture comprehension. Psychophysiology, 42, 654–667.