This experiment investigates the integration of gesture and speech from a multisensory perspective. In a disambiguation paradigm, participants were presented with short videos of an actress uttering sentences like “She was impressed by the BALL, because the GAME/DANCE….” The ambiguous noun (BALL) was accompanied by an iconic gesture fragment containing information to disambiguate the noun toward its dominant or subordinate meaning. We used four different temporal alignments between noun and gesture fragment: the identification point (IP) of the noun was either prior to (+120 msec), synchronous with (0 msec), or lagging behind the end of the gesture fragment (−200 and −600 msec). ERPs time-locked to the IP of the noun showed significant differences for the integration of dominant and subordinate gesture fragments in the −200, 0, and +120 msec conditions. The outcome of this integration was revealed at the target words. These data suggest a time window for direct semantic gesture–speech integration ranging from at least −200 up to +120 msec. Although the −600 msec condition did not show any signs of direct integration at the homonym, significant disambiguation was found at the target word. An explorative analysis suggested that gesture information was directly integrated at the verb, indicating that there are multiple positions in a sentence where direct gesture–speech integration takes place. Ultimately, this implies that in natural communication, where a gesture lasts for some time, several aspects of that gesture will have their specific and possibly distinct impact on different positions in an utterance.
The integration of gesture and its accompanying speech is typically explored either from a language or a communication perspective. This is slightly surprising because we are talking about a multisensory signal. To broaden our scope on gesture–speech integration, the present experiment explores this issue from a multisensory perspective and investigates whether there exists a time window of gesture–speech integration where both signals are merged in an automatic fashion.
Gesture research has shown that the timing between gesture and the word it refers to (the so-called lexical affiliate) is an important issue. In his seminal book on gestures, McNeill (1992) suggests that temporal synchrony of gesture and speech is a crucial factor for a proper (in his view automatic) integration of both streams of information. The fact that synchrony and timing are relevant for gesture–speech integration becomes clear when we look at the structure of a gesture, which typically consists of three consecutive time phases, namely, preparation, stroke, and retraction. The stroke phase is considered by most researchers to represent the essential part of the gesture. This gesture phase contains the largest amount of (semantic) information, and speakers have the tendency to produce stroke and lexical affiliate together (Levelt, Richardson, & Laheij, 1985). To date, most research exploring the timing of gesture–speech integration used so-called beat gestures (e.g., Leonard & Cummins, 2011; Treffner, Peter, & Kleidon, 2008). These rapid flicks of the hand differ significantly from the iconic gestures used in the present paper in several ways. First, they are the most speech-dependent cospeech gestures (for details, see McNeill, 1992). Second, beat gestures only have two phases (usually a downward movement to the apex of the gesture followed by an upward movement). Third, they do not contain any semantic information. The sole purpose of a beat gesture is to add prominence to a specific word in an utterance (e.g., Holle et al., 2012; Krahmer & Swerts, 2007), resembling a more general multisensory mechanism by which a salient signal from one modality can draw attention to a specific item in another modality (Talsma, Senkowski, Soto-Faraco, & Woldorff, 2010). Nevertheless, beat gesture studies resulted in a number of important findings, which might also generalize to other types of gestures. 
Using an avatar, Treffner and colleagues (2008) varied the timing between a beat gesture and a corresponding utterance (“put the book there now”). They found that the relative timing between the apex of the beat and the utterance influenced which word was perceived as most prominent. When the apex was aligned with “book,” “book” was perceived as prominent. When, however, the apex was gradually shifted toward the following word “there,” the perceived prominence shifted accordingly, indicating that gesture–speech integration is time sensitive and can vary as a consequence of the temporal relation between a gesture and an utterance. Using a more natural setting, Leonard and Cummins (2011) showed that beat–speech integration, like other audiovisual integration, occurs in an asymmetric timing window, that is, a perceiver tolerates more audio lag (gesture apex precedes the corresponding speech unit) than visual lag (gesture apex lags behind the corresponding speech unit). In general, the findings on beat gesture suggest that gesture–speech integration is a time-dependent process like other types of multisensory integration indicating that the synchrony between gesture and speech plays an important role in their integration.
The synchrony of gesture and speech indeed plays an important role in their ability to integrate. This was recently demonstrated in two ERP experiments on the integration of iconic gestures and speech (gesture–word integration: Habets, Kita, Shao, Özyürek, & Hagoort, 2011; gesture–sentence integration: Obermeier, Holle, & Gunter, 2011). For instance, the study by Obermeier et al. (2011) showed that the integration of gesture fragments and speech occurs in an automatic fashion when they were presented simultaneously. When this synchrony was lost because the gesture fragment preceded speech by 1000 msec (i.e., there was no temporal overlap between gesture fragment and lexical affiliate), effortful processing, induced via task instruction, was necessary for a successful integration.
Gesture research so far indicates that gesture–speech integration depends on temporal synchrony or overlap. This is in line with a steadily increasing amount of literature documenting the significance of timing in other types of multisensory integration (for recent reviews, see Chen & Vroomen, 2013; Vatakis & Spence, 2010). Besides spatial coincidence (e.g., Zampini, Shore, & Spence, 2003a; Soto-Faraco, Lyons, Gazzaniga, Spence, & Kingstone, 2002; Spence & Driver, 1997) and semantic congruency (e.g., Chen & Spence, 2010; Spence, 2007), timing or temporal synchrony is one of the crucial factors that allow us to bind multisensory input into one percept. This multisensory integration is not a simple process, because it is influenced by both physical nonneuronal factors (speed of light vs. speed of sound) as well as neuronal factors like the different processing time for visual and auditory input (e.g., Pöppel, Schill, & von Steinbüchel, 1990). In general, multisensory integration is enhanced when the signals are in synchrony (e.g., Calvert, Spence, & Stein, 2004; De Gelder & Bertelson, 2003). There are, however, numerous studies using either synchrony judgments or temporal order judgments that show that the integration process does not depend on exact synchrony but tolerates considerable cross-modal temporal variability within a certain time range. This so-called time window of integration has been found for rather simple multisensory stimuli (e.g., Zampini, Shore, & Spence, 2003b; McGrath & Summerfield, 1985; Hirsh & Sherrick, 1961) as well as more naturalistic and complex stimuli (e.g., Vatakis & Spence, 2006a, 2006c, 2008; van Wassenhove, Grant, & Poeppel, 2007; Hollier & Rimell, 1998; Munhall, Gribble, Sacco, & Ward, 1996; Dixon & Spitz, 1980). For instance, using simple flash–beep combinations, Zampini et al. (2003b) showed that visual and auditory information were perceived as asynchronous when their SOA exceeded 60–70 msec. 
Research using more complex, ecologically valid, and naturalistic stimuli (e.g., audiovisual speech, music, object–action) suggests, however, that humans tolerate much higher asynchronies (for a review, see Vatakis & Spence, 2010). One of the first studies investigating the impact of timing on multisensory speech and nonspeech stimuli was conducted by Dixon and Spitz (1980). They presented participants with continuous videos containing either an audiovisual speech stream or an object–action event, which were gradually desynchronized. Participants had to respond as soon as they perceived asynchrony between visual and auditory stream. Dixon and Spitz (1980) found that for the audiovisual speech stream, participants tolerated an auditory lag of 258 msec as well as a visual lag of 131 msec. For the object–action videos, however, the participants only tolerated an auditory lag of 188 msec and a visual lag of 75 msec. These data suggest that participants are less sensitive to asynchrony in audiovisual speech, a result that has also been obtained by other studies (Hollier & Rimell, 1998; Miner & Caudell, 1998). Note, however, that for several reasons1 these early studies might have overestimated people's tolerance for asynchrony in audiovisual speech processing. In fact, more recent and controlled studies provide a much clearer and precise picture on the impact of temporal synchrony on complex2 audiovisual integration (Vatakis & Spence, 2006a, 2006c, 2008; van Wassenhove et al., 2007; Grant, van Wassenhove, & Poeppel, 2004; Massaro, Cohen, & Smeele, 1996; Munhall et al., 1996; Steinmetz, 1996; Rihs, 1995; McGrath & Summerfield, 1985). For instance, using the McGurk fusion paradigm (e.g., McGurk & MacDonald, 1976), van Wassenhove et al. (2007) identified a time window with a width of approximately 200 msec that allowed for the fusion to occur. 
Specifically, the time ranged from −170 msec visual lip movement lead to +30 msec visual lip movement lag relative to syllable onset.
Overall, studies investigating more complex multisensory integration show some consistencies, but also some variance. It is consistently found that the time window width is in the range of several hundred milliseconds and that this window is usually asymmetric, tolerating larger visual than auditory leads. Variance in the actual width of the time window has been addressed by recent studies (Spence, 2010; Roseboom, Nishida, & Arnold, 2009). So far, it has been shown that the time window of integration for complex stimuli can be influenced by stimulus type (less sensitivity to audiovisual music events than speech or object–action events, Vatakis & Spence, 2007; narrower and more asymmetric time window for speech than nonspeech stimuli, Maier, Di Luca, & Noppeney, 2010), stimulus complexity and properties (e.g., higher temporal sensitivity to less complex stimuli, Vatakis & Spence, 2006a, 2006c; familiarity, Petrini, Russell, & Pollick, 2009), the degree of unity between visual and auditory stream (more temporal sensitivity for mismatching than matching audiovisual information, Vatakis & Spence, 2007), the presentation medium (Vatakis & Spence, 2006b), the task (Stevenson & Wallace, 2013; van Atteveldt, Formisano, Goebel, & Blomert, 2007), and training (Stevenson, Wilson, Powers, & Wallace, 2013) and is also prone to interindividual differences (Stevenson, Zemtsov, & Wallace, 2012). Thus, it is not easy to predict how a potential time window for gesture–speech integration might look. On the basis of the complexity of semantic gesture–speech integration, we assume that the time window is likely to be relatively large and comparable to, or even larger than, for instance, the time window found for the McGurk effect by van Wassenhove et al. (2007).
The integration of lip movements and speech sounds based on physical properties is likely less complex than the semantic integration of gesture and speech, which operates on a higher cognitive level (for different types of multisensory integration, see Spence & Deroy, 2013).
Before exploring a hypothetical integration time window of gesture and speech, two things needed to be addressed. On the one hand, we needed to have a measure with a high temporal resolution, and on the other hand, the relevant visual and acoustic information needed to be restricted in their length to be able to shift both signals without getting any problematic overlap with other parts of the stimulus.
Because of its excellent timing properties, we used ERPs of the EEG as a measurement technique. Previous electrophysiological research has shown that difficult gesture–speech integration leads to an enhanced negativity around 400 msec after stimulus onset. This N400 component is elicited both in violation (Kelly, Healey, Özyürek, & Holler, 2012; Habets et al., 2011; Ibanez et al., 2011; Wu & Coulson, 2005, 2007, 2010, 2011; Ibanez et al., 2010; Kelly, Creigh, & Bartolotti, 2010; Cornejo et al., 2009; Kelly, Ward, Creigh, & Bartolotti, 2007; Özyürek, Willems, Kita, & Hagoort, 2007; Sheehan, Namy, & Mills, 2007; Kelly, Kravitz, & Hopkins, 2004) as well as in disambiguation paradigms (Obermeier, Dolk, & Gunter, 2012; Obermeier et al., 2011; Holle & Gunter, 2007) and is suggested to reflect semantic processing (for a recent review and detailed discussion, see Kutas & Federmeier, 2011). The N400 effects found in violation paradigms show that incongruent gestures negatively affect semantic processing (cf. Kelly et al., 2004). N400 effects in disambiguation paradigms show that the integration of an ambiguous word with a gesture related to the subordinate meaning of this homonym is more effortful than the integration of a gesture related to the dominant meaning (Obermeier et al., 2011).
Clearly, gesture–speech combinations as found “in everyday life” are not very well suited to investigate the time window issue. Typically, a full-length iconic gesture overlaps with large parts of the sentence. Even when the stroke phase roughly coincides with the lexical affiliate, the preparation and retraction phases certainly do not. Such undesirable overlap makes it impossible to manipulate the synchrony between an iconic gesture and speech without simultaneously changing the amount of gesture information available at a given point in the sentence. To circumvent this problem, we had to reduce the length of the gestures without impacting their semantic content. Using a context-guided gating procedure, we created gesture fragments containing enough information to disambiguate a homonym (see Obermeier et al., 2011). Note that these fragments themselves have no clear-cut meaning, as was shown in a rating study. The fragments only disambiguate when the context, in this case the ambiguous lexical affiliate, is known. They were short enough to manipulate the synchrony between gesture and speech in a reasonable way without the possible confound of multisensory semantic priming when the gesture fragment precedes speech (see General Discussion).
From a theoretical perspective, it is important to note that there is a clear difference in what integration refers to when comparing gesture–speech combinations and more classical multisensory situations. In many multisensory integration studies, the two signals are typically tightly and directly coupled. For instance, in the McGurk effect, there is no doubt that the lip movements and speech come from the same source and are physically coupled into a new percept (e.g., McGurk & MacDonald, 1976). In such cases, the time window refers to whether signals in both modalities are still perceived as coming from the same effector, and the merging within the time window refers to the physical onset properties of the signals. However, as recently discussed by Spence and Deroy (2013), but also by Vatakis and Spence (2010), in addition to perceptual multisensory integration, there might also be multiple other levels of multisensory integration, including integration on a semantic level. This notion can help us theorize about what could happen in a potential time window of gesture–speech integration. Although in the case of iconic gesture–speech combinations the multisensory signals are clearly produced by one person, their integration is based on semantics because the two channels are not directly coupled through a common effector. There are virtually unlimited ways of communicating a particular concept by means of gesture combined with speech. Indeed, models of gesture production suggest that both channels develop from the same concept (cf. de Ruiter, 2000; Kita, 2000), and the only thing they have in common is their conceptual relationship, that is, their semantics. This means that we have to adapt our idea of what happens in the time window of integration from the merging of physical properties in the classical sense toward a more semantically related function when exploring it in the domain of gesture and speech.
Because semantic information plays such an important role within the time window of gesture–speech integration, the synchrony manipulation used to reveal this window cannot be defined in terms of the SOA, as is done in classical multisensory experiments. Instead, we need to find the point in time where the combination of gesture fragment and uttered lexical affiliate has an established meaning. For the lexical affiliate (in this case the homonym), this position is the so-called identification point (IP) of that word. The IP of a word is the point in time when the word becomes unique and is typically found using a gating procedure (Grosjean, 1996). Because the gesture fragments constitute the minimal gesture information available to disambiguate the homonym, the end of the fragment can be used as the synchronization point for the gesture information.
Before elaborating on the present experiment and hypotheses, we first describe what to expect in such an experimental situation and discuss the study of Obermeier et al. (2011), which used the same gesture fragments as the present experiment. In Obermeier et al. (2011), participants watched an actress perform gestures while she uttered sentences that contained a homonym followed by a target word that was related either to the more frequent dominant or the less frequent subordinate meaning. The gestures disambiguated the homonym. To investigate immediate gesture–speech integration, the gestures had to be radically shortened. Using a context-guided gating procedure, gesture fragments were obtained that could disambiguate the homonym (for details, see the Methods section). When gesture fragments and homonyms were presented asynchronously (−1000 msec visual lead; see Figure 1), an effect of immediate gesture–speech integration at the homonym, as well as a more global disambiguating effect at the target word later on, was found only when an explicit, controlled task was used (Experiment 1), and not when the task did not require explicit gesture–speech integration (the shallow task used in Experiment 2). When gesture fragment and homonym were synchronous, clear effects of immediate gesture–speech integration were found, even under shallow task conditions (Experiment 3). These results suggest that synchronous gesture–speech information can be integrated in a relatively automatic way, whereas asynchronous gesture–speech information can only be integrated in an effortful way.
Thus, although the Obermeier et al. (2011) study highlights the significance of synchrony in gesture–speech integration, it tells us little about a potential time window of integration for gesture and speech. In the present experiment, we address this open issue by using four different temporal alignments between the homonym IP and the offset of the gesture fragment (see Figure 1). The IP of the homonym was either prior to (+120 msec), synchronous with (0 msec), or lagging behind the end of the gesture fragment (−600 msec/−200 msec). The different timings were chosen carefully on the basis of the multisensory integration literature and the constraints of the stimulus material. The synchronous condition was included to replicate our previous findings (Obermeier et al., 2011) and served as a control. In this condition, the gesture fragment overlaps considerably with the corresponding speech unit, that is, the homonym, but still has a visual lead of 160 msec. The −200 msec condition tests the limits of a potential time window for gesture–speech integration with regard to visual lead. Note that the gesture fragment (including its most salient point, its offset) still overlaps with the homonym and should be associated with it (based on Wilbiks & Dyson, 2013). The −600 msec condition does not temporally overlap with the homonym, but it does overlap with the preceding verb. It is an empirical question whether the fragment in this condition is nevertheless locally integrated with the homonym, and if not, whether it is integrated with the verb. The +120 msec condition differs from the other three in that the homonym IP leads the gesture fragment disambiguation point (DP). The onsets of homonym and gesture fragment are almost simultaneous, which is not very common in natural communication.
The interesting question is, on the one hand, whether there is still local semantic gesture–speech integration with the homonym, as an auditory lag of +120 msec is quite challenging, and, on the other hand, whether this integration differs from the 0 msec condition in terms of effect size or distribution, because at +120 msec the homonym meaning is known prior to the most salient and disambiguating part of the gesture fragment. Thus, in contrast to the other timing conditions, we possibly have a case of gestures being disambiguated by speech rather than speech being disambiguated by means of gestures.
Summing up, we hypothesized that a time window of gesture–speech integration exists and that we would find significant direct, local integration effects of gesture and homonym in the −200, 0, and +120 msec conditions. These conditions are in the range where complex multisensory integration takes place (cf. van Wassenhove et al., 2007). We also expected to find global disambiguation effects at the target words, which were presented further downstream in the sentence. Specifically, we expected these global effects to occur only at subordinate target words, as the gesture fragments used constitute a weak disambiguation context (for details, see Obermeier et al., 2011). Whether local or global gesture–speech integration is still present when the asynchrony goes up to −600 msec is an empirical question.
In the present experiment, we analyzed the direct integration of gesture and speech information by measuring the ERPs time-locked to the IP of the homonym and explored whether this integration depends on the temporal alignment of gesture and speech. Additionally, we scrutinized the global disambiguating effect of the gesture fragments on the sentence level by analyzing the ERPs to the sentence-disambiguating target words later in the speech signal.
Forty-one native German-speaking individuals were paid for their participation and gave written informed consent in accordance with the guidelines of the Ethics Committee of the University of Leipzig. Nine of them were excluded from further analyses because of excessive eye artifacts (four participants), technical problems (four participants), or not showing up for the second measurement (one participant). The remaining 32 participants (16 women; 22–29 years, mean = 25.4 years) were right-handed (mean laterality coefficient = 95.4; Oldfield, 1971), had normal or corrected-to-normal vision, had no known hearing deficits, and had not taken part in any previous experiments using the identical stimulus material.
The stimuli used in this study were gesture fragments identical to those used in previous studies in our lab (Obermeier et al., 2011, 2012). Before describing the actual timing manipulation, however, we briefly summarize how we obtained the gesture fragments and why they are especially useful if one is interested in the temporal relation between gesture and speech.
The Original Stimulus Material
The basic gesture–speech material for this study derives from a series of experiments by Holle and Gunter (2007). They used 48 German homonyms each with a distinct dominant and subordinate meaning (for a more specific description of how dominant and subordinate meanings were determined, see Gunter, Wagner, & Friederici, 2003). For every homonym, two 2-sentence utterances were constructed, one containing a dominant target word, the other one containing a subordinate target word. Each utterance started with a short introductory sentence introducing a person, followed by a longer complex sentence describing an action performed by that person. The complex sentence consisted of a main clause comprising the homonym and a subsequent subclause, which included the target word. The sentences for the dominant and subordinate versions were identical prior to the target word. To obtain the gesture video material, we recorded a professional actress whose face was covered with a black nylon stocking. While uttering the sentences, she simultaneously performed a gesture that supported the sentence meaning. For more details about the preparation and recording of the original stimuli, please see Holle and Gunter (2007).
A context-guided gating procedure (for details, see Obermeier et al., 2011; for more details about gating in general, see Grosjean, 1996) was used to determine the DP of every gesture. For this purpose, we presented 40 native speakers of German with the following experimental procedure: A typical gating trial started with a visual presentation of the homonym for 500 msec (e.g., ball), followed by presentation of the gated gesture video. After each video, participants had to determine whether the homonym referred to the dominant meaning or the subordinate meaning on the basis of the gesture information they had seen. To this end, three response alternatives were presented on the screen: (a) dominant meaning (e.g., game), (b) subordinate meaning (e.g., dance), and (c) “next frame.” Participants were instructed to choose the latter alternative as long as they had no indication of the meaning targeted by the gesture. The increment size for each gate was one video frame corresponding to 40 msec, that is, each new gate contained 40 msec more of gesture information. The gating for each gesture started with the onset of the preparation phase and ended either with the stroke offset of the gesture or when a participant had given 10 consecutive correct responses. As very short video sequences are rather difficult to display and recognize, each gated gesture segment also contained 500 msec of still frame prior to gesture preparation onset. Therefore, the shortest gating sequence for each gesture corresponded to a stimulus duration of 540 msec (500 msec prior to the preparation onset + 40 msec for the first frame [gate] of the gesture). All gestures were pseudorandomly distributed between two lists, each containing 24 subordinate and 24 dominant gestures. For each homonym, only one of the two possible gestures was included in each list. Participants were only presented with one of the two lists.
The relevant dependent measure of the gating was the DP, which was defined as the amount of gestural information necessary to identify a specific gesture as either being related to the dominant or the subordinate meaning of a homonym without a change in response thereafter. The mean DPs of the different gestures ranged from 2.22 to 19.63 frames (M = 9.88, SD = 3.6; one frame represents 40 msec) relative to preparation onset of the corresponding gesture, that is, the participants in the gating pretest needed to see approximately 400 msec of a particular gesture to disambiguate the related homonym.
Because we were interested in identifying the temporal window for the integration of the gesture fragments with the corresponding homonyms with as much temporal precision as possible, we also determined the IP of each homonym, which corresponds to the earliest point in time at which the homonym is identified. Again using a gating paradigm, spoken word fragments of increasing duration (increment size: 20 msec) were presented to 20 participants. Similar to the DP, the IP was defined as the gate where the participants correctly identified the homonym without any change in response thereafter. On average, the IP of the homonyms was at 260 msec. Importantly, these homonym IPs were used as triggers for the integration of the gesture fragments with the homonyms (see also Figure 1).
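The decision rule shared by both gating pretests — the first gate from which a participant's response is correct and never changes again — can be sketched as follows. This is an illustrative sketch, not the authors' analysis code; the function name and the response coding are assumptions.

```python
# Illustrative sketch of the gating decision rule (not the authors' code).
# Each gate adds one increment of information: 40 msec per video frame for
# the gesture DP, 20 msec of speech for the homonym IP.

def stable_point(responses, target, increment_ms):
    """Return the time (in msec) of the first gate whose response is the
    target meaning and after which the response never changes again;
    return None if the response never stabilizes on the target."""
    for gate in range(len(responses)):
        if all(r == target for r in responses[gate:]):
            return (gate + 1) * increment_ms
    return None

# Example: a participant asks for more frames, wavers once, and then
# settles on the dominant meaning from the fifth gate onward.
responses = ["next", "next", "dominant", "next",
             "dominant", "dominant", "dominant", "dominant"]
print(stable_point(responses, "dominant", 40))  # frame-based gating -> 200
```

The "without a change in response thereafter" criterion is what makes the DP and IP conservative estimates: a single late reversal pushes the stable point further out.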
Note that an additional pretest showed that the meaning of the gesture fragments is imprecise and unclear without the context of the homonym (for details, see Obermeier et al., 2011). In this pretest, participants were presented with silent video clips of the gesture fragments and had to write down a free proposal of the meaning of the observed gesture fragment. Only 7% of all gestures were identified correctly without speech context (i.e., semantically related to the corresponding dominant or subordinate meaning of the homonym; for a similar result, see Habets et al., 2011). Only 7% of these correct responses actually contained the corresponding target word (i.e., 0.5% of all responses). The responses were assessed by two independent raters (interrater reliability = .80). We therefore conclude that the point in time at which an addressee knows the meaning of the gesture fragments and can disambiguate the homonym based on gestural information is the homonym IP. The big advantage of the present stimulus material is that we know exactly when the semantic integration of gesture fragment and speech should happen and can therefore identify the temporal window of semantic integration of gesture and speech with great precision.
On the basis of the results of the gating studies, the gesture stimuli for this study were constructed as follows. To investigate a potential temporal window of gesture–speech integration, we chose four different levels of temporal relation between gesture DP and homonym IP based on the multisensory integration literature (e.g., van Wassenhove et al., 2007; Vatakis & Spence, 2006a, 2006c; Grant et al., 2004; Dixon & Spitz, 1980) and our previous results (Obermeier et al., 2011): −600 msec (the homonym IP lags behind the gesture DP by 600 msec), −200 msec, 0 msec, and +120 msec (the gesture DP lags behind the homonym IP by 120 msec; for information about the stimulus properties, see Table 1). The reason for specifically choosing these four temporal relations was described in the Introduction.
| Gesture | Speech | DP −600 msec | DP −200 msec | DP 0 msec | DP +120 msec | Homonym Onset | Homonym IP | Target Word Onset |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| D | D | 2.49 (0.41) | 2.89 (0.41) | 3.09 (0.41) | 3.21 (0.41) | 2.84 (0.40) | 3.09 (0.41) | 3.78 (0.38) |
| D | S | 2.49 (0.41) | 2.89 (0.41) | 3.09 (0.41) | 3.21 (0.41) | 2.84 (0.40) | 3.09 (0.41) | 3.80 (0.38) |
| S | D | 2.49 (0.41) | 2.89 (0.41) | 3.09 (0.41) | 3.21 (0.41) | 2.84 (0.40) | 3.09 (0.41) | 3.78 (0.38) |
| S | S | 2.49 (0.41) | 2.89 (0.41) | 3.09 (0.41) | 3.21 (0.41) | 2.84 (0.40) | 3.09 (0.41) | 3.80 (0.38) |
| Mean | | 2.49 (0.41) | 2.89 (0.41) | 3.09 (0.41) | 3.21 (0.41) | 2.84 (0.40) | 3.09 (0.41) | 3.79 (0.38) |
Mean onset values are in seconds relative to the onset of the introductory sentence (SD in parentheses). D = dominant; S = subordinate.
For the stimulus construction, the original gesture and speech streams were first separated. Next, full-length gesture streams were replaced with gesture streams cut at the DP. To establish the four different levels of temporal synchrony between a gesture fragment and the corresponding speech unit (−600 msec, −200 msec, 0 msec, +120 msec), the DP of a gesture was shifted with regard to the corresponding homonym IP (see Figure 1). These manipulations resulted in the gesture fragments ending prior to the end of the speech stream. Therefore, the duration of the gesture stream had to be adjusted to the duration of the speech stream by appending stills of the corresponding blank video background. This procedure created the illusion of a speaker vanishing from the screen while the speech continued for a short amount of time. For each of the four timing conditions, speech streams were then recombined with both the dominant and the subordinate gesture streams, resulting in a 4 × 2 × 2 design with Timing (−600 msec, −200 msec, 0 msec, +120 msec), Gesture (Dominant vs. Subordinate), and Target Word (Dominant vs. Subordinate) as within-subject factors (see Table 2). Each condition consisted of 48 different items, leading to a complete experimental set of 768 stimuli. The stimulus material was pseudorandomly distributed in such a way that all factors were balanced across the different experimental lists, leading to a total set of 16 different lists for each of the two sessions of the experiment.
The introductory sentence was identical for all four conditions. The first column indicates the meaning conveyed by the gesture fragment in combination with the homonym: dominant (D) or subordinate (S). The second column indicates the meaning conveyed by the target word in the sentence continuation (D or S). Each gesture–homonym combination was combined with both target words, yielding a 2 × 2 design. Each of these four combinations was presented in four different timing conditions (−600, −200, 0, and +120 msec). The literal translation is in italics. Cross-splicing was performed at the end of the main clause (i.e., in this case after the word “ball”).
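The factorial structure just described can be verified with a short sketch (hypothetical Python, not the authors' stimulus scripts; all names are illustrative):

```python
# Enumerating the 4 (Timing) x 2 (Gesture) x 2 (Target Word) within-subject
# design described above. Illustrative only; variable names are assumptions.
from itertools import product

timings = [-600, -200, 0, 120]             # gesture DP relative to homonym IP (msec)
gestures = ["dominant", "subordinate"]      # meaning cued by the gesture fragment
target_words = ["dominant", "subordinate"]  # meaning selected by the continuation

conditions = list(product(timings, gestures, target_words))
items_per_condition = 48
total_stimuli = len(conditions) * items_per_condition

assert len(conditions) == 16
assert total_stimuli == 768  # the complete experimental set reported above
```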
Participants were seated in a dimly lit, soundproof booth, facing a computer screen. They were instructed to attend to the hand movements in the video and the accompanying speech. Specifically, the task instructions read as follows: “In this experiment, you will see a number of short videos with sound. During these videos, the speaker moves her arms. After some of the videos, you will be asked whether you saw a certain movement or heard a certain word in the previous video.” After each video, a visual prompt cue was presented. In 87.5% of all trials, the prompt cue informed participants that no further response was required. In 6.25% of all trials, the prompt cue indicated that the word task would follow. In the remaining 6.25% of the trials, the prompt signified that the movement task would follow.
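As a concrete illustration of these prompt-cue proportions, here is a hypothetical scheduling sketch (Python; the actual randomization procedure is not described at this level of detail and may have differed):

```python
# Per 16 trials: 14 "no response required" cues (87.5%), 1 word-task cue and
# 1 movement-task cue (6.25% each). Hypothetical scheduling code.
import random

def make_prompt_schedule(n_trials, seed=0):
    assert n_trials % 16 == 0
    block = ["none"] * 14 + ["word", "movement"]
    rng = random.Random(seed)
    schedule = []
    for _ in range(n_trials // 16):
        batch = block[:]
        rng.shuffle(batch)  # randomize cue order within each batch of 16
        schedule.extend(batch)
    return schedule

schedule = make_prompt_schedule(768)
assert schedule.count("none") / len(schedule) == 0.875
assert schedule.count("word") / len(schedule) == 0.0625
assert schedule.count("movement") / len(schedule) == 0.0625
```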
Each trial started with a fixation cross for 2000 msec, which was followed by the gesture video. All videos were centered on a black background and extended for 8° visual angle vertically and 10° horizontally. In the case of a task trial, a question mark prompted the participants to respond via button press within 2000 msec. Afterwards, feedback was given for 1000 msec. In the case of a no-task trial, the message “next video” was presented for 2000 msec.
For every participant, there were two experimental sessions separated by 7–14 days. Each of these two sessions was separated into eight blocks of approximately 9 min each. Block order, as well as the presentation order of the items within each block, was varied in a pseudorandomized fashion according to one of the previously created 16 different experimental lists. Thus, each experimental list was presented to two participants. Each experimental session lasted approximately 90 min. The response key assignment was counterbalanced across participants.
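The list/participant bookkeeping can be sketched as follows (hypothetical Python; the exact counterbalancing scheme beyond what is stated above is an assumption):

```python
# 16 experimental lists, each presented to 2 participants (32 in total), with
# response-key assignment counterbalanced across participants. Illustrative.
from collections import Counter

n_lists = 16
participants = list(range(32))

assignment = {
    p: {
        "list": p % n_lists,                    # each list used by 2 participants
        "keys": "A" if p % 2 == 0 else "B",     # counterbalanced key mapping
    }
    for p in participants
}

list_counts = Counter(a["list"] for a in assignment.values())
assert all(count == 2 for count in list_counts.values())
```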
The EEG was recorded from 59 Ag/AgCl electrodes (ElectroCap International, Eaton, OH). It was amplified using a PORTI-32/MREFA amplifier (DC to 135 Hz) and digitized at 500 Hz. Electrode impedances were kept below 5 kΩ. The left mastoid served as the reference channel. Vertical and horizontal EOG was measured for artifact rejection purposes.
EEG data were screened offline with an automatic artifact rejection procedure using a 200-msec sliding window on the EOG (±30 µV) and EEG channels (±40 µV). Overall, 21% of the trials were excluded from further analysis on the basis of these criteria. As the amount of behavioral data obtained in the present experiment was small (six responses per condition: three from the movement task and three from the word task), we decided against using the behavioral data as a rejection criterion for the ERP analyses.
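The rejection criterion can be sketched as follows (an assumed implementation in Python; the thresholds are taken to be in microvolts, and the authors' software may interpret the ±30/±40 bounds differently, e.g., as absolute rather than peak-to-peak amplitude):

```python
# Hedged sketch of sliding-window artifact rejection: an epoch is marked for
# rejection if the peak-to-peak amplitude within any 200-msec window exceeds
# the channel threshold. Assumed implementation, not the authors' code.
import numpy as np

def mark_artifact(epoch_uv, sfreq=500, win_ms=200, threshold_uv=40.0):
    """epoch_uv: (n_channels, n_samples) array in microvolts."""
    win = int(sfreq * win_ms / 1000)  # 100 samples at 500 Hz
    for start in range(0, epoch_uv.shape[1] - win + 1):
        seg = epoch_uv[:, start:start + win]
        ptp = seg.max(axis=1) - seg.min(axis=1)  # peak-to-peak per channel
        if np.any(ptp > threshold_uv):
            return True  # reject this epoch
    return False

rng = np.random.default_rng(0)
clean = rng.normal(0.0, 2.0, (59, 600))   # small-amplitude noise: kept
blink = clean.copy()
blink[0, 300] += 150.0                     # simulated blink spike: rejected
assert mark_artifact(clean) is False
assert mark_artifact(blink) is True
```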
To assess the multisensory integration of gesture and homonym, we calculated single-subject averages triggered to the homonym IP for each condition, because we assumed that this is the point in time when the meanings of gesture fragment and homonym are integrated (see Obermeier et al., 2011). The epochs lasted from 200 msec prior to the IP until 1000 msec afterward, including a 200-msec prestimulus baseline. Ten ROIs were defined: anterior left (AL): AF7, F5, FC5; anterior center–left (ACL): AF3, F3, FC3; anterior center (AC): AFz, Fz, FCz; anterior center–right (ACR): AF4, F4, FC4; anterior right (AR): AF8, F6, FC6; posterior left (PL): CP5, P5, PO7; posterior center–left (PCL): CP3, P3, PO3; posterior center (PC): CPz, Pz, POz; posterior center–right (PCR): CP4, P4, PO4; posterior right (PR): CP6, P6, PO8. On the basis of visual inspection, a time window ranging from 200 to 500 msec after the homonym IP was selected to analyze the integration of gesture and homonym. A repeated-measures ANOVA with Timing (−600 msec, −200 msec, 0 msec, +120 msec), Gesture (D, S), Region (anterior, posterior), and ROI (1, 2, 3, 4, 5) as within-subject factors was performed. For the analysis at the target word, reflecting a more global semantic integration effect on the sentence level, epochs were time-locked to the onset of the target words and lasted from 200 msec prior to the onset until 1000 msec after the onset. A 200-msec prestimulus baseline was applied. The identical 10 ROIs as in the homonym analysis were used. The standard N400 time window ranging from 300 to 500 msec after target word onset was selected to analyze the N400 effects. Again, a repeated-measures ANOVA with Timing (−600 msec, −200 msec, 0 msec, +120 msec), Gesture (D, S), Region (anterior, posterior), and ROI (1, 2, 3, 4, 5) as within-subject factors was performed. In all statistical analyses, Greenhouse–Geisser correction (Greenhouse & Geisser, 1959) was applied where necessary. 
In the respective cases, the uncorrected degrees of freedom (df), the corrected p values, and the correction factor ɛ are reported. We report only effects that involve the critical factors Timing, Gesture, or Speech. Main effects of Timing are not reported, as they are difficult to interpret because of the considerable differences in movement parameters between the timing conditions at the position of the homonym. A 10-Hz low-pass filter was used for presentation purposes only.
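A minimal sketch of this epoching and ROI-averaging scheme (illustrative Python under assumed array shapes, not the authors' pipeline):

```python
# Epochs span -200 to +1000 msec around the trigger (homonym IP or target word
# onset), the mean of the 200-msec prestimulus interval is subtracted as
# baseline, and channels are averaged within the ten ROIs listed above.
import numpy as np

SFREQ = 500  # Hz, as recorded

ROIS = {
    "AL": ["AF7", "F5", "FC5"],  "ACL": ["AF3", "F3", "FC3"],
    "AC": ["AFz", "Fz", "FCz"],  "ACR": ["AF4", "F4", "FC4"],
    "AR": ["AF8", "F6", "FC6"],  "PL": ["CP5", "P5", "PO7"],
    "PCL": ["CP3", "P3", "PO3"], "PC": ["CPz", "Pz", "POz"],
    "PCR": ["CP4", "P4", "PO4"], "PR": ["CP6", "P6", "PO8"],
}

def epoch_around(continuous, trigger, tmin=-0.2, tmax=1.0, sfreq=SFREQ):
    """Cut a baseline-corrected epoch around a trigger sample index."""
    start, stop = trigger + int(tmin * sfreq), trigger + int(tmax * sfreq)
    epoch = continuous[:, start:stop].astype(float)
    baseline = epoch[:, : int(-tmin * sfreq)].mean(axis=1, keepdims=True)
    return epoch - baseline  # (n_channels, 600 samples)

def roi_average(epoch, channel_names, roi):
    """Average the epoch over the three electrodes of one ROI."""
    idx = [channel_names.index(ch) for ch in ROIS[roi]]
    return epoch[idx].mean(axis=0)
```

The single-subject averages entering the ANOVAs would then be means of such ROI waveforms over trials, taken within the 200–500 msec (homonym) or 300–500 msec (target word) analysis windows.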
Overall, participants responded correctly in 93% of all test trials, indicating that they paid attention to our stimuli. The difference between the movement and the word task was not significant (movement task: 92% correct; word task: 94% correct; F(1, 31) = 2.63, p > .13), suggesting that participants paid equal attention to gestures and speech.
As can be seen in Figure 2A, an increased negativity is elicited when the homonym is preceded by a subordinate gesture fragment as compared with a dominant one in some of the four timing conditions. The earliness of the effect can be explained by the use of the homonym IPs as triggers for the averages. A repeated-measures ANOVA yielded a significant main effect of Gesture, F(1, 31) = 5.64, p < .05, and a significant four-way interaction of Timing, Gesture, Region, and ROI, F(12, 372) = 2.63, p < .05, ɛ = .50. Step-down analyses on the basis of the factor Region revealed significant three-way interactions of Timing, Gesture, and ROI at anterior, F(12, 372) = 2.64, p < .05, ɛ = .42, and posterior sites, F(12, 372) = 2.57, p < .05, ɛ = .38. Resolving these interactions based on the factor Timing led to the following results. There was no significant effect for the −600 msec condition at either anterior or posterior sites (anterior: all Fs < 2.40, all ps > .09; posterior: all Fs < 1.79, all ps > .17). In the −200 msec condition, there was a significant interaction of Gesture and ROI at both anterior and posterior sites (anterior: F(4, 124) = 5.17, p < .05, ɛ = .41; posterior: F(4, 124) = 4.37, p < .05, ɛ = .51). Further resolution of these interactions revealed a significant effect of Gesture only for the two posterior left-hemispheric ROIs (PL: paired t(31) = 2.16, p < .05; PCL: paired t(31) = 2.58, p < .05); no other ROI showed a significant effect (all paired ts < 1.15, all ps > .25). In the 0 msec (synchrony) condition, there was a significant main effect of Gesture at both anterior and posterior sites (anterior: F(1, 31) = 7.21, p < .05; posterior: F(1, 31) = 5.28, p < .05) as well as a significant interaction of Gesture and ROI at posterior sites, F(4, 124) = 4.17, p < .05, ɛ = .46.
Further resolution of this interaction resulted in significant main effects of gesture for the posterior left and central ROIs (PL: paired t(31) = 2.64, p < .05; PCL: paired t(31) = 2.92, p < .01; PC: paired t(31) = 2.22, p < .05). Similar to the 0 msec condition, there was also a main effect of Gesture at anterior sites in the +120 msec condition, F(1, 31) = 6.76, p < .05; however, there was no significant effect at posterior sites (all Fs < 1.50, all ps > .20). Thus, integration of gesture and homonym seems to be possible only in the −200 msec, 0 msec, and +120 msec conditions.
ERP Data—Target Word
As Figure 2B shows, there is a clearly enhanced negativity for incongruent as compared with congruent gesture cues at the subordinate target word, seemingly independent of the synchrony manipulation. In the classic N400 time window (300–500 msec after target word onset), the repeated-measures ANOVA revealed a significant main effect of Gesture, F(1, 31) = 8.13, p < .01, a significant main effect of Target word, F(1, 31) = 16.42, p < .001, as well as a significant interaction of Gesture and Target word, F(1, 31) = 21.09, p < .0001. On the basis of this interaction, the simple main effects of Gesture were tested separately for the two Target word conditions. At subordinate target words, the N400 was significantly larger after a dominant gesture compared with a subordinate one (paired t(31) = 5.17; p < .0001). No such effect of Gesture–Target word congruency was found at dominant target words (paired t(31) = 1.09, p > .28; see Figure 2C). Thus, independent of our timing manipulation, gesture fragments showed a disambiguating effect at the subordinate target word.
This experiment set out to explore whether the semantic integration of gesture fragments and speech is confined to a particular time window of integration, as is usually found for perceptual integration in the multisensory literature. Four different levels of temporal synchrony were used between the offset of a gesture fragment and the IP of the corresponding speech unit, a homonym. For the −200 msec, 0 msec, and +120 msec conditions, a significant ERP difference between the fragments related to the dominant and the subordinate meaning of the homonym was observed. This difference is an N400-like effect indicating that the integration of subordinate gesture fragments with the homonym is more difficult than the integration of dominant gesture fragments; that is, the observed effect is probably related to word meaning frequency, similar to effects found in previous gesture studies (Obermeier et al., 2011, 2012). No such effect was present in the −600 msec condition. The topographically different distribution of the single effects is difficult to interpret, given the limited spatial resolution of ERPs. However, one could speculate that the more anterior distribution in the +120 msec condition, in comparison with the 0 msec condition, relates to the fact that the participants have already identified the homonym before the DP of the gesture and can use this information to identify the meaning of the gesture fragment. In contrast to the time sensitivity at the homonym, there was no significant impact of synchrony on the more global disambiguation effect at the target word downstream in the sentence. All four conditions led to a significant N400 effect between incongruent and congruent gesture cues at the subordinate target word, but not at the dominant target word. The restricted sensitivity for the subordinate target words replicates previous findings using the identical stimulus material (Obermeier et al., 2011, 2012).
In the following, we will first discuss the time window of integration between gesture and homonym in relation to other multisensory integration results. Then we will focus on the apparent discrepancy between the time-sensitive multisensory integration at the homonym and the time-insensitive disambiguation at the target word. Next we will examine what the discrepancy tells us about the nature (i.e., potential automaticity) of the multisensory integration between gesture and speech, and then we will infer some potential implications for real-life discourse.
A Time Window for Gesture–Speech Integration
The present findings suggest that there is indeed a temporal window for the semantic integration of gesture fragments and speech, that is, in this window, gesture and speech information are combined into a more precise informational concept (in our case a disambiguated homonym). On the basis of the present findings, this time window spans at least from −200 msec (audio lag) to +120 msec (gesture lag). These results extend the existing literature on gesture–speech processing and multisensory integration in several ways.
Previous gesture research has shown that the timing of gesture and speech information plays a crucial role in their integration (Habets et al., 2011; Leonard & Cummins, 2011; Obermeier et al., 2011; Treffner et al., 2008). When gesture and speech are in close synchrony, their integration is seemingly automatic or obligatory (for more details on the notion of automaticity, see below; but see also Kelly, Özyürek, & Maris, 2010). Furthermore, Habets et al. (2011) suggest that there is a temporal window for the integration of gestures and words ranging from −160 msec to 0 msec SOA. The present findings specify the findings of Habets et al. (2011) in important ways. First, in contrast to the SOA manipulation of Habets et al. (2011), we employed a manipulation with regard to the DP/IPs of gesture and speech, allowing us to specify the time window of semantic integration more precisely. Second, the use of gesture fragments instead of full-length gestures allowed us to ensure that the amount of gestural information in the speech lag conditions was always identical. Third, we used complete utterances in contrast to single words. If we compare the findings of Habets et al. (2011) with the present findings, it is clear that the time window for the integration of gesture and speech found in the present experiment goes beyond the time window for gestures and words found by Habets et al. (2011; −200 to +120 msec vs. −160 to 0 msec). One could argue that this difference is due to the different timing manipulations (SOA vs. IP) and that the results are thus not comparable. To compare the present data with previous findings, the timing manipulations (−600 msec, −200 msec, 0 msec, +120 msec) were recalculated [3] in terms of a SOA manipulation (−740 msec, −340 msec, −140 msec, and −20 msec [4]; see Figure 1). On a SOA basis, the temporal window of gesture fragment and speech integration ranges from −340 to −20 msec.
Whereas the left boundary (−340 msec) of the time window might be accurate, the right boundary (−20 msec) has to be further specified in terms of a potential visual lag. The recalculation makes the present findings even more intriguing: the SOA-based time window of integration for our data would start at −340 msec; that is, the gesture starts 340 msec prior to the homonym and is nevertheless semantically integrated with that word.
This finding is of relevance for other multisensory integration research. If one compares the time window of integration for gesture and speech with the time windows reported in previous multisensory integration research (usually between a SOA of −260 msec and +130 msec), two aspects become apparent. On the one hand, the time window we found for the semantic integration of gesture and speech based on the IP manipulation is very similar to those found in previous SOA-based multisensory research (e.g., Habets et al., 2011; van Wassenhove et al., 2007; Vatakis & Spence, 2006a, 2006c; Grant et al., 2004; Dixon & Spitz, 1980). On the other hand, it seems that the gesture–speech integration system can cope with much larger asynchronies than those reported in other audiovisual integration experiments. In the literature, it is hypothesized that the temporal window of integration increases with the complexity of the to-be-merged information (Vatakis & Spence, 2010; Vatakis, Navarra, Soto-Faraco, & Spence, 2007). If one assumes that gesture–speech integration is more complex (continuously changing visual and speech signals, an additional semantic level) than many of the previously studied multisensory stimuli (e.g., syllables and lip movements; van Wassenhove et al., 2007), it is not surprising that the gesture–speech integration system can cope with larger asynchronies. Alternatively, addressees might be able to cope with larger asynchronies because gesture and speech are produced with some asynchrony most of the time (e.g., McNeill, 1992). Morrel-Samuels and Krauss (1992), for example, found that gestures are initiated approximately 1000 msec prior to their lexical affiliate. In contrast, the temporal coupling between lip movements and speech (as, for instance, in the McGurk effect) is much stronger, most likely because it is based on purely physical features, in contrast to gesture and speech, which are semantically linked.
Therefore, it could be that we are habituated to a certain degree of asynchrony between gesture and speech and are thus able to cope with larger asynchronies.
Cross-modal Semantic Priming
One potential critique when exploring a time window of semantic integration relates to the possible confound of cross-modal semantic priming. In other words, if the leading signal (be it gesture or speech) has semantic content, it could prime the information of the lagging signal. The present design, however, carefully avoids this problem. First, the gesture fragments themselves do not have a clear-cut meaning. A rating study, in which participants had to guess the meaning of the gesture fragments without any context, showed that they were unable to identify the correct meaning of the gesture fragments. Thus, cross-modal priming because of gesture leading speech is improbable. Second, although a homonym has a clear semantic representation, its content is somewhat more complex than that of a nonambiguous word because both the dominant and the subordinate meanings are accessed simultaneously up to at least 200 msec after stimulus presentation (Swinney, 1979). After approximately 700 msec, one of the meanings has been selected (cf. Van Petten & Kutas, 1987), either on the basis of the ongoing context or on the basis of word meaning frequency. This means that, in the +120 msec condition, both meanings of the homonym are activated before the gesture fragment ends. After the fragment ends, it gets its meaning through the integration with the homonym information and can be used to select the correct meaning of the homonym. Thus, even though the lagging gesture fragment gets its power to disambiguate through integration with the homonym information, there is no way to explain such an effect as a form of cross-modal semantic priming.
How Automatic Is the Integration of Gesture and Speech?
The present data clearly show that there is a time window for gesture–speech integration, but what is the nature of this integration process? Both in gesture/language research as well as multisensory integration research, there has been quite some debate about the potential automaticity of integration. In gesture research, there have been claims that gesture–speech integration is automatic and cannot be avoided (McNeill, 1992) or is obligatory (Kelly, Creigh, et al., 2010). Some researchers have specified this stance by adding the temporal dimension and proposing that gesture–speech integration is relatively automatic within a certain temporal relation (or synchrony) between gesture and speech (Habets et al., 2011; Obermeier et al., 2011). In contrast, Holle and Gunter (2007) argue that gesture–speech integration is not obligatory but is influenced by situational factors (e.g., amount of meaningful gesture information). Similarly, in the multisensory literature, there is also a debate on the automaticity of multisensory integration. For instance, Talsma et al. (2010) suggest that attention and multisensory integration are linked in a tight and multifaceted way. In situations of low perceptual competition or high temporal synchrony, sensory events capture attention and the integration is rather automatic and bottom–up, whereas in other cases situational knowledge (as might be the case in Holle & Gunter, 2007) might influence the integration process via top–down driven spreading of attention to relevant stimuli. In the present experiment, there was not much perceptual competition within the stimulus material because we had only one audio and video stream and no incongruent stimuli. Additionally, we used a shallow task that did not force participants to integrate visual and auditory information. 
Therefore it is not unreasonable to suggest that within the time window the integration of gesture and speech is driven in a bottom–up, more or less automatic fashion (e.g., Talsma et al., 2010; van Atteveldt et al., 2007).
Integration and Sentence Processing: Multiple Time Windows?
The discussion up to now has been centered on the local gesture–speech integration, which takes place at the homonym. From a classical multisensory perspective, this is reasonable because this research typically concentrates on how two discrete events are integrated (e.g., Hirsh & Sherrick, 1961). From a more ecological multisensory (e.g., Vatakis & Spence, 2010) as well as gesture–speech perspective (e.g., Leonard & Cummins, 2011; Wu & Coulson, 2010; Loehr, 2007; McNeill, 1992), however, narrowing down to the word level is less optimal because, in a communicative situation, speech is usually much more complex than just a single word. We therefore have to explore how the time window of direct speech–gesture integration relates to the processing of a complete sentence. As stated above, the data at the homonym clearly indicate that the direct automatic integration of gesture and speech takes place between −200 msec and at least +120 msec. This is a relatively large time window, which is possibly due to the complexity of the integrated materials. The outcome of this integration can be seen when we consider the processing that takes place at the target word, which is either in accordance with the dominant or subordinate meaning of the homonym. The conditions that showed a significant integration effect on the homonym (i.e., −200, 0, and +120 msec) also showed clear effects of disambiguation on the target word, suggesting that speech–gesture integration disambiguated the homonym and therefore impacted target word processing. Intriguingly, the −600 msec condition also showed clear disambiguation effects on the target word without any sign of direct integration at the homonym. 
These data therefore suggest either that direct gesture–speech integration at the homonym is not necessary for the disambiguation of the target word or that the gesture information was integrated in a different way, possibly in an earlier time window [5]. This idea is in line with a notion put forward for the temporal integration of beat gestures and speech by Leonard and Cummins (2011), namely that salient points in the time course of gestures (gesture anchors) are closely linked to their respective anchor points in the speech stream (most likely the corresponding speech unit; see McNeill, 1992), thus potentially allowing multiple integrations of gesture and speech depending on anchor positions. To shed light on this puzzle, we first need to look at the exact timing of our gesture and speech conditions. As can be seen in Figure 1, the gestures of the −200, 0, and +120 msec conditions all overlap with, and have their DPs within, the time period when the homonym is uttered [6]. In contrast, the −600 msec condition overlaps with, and has its DP within, the utterance of the verb. This clear difference suggests that automatic integration of a gesture may require some overlap with a particular speech unit and that the DP possibly plays a crucial role. To provide some circumstantial evidence for this conjecture, we looked at the time course of gesture–speech alignment from the start of the critical sentence in our material and compared the integration effects of overlapping and nonoverlapping conditions. Before discussing the findings, it is important to note that our experiment was not specifically set up to explore this issue. To do so properly, one would need two widely separated positions in a sentence at which the integration of gesture information is similarly easy or difficult, and neutral or ambiguous when no gesture information is present. Clearly, our experiment does not have these characteristics at the verb/homonym positions.
To have the largest possible separation between the integration sites, we chose to compare the −600 msec condition (complete overlap with the verb) with the +120 msec condition, which clearly does not overlap with the verb. As can be seen in Figure 3, the −600 msec condition shows an enlarged negativity for the subordinate as compared with the dominant gestures already at the verb, starting at around −350 msec relative to homonym onset, which then extends into the homonym. In contrast, the +120 msec condition shows a similar effect (identical to the one reported in our analyses) but only at the homonym position, starting at around +450 msec. Importantly, the effect in the −600 msec condition not only resembles the effect in the +120 msec condition but also starts after the meaning of the verb has been retrieved by the addressee (between 200 and 300 msec after verb onset; see, for instance, Grosjean, 1980). Only then can the gesture fragments be integrated with the verb into a new meaning, similar to the integration of gesture fragment and homonym. The exploratory analysis therefore indicates that the gesture information in the −600 msec condition was integrated at the verb and not at the homonym. This integration impacted the accumulated meaning on the sentence level, leading to the integration differences observed at the target word.
This post hoc analysis raises some issues related to the timing of gesture–speech integration. It suggests that there are multiple positions in a sentence where gesture–speech integration can take place (for a similar notion for gesture–speech integration that is not semantically driven, see Leonard & Cummins, 2011; Treffner et al., 2008). At the moment, we hypothesize that those positions are at the content words of the sentence [7], but this needs further confirmation. The homonym data suggest that around such positions there exists a time window in which gesture–speech integration is carried out more or less automatically. Recent studies suggest that both the relative position of potential gesture anchors and speech anchors (Leonard & Cummins, 2011) and the linguistic characteristics of the speech signal (e.g., phonology, semantics; see Maier et al., 2010) impact the integration process. We assume that gesture–speech integration will usually occur in a visual-to-auditory form (Wilbiks & Dyson, 2013); that is, a gesture (or rather its anchor points) will be bound most strongly to the closest following content word or speech anchor. We hypothesize that this process and the related time windows of integration are largely driven by the temporal dynamics of the speech signal. It is clear, however, that all these conjectures need further studies and substantiation.
Ultimately, such a hypothesis would imply that in natural communication, where a gesture can last for a longer time, several salient aspects of a gesture will have their specific and possibly distinct impact on different positions in the sentence as time progresses. For example, within this framework, it is entirely possible that the preparation phase of a gesture has a completely different impact on speech than the stroke phase of the same gesture, which impacts speech further downstream in the sentence. Thus, in this scenario, speech drives the parser to detect meaning in the gesture within a particular time boundary around each content word and to integrate it more or less automatically. Note that this hypothesis argues against the classical view that gesture is integrated only at its so-called lexical affiliate (McNeill, 1992).
Task-related Processes and Their Impact on the Time Window Boundaries
When the gesture and speech anchors are too far apart, integration of gesture and speech does not take place. Does this mean that integration is impossible or that we can push the cognitive system to integrate the asynchronous information by means of, for instance, task instructions? Evidence for this idea comes from a series of studies by Obermeier et al. (2011) that used the same materials as the present experiment. In their Experiments 1 and 3, the gesture fragments were presented outside the time window of integration with a −1000 msec lag from the IP of the homonym. In Experiment 1, the participants had to integrate gesture and speech to solve the task (explicit task). In Experiment 3, the same shallow task as in the present experiment was used. The explicit task led to clear effects of gesture–speech integration at the homonym, whereas no gesture–speech integration was found for the experiment using the shallow task. These data show that, even when the time window for more automatic integration is exceeded, integration of gesture and speech is possible but under cognitive, probably attentional control (for a review, see Talsma et al., 2010). If the task or the environmental situation (i.e., gesture–speech videos embedded in babble noise; see Obermeier et al., 2011) drives cognition to integrate gesture and speech, it can be done in an effortful way. Note that similar time window characteristics (within the time window automatic integration and outside the time window under cognitive control) can be found for other multisensory signals. van Atteveldt et al. (2007), for instance, showed that task-induced top–down processes lead to multisensory integration of asynchronous letter–sound pairings, which are otherwise not integrated.
These data suggest a temporal window for direct semantic gesture–speech integration, ranging from at least −200 msec up to +120 msec. The auditory lag tolerated in gesture–speech integration is much larger than that found in previous multisensory integration research. One possible explanation relates to the higher complexity of the integration; another relates to the loose coupling between the gesture and speech channels. Within this time window, the integration is relatively automatic. Additionally, the present experiment hinted that there are possibly multiple positions in a sentence where direct gesture–speech integration can take place. Ultimately, this would imply that in natural communication, where a gesture lingers for some time, several aspects of that gesture will have their specific and possibly distinct impact on different positions in an utterance as time goes by.
We thank Arjan van Eerden and Kristiane Klein for their help during the data acquisition, Sven Gutekunst for technical assistance, Henning Holle for providing the original stimulus material and helpful discussions, Elizabeth Kelly for proofreading, and Angela D. Friederici for supporting our research. We would also like to thank three anonymous reviewers for their constructive comments on earlier versions of this manuscript.
Reprint requests should be sent to Christian Obermeier, Max Planck Institute for Human Cognitive and Brain Sciences, P.O. Box 500 355, 04303 Leipzig, Germany, or via e-mail: email@example.com.
Visual and auditory stimulation is not spatially coincident, gradual desynchronization leads to criterion shifts for synchrony judgments, etc.
Remember that, from a theoretical perspective, such a recalculation is less optimal because it assumes that at the beginning of the homonym its meaning is present.
Note that a recalculation based on video frames instead of time in milliseconds leads to slightly larger SOAs: −19 frames (corresponding to −760 msec), −9 frames (corresponding to −360 msec), −4 frames (corresponding to −160 msec), and −1 frame (corresponding to −40 msec).
In the case of our gesture fragments, the DP is the position where the gesture fragment gets its disambiguating power and is the most likely integration anchor.
Note that in the Obermeier et al. (2011) experiment, their −1000 msec condition did not show any impact on both the homonym and the target word. As can be seen in Figure 1, this condition overlapped with the sentence intervening pause and the pronoun of the second sentence. The DP was at the pronoun. Because the DP plays a crucial role in the integration as gesture anchor, these data suggest that a function word will not lead to gesture–speech integration.