Speakers plan the phonological content of their utterances before their release as speech motor acts. Using a finite alphabet of learned phonemes and a relatively small number of syllable structures, speakers are able to rapidly plan and produce arbitrary syllable sequences that fall within the rules of their language. The class of computational models of sequence planning and performance termed competitive queuing models has followed K. S. Lashley [The problem of serial order in behavior. In L. A. Jeffress (Ed.), Cerebral mechanisms in behavior (pp. 112–136). New York: Wiley, 1951] in assuming that inherently parallel neural representations underlie serial action, and this idea is increasingly supported by experimental evidence. In this article, we develop a neural model that extends the existing DIVA model of speech production in two complementary ways. The new model includes paired structure and content subsystems [cf. MacNeilage, P. F. The frame/content theory of evolution of speech production. Behavioral and Brain Sciences, 21, 499–511, 1998] that provide parallel representations of a forthcoming speech plan as well as mechanisms for interfacing these phonological planning representations with learned sensorimotor programs to enable stepping through multisyllabic speech plans. On the basis of previous reports, the model's components are hypothesized to be localized to specific cortical and subcortical structures, including the left inferior frontal sulcus, the medial premotor cortex, the basal ganglia, and the thalamus. The new model, called gradient order DIVA, thus fills a void in current speech research by providing formal mechanistic hypotheses about both phonological and phonetic processes that are grounded in neuroanatomy and physiology. This framework also generates predictions that can be tested in future neuroimaging and clinical case studies.
Here we present a neural model that describes how the brain may represent and produce sequences of simple, learned speech sounds. This model addresses the question of how, using a finite inventory of learned speech motor actions, a speaker can produce arbitrarily many utterances that fall within the phonotactic and linguistic rules of her language. At the phonological level of representation, the model implements two complementary subsystems, corresponding to the structure and content of planned speech utterances within a neurobiologically realistic framework that simulates interacting cortical and subcortical structures. This phonological representation is hypothesized to interface between the higher level conceptual and morphosyntactic language centers and the lower level speech motor control system, which itself implements only a limited set of learned motor programs. In the current formulation, syllable-sized representations are ultimately selected through phonological encoding, and these activate the most appropriate sensorimotor programs, commanding the execution of the planned sound. Construction of the model was guided by previous theoretical work as well as clinical and experimental results, most notably a companion fMRI study (Bohland & Guenther, 2006).
Much theoretical research has focused on the processes involved in language production. One approach has been to delineate abstract stages through which a communicative concept is subjected to linguistic rules and ultimately transformed into a series of muscle activations used for speech production (Garrett, 1975). This approach has led to the development of the influential Nijmegen model (Levelt, Roelofs, & Meyer, 1999), which casts the speech system as a set of hierarchical processing stages, each of which transforms an input representation of a certain form (at a certain linguistic “level”) to an output representation in a different “lower level” form. The current work addresses the proposed phonological encoding and phonetic encoding stages and interfaces with an existing model, the Directions Into Velocities of Articulators (DIVA) model of speech production (Guenther, Ghosh, & Tourville, 2006; Guenther, Hampson, & Johnson, 1998; Guenther, 1995), which describes the stage of articulation. The present model, called gradient order DIVA (GODIVA), describes the ongoing parallel representation of a speech plan as it cascades through the stages of production. Although we do not address higher level linguistic processes, the proposed architecture is designed to be extensible to address these in future work.
A limitation of the DIVA model, which accounts for how sensorimotor programs for speech sounds1 can be learned and executed, is that it contains no explicit planning representations outside the activation of a single speech sound's stored representation, nor does it address the related issue of appropriately releasing planned speech sounds to the motor apparatus (referred to here as initiation). GODIVA adds abstract phonological representations for planned speech sounds and their serial order and simulates various aspects of serial speech planning and production. Furthermore, the model follows recent instantiations of the DIVA model (Guenther, 2006; Guenther et al., 2006) by proposing specific neuroanatomical substrates for its components.
Models of Serial Behavior
At the heart of any system for planning speech must be a mechanism for representing items to be spoken in the correct order. A number of concrete theoretical proposals have emerged to model serial behaviors (see Bullock, 2004; Rhodes, Bullock, Verwey, Averbeck, & Page, 2004; Houghton & Hartley, 1995). Associative chaining theories postulate that serial order is stored through learned connections between cells representing successive sequence elements and that each node's activation in turn causes activation of the subsequent node, enabling sequential readout. In their simplest form, however, these models cannot learn to unambiguously read out different sequences defined over the same component items. Wickelgren's (1969) speech production model addressed this problem by introducing many context-sensitive allophones (e.g., /kæt/ for phoneme /æ/ when preceded by /k/ and followed by /t/) as nodes through which a serial chain could proceed. However, such a model does not capture the relationship between the same phonemes in different contexts and suffers a combinatorial explosion in the requisite number of nodes when allowing sequences that can overlap by several consecutive phonemes. More recent neural network models (Beiser & Houk, 1998; Elman, 1990; Lewandowsky & Murdock, 1989; Jordan, 1986) proposed revisions that rely on a series of sequence-specific internal states that must be learned to allow sequence recall. Although such networks overcome the central problem above, they provide no basis for novel sequence performance and have difficulty simulating cognitive error data because, if a "wrong link" is followed in error, there is no means to recover and correctly produce the remaining items (Henson, Norris, Page, & Baddeley, 1996).
Alternatively, strict positional models represent serial order by the use of memory “slots” that signify a specific ordinal position in a sequence. Sequence performance then simply involves stepping through the series of slots (always in the same order) and executing each associated component item. Unfortunately, there is no obvious neural mechanism to allow the insertion of an arbitrary memory (or memory pointer) into a particular “slot.” Such models either require the ability to “label” a positional node with a particular representation or component item or require a set of all possible representations to be available at all serial positions, which is often infeasible. Recent models within this lineage hypothesize order to be accounted for by some contextual signal such as the state of an oscillatory circuit or other time-varying function (Brown, Preece, & Hulme, 2000; Vousden, Brown, & Harley, 2000; Burgess & Hitch, 1999; Henson, 1998). Recall then involves “replaying” this contextual signal that, in turn, preferentially activates the items associated with the current state of the signal. Such models require the ability to form associations between context signal and component item through one-shot learning to allow for novel sequence performance.
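The associate-and-replay scheme just described can be made concrete with a toy sketch. The fragment below is purely illustrative and is not a reconstruction of any of the cited models: it assumes orthogonal context states and one-shot outer-product (Hebbian) binding, the simplest case in which replaying the context signal recovers the bound items in order.

```python
import numpy as np

def context_recall(assoc, context_states):
    """Recall by replaying a time-varying context signal: each state
    preferentially reactivates the item bound to it during encoding.
    `assoc` is a one-shot association matrix (context dims x item dims)."""
    recalled = []
    for c in context_states:
        item_activations = assoc.T @ c       # items cued by the current state
        recalled.append(int(np.argmax(item_activations)))
    return recalled

# Three orthogonal context states, one-shot bound to items 2, 0, 1 in turn.
contexts = np.eye(3)
items = np.eye(3)[[2, 0, 1]]
assoc = contexts.T @ items                   # outer-product (Hebbian) learning
print(context_recall(assoc, contexts))       # → [2, 0, 1]
```

Because binding is a single outer product, a novel sequence can be encoded in one pass, which is exactly the one-shot learning requirement noted above.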
Lashley (1951) deduced that associative chaining models could not sufficiently describe the performance of sequences including those comprising speech and language and that serial behavior might instead be performed based on an underlying parallel planning representation. Grossberg (1978a, 1978b) developed a computational theory of short-term sequence memory in which items and their serial order are stored via a primacy gradient using the simultaneous parallel activation of a set of nodes, where relative activation levels of the content-addressable nodes code their relative order in the sequence. This parallel working memory plan, which can be characterized as a spatial pattern in a neuronal map, can be converted to serial performance through an iterative competitive choice process in which (i) the item with the highest activation is chosen for performance, (ii) the chosen item's activation is then suppressed, and (iii) the process is repeated until the sequence reaches completion. These types of constructions have been collectively termed competitive queuing (CQ) models (Bullock & Rhodes, 2003; Houghton, 1990). Figure 1 illustrates the basic CQ architecture. Recently, this class of CQ models has received substantial support from direct neurophysiological recordings in monkeys (Averbeck, Chafee, Crowe, & Georgopoulos, 2003) and from chronometric analyses of seriation errors (Farrell & Lewandowsky, 2004). A CQ-compatible architecture forms the basis of various modules used in GODIVA.
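The iterative choice-and-suppress cycle at the core of CQ models can be sketched in a few lines. The activation values below are arbitrary illustrations of a primacy gradient, not parameters of any particular model:

```python
import numpy as np

def cq_readout(plan):
    """Serial readout from a parallel plan via competitive queuing.

    `plan` is a primacy gradient: one activation per item node, with
    higher activation coding an earlier intended serial position.
    """
    plan = np.array(plan, dtype=float)
    order = []
    while np.any(plan > 0):
        winner = int(np.argmax(plan))   # (i) competitive choice: most active wins
        order.append(winner)
        plan[winner] = 0.0              # (ii) suppress the chosen item's node
    return order                        # (iii) repeat until the plan is exhausted

# Gradient over three item nodes; node 0 is most active, so it is produced first.
print(cq_readout([0.9, 0.6, 0.3]))  # → [0, 1, 2]
```

Note that order lives entirely in the relative activation levels: permuting the gradient reorders the output without changing any connectivity, which is what allows novel sequences over a fixed item set.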
Although the majority of serial order theories have addressed short-term memory without explicit treatment of linguistic units, some models have been introduced to account for the processing of such representations for word production. Such models generally follow the theoretical work of Levelt (1989), Garrett (1975), and others. Existing linguistic models typically seek to address either patterns observed in speech error data or chronometric data concerning speaker RTs under certain experimental manipulations. Error data have highlighted the importance of considering speech planning and production at multiple levels of organization, specifically including word, syllable, phoneme, and feature as possible representational units. These are conceptually hierarchical, with higher units comprising one or more lower-level units.
MacNeilage (1998) proposed a close link between the syllable unit and the motor act of open-closed jaw alternation. In this proposal, a behaviorally relevant motor frame roughly demarcates syllable boundaries. Such a motor frame might be useful in delineating and learning individual "chunks" that contribute to a finite library of "performance units." However, the phonological syllable not only circumscribes a set of phonemes but also provides a schema describing the abstract "rules" governing which phonemes can occur in each serial position within the syllable (e.g., Fudge, 1969). To this end, the syllable can be broken into, at least, an onset and a rime, the latter of which contains the subelements nucleus and coda. Syllables contain phonemes, which are categorical and exhibit a many-to-one relationship between acoustic signals and perceptual labels, with all realizations of a particular phoneme being cognitively equivalent. The GODIVA model assumes the reality of phonemic categories, but its framework is amenable to alternative discrete categorizations. Although our proposal is for a segmental model that lacks explicit representation of either articulatory or acoustic features, we allow room for the implicit representation of featural similarity, which has been shown to have a significant influence on speech error patterns (MacKay, 1970) and production latencies (Rogers & Storkel, 1998).
Nearly all previous theories of phonological encoding or serial speech planning have proposed some form of factorization of the structure and the phonological content of an utterance, often in the form of syllable- or word-sized structural frames and phoneme-sized content.2 Such a division is motivated, in part, by the pattern of errors observed in spontaneously occurring slips of the tongue. MacKay (1970), in his study of spoonerisms, or phoneme exchange errors (e.g., saying “heft lemisphere” instead of the intended “left hemisphere”), noted the prominence of the syllable position constraint, in which exchanges are greatly biased to occur between phonemes occupying the same positional “slot” in different planned syllables. This constraint appears to be the strongest pattern observed in speech errors. Shattuck-Hufnagel (1979), for example, found that 207 of 211 exchange errors involved transpositions to and from similar syllabic positions. More recently, Vousden et al. (2000) found that approximately 90% of consonant movement errors followed this constraint. Treiman and Danis (1988) also noted that during nonword repetition, most errors are phonemic substitutions that preserve syllable structure. Such exchanges also follow a transposition distance constraint (MacKay, 1970), in that phonemes are more likely to exchange between neighboring rather than distant syllables. Beyond speech error data, priming studies have demonstrated effects in speech production based purely on CV structure (while controlling for content) at the syllable and word level (Meijer, 1996; Sevald, Dell, & Cole, 1995). Together, such data argue for factorizing abstract frames from phonemic content, an organizational principle that is exploited in the GODIVA model at the level of syllables.
A shortcoming of previous theoretical psycholinguistic proposals has been their general failure to account for how linguistic behavior can emerge from specific neural structures (Nadeau, 2001). GODIVA makes use of information-processing constructs similar to those proposed elsewhere but embeds these in a biologically realistic architecture with specific hypotheses about cortical and subcortical substrates. These hypotheses are based on integrating the sparse available data from clinical and functional imaging studies and from inferences drawn from nonspeech sequencing tasks in other species, under the assumption that similar mechanisms should underlie linguistic processes and other complex serial behaviors. We emphasize that the use of artificial neural network architectures alone does not establish biological plausibility; rather, it is essential to explicitly consider what is known about the functional architecture of specific systems. In so doing, the GODIVA framework offers the ability to treat additional data sets that cannot be directly addressed by previous models. In particular, region-level effects in functional imaging and lesion studies can be related to specific model components. Although a review of the possible roles of various brain areas in speech planning and production is beyond the scope of this article, here we have elaborated on certain anatomical and physiological considerations involved in the new model development.
Left Inferior Frontal Sulcus
The left prefrontal cortex, specifically in and surrounding the ventral inferior frontal sulcus (IFS), showed increased activity in a memory-guided speaking task when the serial complexity of the utterance was increased (Bohland & Guenther, 2006). This is consistent with a hypothesis that this region contains a representation of a forthcoming speech plan. A similar region in left dorsal inferior frontal gyrus has been suggested to be involved in sequencing discrete units, including phonemes (Gelfand & Bookheimer, 2003), and in phonological encoding tasks (Papoutsi et al., 2009; Chein & Fiez, 2001; Burton, Small, & Blumstein, 2000; Poldrack et al., 1999).
In a nonspeech sequencing study in macaque monkeys, Averbeck et al. (2003) and Averbeck, Chafee, Crowe, and Georgopoulos (2002) recorded single-cell activity from the right hemisphere prefrontal cortex during a sequential shape copying task. These recording sites were within approximately 5 mm of the ventral portion of the arcuate sulcus, which has been proposed to be homologous to the IFS in humans (Rizzolatti & Arbib, 1998). Activities of cell ensembles coding for specific segments in the shape were recorded during a delay period before the first stroke and throughout the performance of the stroke sequence. In the delay period, a cotemporal representation of all of the forthcoming segments was found, and the relative activity in each neuron ensemble predicted the relative priority (i.e., order) in which the segments were performed. After execution of each segment, the activation of its ensemble representation was strongly reduced, and the other ensembles' activations increased. Such item-specific primacy gradients were observed even after sequences were highly practiced, to the point of demonstrating “coarticulation.” Further analyses have shown a partial normalization of total activation distributed among the representation for planned items (Averbeck et al., 2002; Cisek & Kalaska, 2002). This agrees with predictions based on planning layer dynamics in CQ models (Grossberg, 1978a, 1978b). Because total activity growth is a decelerating and saturating function of the number of planned items in a sequence, relative activation levels become more difficult to distinguish, and more readily corrupted by noise (Page & Norris, 1998), in longer sequences. These properties help explain why there is a low limit (see Cowan, 2000) to the number of items that can be planned in working memory and recalled in the correct order. 
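This capacity argument can be illustrated with a toy simulation. The gradient shape, normalization scheme, noise level, and trial count below are arbitrary choices for demonstration, not quantities fitted to the electrophysiological data:

```python
import numpy as np

rng = np.random.default_rng(0)

def recall_accuracy(n_items, noise_sd, trials=2000):
    """Fraction of perfectly ordered recalls from a normalized primacy gradient.

    Normalizing total activation means that, as list length grows, adjacent
    activation gaps shrink, so a fixed amount of noise corrupts serial order
    more readily in longer sequences.
    """
    correct = 0
    for _ in range(trials):
        gradient = np.linspace(1.0, 0.5, n_items)
        gradient /= gradient.sum()          # normalization of total activity
        noisy = gradient + rng.normal(0.0, noise_sd, n_items)
        # Recall is correct only if the noisy gradient preserves the order.
        if np.all(np.argsort(-noisy) == np.arange(n_items)):
            correct += 1
    return correct / trials

for n in (3, 5, 7, 9):
    print(n, recall_accuracy(n, noise_sd=0.01))
```

Under these assumptions, accuracy falls steeply with list length, qualitatively reproducing the low span limit for ordered recall noted above.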
Taken together, these electrophysiological findings provide compelling evidence for CQ-like dynamics in the prefrontal cortex, in a location near the possible homologue for human IFS. The GODIVA model posits that forthcoming phonemes are planned in parallel in or around the left hemisphere IFS, consistent with evidence from fMRI.
Medial Frontal Cortex
The medial frontal cortex, consisting of the more posterior SMA and the more anterior pre-SMA (Picard & Strick, 1996; Matsuzaka, Aizawa, & Tanji, 1992), has been implicated in sequencing and speech production tasks. Lesions to the medial wall cause speech problems (Pai, 1999; Ziegler, Kilian, & Deger, 1997; Jonas, 1981, 1987), usually resulting in reduced propositional speech with nonpropositional speech (e.g., counting, repeating words) largely intact. Other problems include involuntary vocalizations, echolalia, lack of prosody, stuttering-like output, and variable speech rate. Both Ziegler et al. (1997) and Jonas (1987) have suggested that the SMA plays a role in sequencing and self-initiation of speech sounds but that it is unlikely that these areas code for specific speech sounds.
In monkey studies of complex nonspeech tasks, sequence-selective cells have been identified in both SMA and pre-SMA (Shima & Tanji, 2000) that fire during a delay period before the performance of a specific sequence of movements in a particular order. This study also identified interval-selective cells, mostly in the SMA, that fired in the time between two particular component movements. Rank-order-selective cells have also been found, primarily in the pre-SMA, whose activity increased before the nth movement in the sequence, regardless of the particular movement (see also Clower & Alexander, 1998). Finally, Shima and Tanji (2000) found that only 6% of pre-SMA cells were selective to particular movements as opposed to 61% of cells in the SMA, indicating a cognitive-motor functional division.
Bohland and Guenther (2006) described differences between anterior and posterior medial areas, with the pre-SMA increasing activity for more complex syllable frames (e.g., CCCV3 vs. CV) and with the SMA increasing activity during overt speaking trials relative to preparation-only trials. These findings suggest differential roles for the pre-SMA and SMA in speech, which has also been demonstrated elsewhere (Alario, Chainay, Lehericy, & Cohen, 2006). In the present proposal, we hypothesize that the pre-SMA encodes structural "frames" at an abstract level (cf. MacNeilage, 1998), whereas the SMA serves to initiate or release planned speech acts. We view these specific hypotheses as only tentative, although consistent with available evidence, and suggest that further experiments are needed to determine both the localization and the level of representation of any such abstract representations.
Basal Ganglia Loops
Interactions between the cortex and the BG are organized into multiple loop circuits (Middleton & Strick, 2000; Alexander & Crutcher, 1990; Alexander, DeLong, & Strick, 1986). The BG are involved in sequencing motor acts (e.g., Harrington & Haaland, 1998), and abnormalities in these regions variously impact speech production (Murdoch, 2001; Kent, 2000), with some patients having particular difficulty fluently progressing through a sequence of articulatory targets (Ho, Bradshaw, Cunnington, Phillips, & Iansek, 1998; Pickett, Kuniholm, Protopapas, Friedman, & Lieberman, 1998). Damage to the caudate is also associated with perseverations and paraphasias (Kreisler et al., 2000), and both structural and functional abnormalities have been observed in the caudate in patients with inherited verbal dyspraxia characterized by particular difficulties with complex speech sequences (Watkins et al., 2002; Vargha-Khadem et al., 1998). The architecture within BG loops is intricate (e.g., Parent & Hazrati, 1995), but here we adopt a simplified view to limit the model's overall complexity. The striatum, comprising the caudate nucleus and the putamen, receives inputs from different cortical regions. The striatum is dominated by GABA-ergic medium spiny projection neurons (Kemp & Powell, 1971), which are hyperpolarized and normally quiescent, requiring convergent cortical input to become active (Wilson, 1995). Also found in the striatum, but less prevalent (only 2–3% of striatal cells in rats, but perhaps as many as 23% in humans; Graveland, 1985), are heterogeneous interneurons, many of which exhibit resting firing rates and receive cortical, thalamic, and dopaminergic input (Tepper, Koos, & Wilson, 2004; Kawaguchi, 1993). Some of these cells are GABA-ergic and suitable to provide feedforward surround inhibition (Plenz & Kitai, 1998; Jaeger, Kita, & Wilson, 1994). 
Medium spiny neurons send inhibitory projections to cells in the pallidum including the globus pallidus internal (GPi) segment, which are tonically active and inhibitory to cells in the thalamus, which in turn project back to cortex (e.g., Deniau & Chevalier, 1985). Hikosaka and Wurtz (1989) found that BG output neurons enable voluntary saccades by means of a pause in the normally tonic inhibition delivered to spatially specific targets in the superior colliculus and motor thalamus.
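The disinhibitory chain just described can be sketched as a single-channel toy. All values and the threshold below are illustrative, not fitted parameters, and the sketch deliberately collapses the loop's dynamics into one steady-state computation:

```python
def thalamus_drive(cortical_input_to_striatum, threshold=0.5):
    """Gating by disinhibition: a simplified sketch of one BG loop channel.

    Striatal medium spiny neurons are hyperpolarized and quiescent until
    convergent cortical input exceeds threshold; when they fire, they
    inhibit tonically active GPi cells, whose pause releases (disinhibits)
    the thalamic target, which can then excite cortex.
    """
    striatum = 1.0 if cortical_input_to_striatum > threshold else 0.0
    gpi = max(0.0, 1.0 - striatum)   # tonic GPi activity, suppressed by striatal firing
    thalamus = max(0.0, 1.0 - gpi)   # tonic GPi inhibition holds thalamus at zero
    return thalamus

print(thalamus_drive(0.2))  # weak cortical input: gate stays closed
print(thalamus_drive(0.9))  # convergent input: GPi pauses, gate opens
```

The essential point for what follows is that the default state of the gate is closed, and cortex must supply sufficient convergent evidence before a plan is released.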
Mink (1996) and Mink and Thach (1993) outlined a conceptual model of BG function, suggesting that BG loops are used to selectively enable a motor program for output among competing alternatives. Such action selection models (also Gurney, Prescott, & Redgrave, 2001a, 2001b; Kropotov & Etlinger, 1999; Redgrave, Prescott, & Gurney, 1999), which suggest that the BG do not generate movements but rather select and enable them, are relevant for sequence performance. Mink, for instance, suggested that each component movement must be selected whereas other potential movements (e.g., those to be performed later in the sequence) must be inhibited. In response to such early models' omission of distinct planning and executive layers of cortex, Brown, Bullock, and Grossberg (2004) described a computational neural model (TELOS) for the control of saccades, in which cortico-BG loops make voluntary behavior highly context sensitive by acting as a “large set of programmable gates” that control planning layers' abilities to fully activate executive (output) layers, which drive action. In TELOS, the striatal stage of the gating system receives cortical inputs from cue-sensitive planning cells in various gateable cortical zones. By opening gates when it senses coherence among multiple cortical representations, the striatum promotes the most apt among competing cortical plans while also deferring execution of a plan until conditions favor its achieving success. The TELOS theory also proposed that BG loops divide the labor with higher resolution, laminar target structures in the frontal cortex (and superior colliculus). The BG output stages lack sufficient cell numbers to provide high-resolution action specifications; however, they can enable one frontal zone much more (than another), and within the favored zone, competition between representations ensures the required specificity of output. 
Here we simplify the BG circuit model but make essential use of the ideas of action selection by gating circuits that enable plan execution only when the striatum detects that multiple criteria have been satisfied.
Interface with Speech Motor Control
DIVA is a neural network model of speech motor control and acquisition that offers unified explanations for a large number of speech phenomena including motor equivalence, contextual variability, speaking rate effects, and coarticulation (Guenther et al., 1998; Guenther, 1995). DIVA posits the existence of a speech sound map (SSM) module in the left ventral premotor cortex and/or posterior inferior frontal gyrus pars opercularis that contains cell groups coding for well-learned speech sounds. SSM representations are functionally similar to a mental syllabary (Levelt & Wheeldon, 1994; Crompton, 1982), suggested by Levelt et al. (1999) to consist of a “repository of gestural scores for the frequently used syllables of the language” (p. 5). Using alternative terminology, SSM representations can be thought of as sensorimotor chunks or programs, learned higher order representations of frequently specified spatio-temporal motor patterns. As noted above, the GODIVA model follows these proposals as well as MacNeilage (1998) in placing the syllable as a key unit for speech motor output, but our general approach is amenable to output units of other sizes that exhibit repeating structural patterns (e.g., words or morphemes).
A syllable frequency effect, in which frequently encountered syllables are produced with a shorter latency than uncommon (but well-formed) syllables, has been reported by several researchers (Cholin, Levelt, & Schiller, 2006; Laganaro & Alario, 2006; Alario et al., 2004; Carreiras & Perea, 2004; Levelt & Wheeldon, 1994). Although Levelt and Wheeldon (1994) argued that the syllable frequency effect implied the use of stored syllable motor programs, it proved difficult to rule out that the effect could be due to phonological processing. Laganaro and Alario (2006) used a delayed naming task with and without an interfering articulatory suppression task to provide strong evidence that this effect is localized to the phonetic rather than phonological encoding stage, consistent with the role of the SSM module in DIVA.
In the extended GODIVA model, the SSM forms the interface between the phonological encoding system and the phonetic/articulatory system. Sensorimotor programs for frequently encountered syllables can be selected from the SSM in full, whereas infrequent syllables must be performed from smaller (e.g., phoneme-sized) targets. We note that, although this points to two routes for syllable articulation, in neither case does GODIVA posit a bypass of the segmental specification of forthcoming sounds in a phonological queue. Some dual-path models (Varley & Whiteside, 2001) predict that whereas novel or rare sequences are obliged to use an “indirect” process that requires working memory for sequence assembly, high-frequency sequence production instead uses a “direct” route that bypasses assembly in working memory and instead produces sequences as “articulatory gestalts.” Contrary to such dual-path models, Rogers and Spencer (2001) argued that sequence assembly in a working memory buffer is obligatory even for production of high-frequency words. Besides exchange error data, a key basis for their argument was the finding that “speech onset latencies are consistently … longer when the onsets of successive words are phonologically similar” (p. 71), even when the successive words are high frequency. This transient inhibitory effect is an expected “aftereffect” of a fundamental CQ property: active suppression of a chosen item's representation. Consistently, although both the Nijmegen model and the GODIVA model have provisions for modeling differences in the production of high- versus low-frequency sequences, neither make the problematic assumption that automatization eventually entails bypassing assembly. That assumption is incompatible with the exquisite control of vocal performance that speakers/singers retain for even the highest frequency syllables.
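The two routes out of the SSM can be sketched schematically. The syllable and phoneme inventories below are hypothetical placeholders, not model data, and the lookup logic is only a caricature of the selection dynamics described later:

```python
# Hypothetical syllabary: only frequent syllables have stored whole-syllable
# sensorimotor programs; phoneme-sized targets serve as the fallback.
SSM_SYLLABLES = {"ba", "ta", "stri"}           # illustrative entries only
SSM_PHONEMES = {"b", "a", "t", "s", "r", "i"}  # illustrative entries only

def select_programs(syllable, phonemes):
    """Pick a whole-syllable program when one is stored; otherwise fall back
    to a sequence of phoneme-sized programs. In both cases the syllable has
    already been specified segmentally in the phonological queue."""
    if syllable in SSM_SYLLABLES:
        return [syllable]
    return [p for p in phonemes if p in SSM_PHONEMES]

print(select_programs("ba", ["b", "a"]))        # frequent: one whole-syllable program
print(select_programs("sba", ["s", "b", "a"]))  # rare: assembled from phoneme targets
```

The design point to notice is that the branch affects only which stored programs are read out, not whether segmental assembly occurs; that assembly happens upstream in either case.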
A detailed description of the DIVA model is available elsewhere (Guenther, 2006; Guenther et al., 2006). Below, we specify a computational neural model that extends DIVA to address the planning and initiation of sequences of connected speech and the brain regions likely to be involved in those processes.
Here we provide a high-level overview of the GODIVA model, followed by its formal specification. A comprehensive illustration of the theoretical model is shown in Figure 2. In this figure, boxes with rounded edges refer to components that have been addressed in the DIVA model (Guenther, 2006; Guenther et al., 2006). Here we further conceptualize some brain areas attributed to the phonetic/articulatory levels of processing but focus on higher level processing, providing implementation details only for boxes drawn with dotted borders. The model contains dual CQ-like representations of the forthcoming utterance at the phonological level, hypothesized to exist in the pre-SMA and IFS regions, with selection mediated by a BG "planning loop." Selected phonological codes interface with an elaborated SSM to select best matching sensorimotor programs for execution. Because available data are sparse, the GODIVA model should only be considered as a starting point: one possible interpretation of existing data. The key advancements that we hope to achieve in this proposal are (i) to tie information processing accounts of phonological processes in speech production to hypothetical neural substrates and (ii) to bring together models of articulation and theories of speech planning and preparation. Explicit calls for bridging this latter gap have been made in the study of communication disorders (McNeil, Pratt, & Fossett, 2004; Ziegler, 2002).
The “input” to the GODIVA model during ordinary speech production is assumed to arrive from higher level lexical/semantic and/or syntactic processing areas, possibly including the inferior or ventrolateral prefrontal regions of the cortex, or from posterior regions in repetition tasks. In most cases, these inputs are thought to code lexical items (words) or short phrases and arrive sequentially as incremental processing is completed by the higher level modules. These inputs initiate the activation of two parallel and complementary representations for a forthcoming utterance: a phonological content representation hypothesized to exist in the left hemisphere IFS and a structural frame representation hypothesized to exist in the pre-SMA. Both representations constitute planning spaces or forms of working memory, where representative neurons or populations of neurons maintain a cortical code for the potential phonemes (in the IFS) or abstract syllable frames (in the pre-SMA) that define the utterance. In GODIVA, these representations simultaneously, cotemporally code for multiple forthcoming phonemes and syllable frames by use of primacy gradients, in which relative activation levels code for the serial order in which the items are to be produced. These gradients over plan cells are maintained for a short duration through recurrent dynamics and can robustly handle new inputs as they arrive without disruption of ongoing performance, up to a certain item capacity determined by the signal-to-noise ratio of the representation. Both the IFS and the pre-SMA plan layers thus take the form of “item and order memories” (Grossberg, 1978a, 1978b) or, equivalently, planning layers in CQ circuits (Bullock & Rhodes, 2003).
The model's production process begins when the most active frame in the pre-SMA planning layer is selected within a second set of pre-SMA cells, the choice layer. The division of cortical representations into plan and choice layers within a columnar architecture is repeated throughout the model (see Figure 2). Activation of a pre-SMA choice cell initiates the firing of a chain of additional pre-SMA cells, each corresponding to an abstract position (but not a specific phoneme) in the forthcoming syllable. These pre-SMA cells give input to a BG-mediated planning loop, which serves as an input gate to the choice layer in the IFS region, effectively controlling access to the output phonological representation, which drives activation in the planning layer of the SSM, a component of the DIVA model storing phonetic representations, which is further elaborated here. This planning loop specifically enables topographic zones in the IFS choice layer that correspond to appropriate syllable positions for the immediately forthcoming syllable. Strong competition among IFS choice cells in each positional zone results in a single “winning” representation within each active positional zone. As in standard CQ-based models, any IFS and pre-SMA choice cells that become active (“win”) selectively suppress the planning representations to which they correspond.
IFS choice cells form cortico-cortical synapses with cell populations in the SSM that, following the hypotheses of the DIVA model, enable the “readout” of motor programs as well as auditory and somatosensory expectations for simple learned speech sounds. The SSM is hypothesized to occupy the left posterior inferior frontal gyrus (opercular region) and adjoining left ventral premotor cortex (Guenther, 2006); we will use the term frontal operculum as shorthand for this region. Learning of the IFS → SSM synapses is suggested to occur somewhat late in development, after a child has developed well-defined phonetic/phonological perceptual categories for his or her language. These tuned synapses (which are defined algorithmically in the model) allow the set of winning choice cells in the IFS choice layer to activate a set of potential “matching” sensorimotor programs represented by SSM plan cells, with better matching programs receiving higher activations. Because one IFS choice cell is active for each position in the forthcoming syllable, this projection transforms a phonological syllable into a speech motor program.
SSM plan cells give input to SSM choice cells, which provide output to hypothesized lower level motor units. Competition via recurrent inhibition among SSM choice cells allows a single sensorimotor program to be chosen for output to the motor apparatus. We postulate an additional BG loop (motor loop in Figure 2) that handles the appropriate release of planned speech sounds to the execution system. The chosen SSM output cell is hypothesized to activate motor plan cells primarily in the left-hemisphere motor cortex that, together with inputs from the SMA, bid for motor initiation. A new motor program will be initiated only upon completion of the previous program. The uncoupling of the selection of motor programs from the timing of initiation allows the system to proceed with selection before the completion of the previous chosen program. This provides a simple mechanism to explain differences between preparation and production and between covert and overt speech.
The GODIVA model is defined as a system of differential equations describing the activity changes of simulated neuron populations through time. Equations were numerically integrated in MATLAB using a Runge–Kutta method for noise-free simulations and the Euler method when noise was added. Table 1 provides a legend for the symbols used in the following equations to define cell groups. To reduce complexity, cortico-cortical inhibitory projections, which likely involve a set of intervening interneurons between two sets of excitatory neurons, are modeled as a single inhibitory synapse from a cell that can also give excitatory projections. Note that the present model is “hand wired.” That is, synaptic weights that are assumed to be tuned through learning are set algorithmically within the range of values that learning must achieve for proper operation. Possible methods by which these weights could be learned are suggested in the Discussion section. In this version of the model, we have not exhaustively explored the parameter space for optimal settings; all reported simulations use the same parameter set (except for the noise terms).
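The numerical procedure for the noisy case can be sketched as follows. The dynamics and constants here are illustrative stand-ins (a simple leaky relaxation), not the model's actual equations; the point is the forward-Euler update with independent per-cell Gaussian noise added at each step:

```python
import numpy as np

def euler_integrate(deriv, x0, dt=0.001, steps=5000, sigma=0.0, seed=0):
    """Forward-Euler integration of dx/dt = deriv(x), optionally adding
    independent Gaussian noise to every cell at each time step."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x = x + dt * deriv(x)
        if sigma > 0.0:
            x = x + np.sqrt(dt) * sigma * rng.standard_normal(x.shape)
    return x

# Example: leaky cells relaxing toward a constant input u.
u = np.array([1.0, 0.5])
x = euler_integrate(lambda x: -x + u, x0=[0.0, 0.0])
print(np.round(x, 3))  # approaches u after ~5 time constants
```

With sigma = 0 this reduces to deterministic Euler; a Runge–Kutta scheme would be preferred for the noise-free case, as noted above.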
Table 1. Cell groups and their symbols in the model equations.

| Cell Group | Symbol |
| --- | --- |
| External input to IFS | u^p |
| External input to pre-SMA | u^f |
| IFS phonological content plan cells | p |
| IFS phonological content choice cells | q |
| Pre-SMA frame plan cells | f |
| Pre-SMA frame choice cells | g |
| Pre-SMA positional chain cells | h |
| Planning loop striatal projection cells | b |
| Planning loop striatal interneurons | b |
| Planning loop GPi cells | c |
| Planning loop anterior thalamic cells | d |
Phonological Content Representation in IFS
The IFS representation consists of two layers, one containing plan cells and one containing a corresponding set of choice cells. A plan cell and a corresponding choice cell represent a simplified cortical column. Figure 3 illustrates two such IFS columns from a single positional zone as well as their theoretical inputs and outputs. The idealized IFS columns are hypothesized to be tuned to a particular phoneme and to a particular abstract syllable position. The IFS map can thus be thought of as a two-dimensional grid, where each row corresponds to a particular phoneme and each column to a particular syllable position (see Figure 4). Seven syllable positions are included in the model. These correspond to a generic syllable template, such as that introduced by Fudge (1969) and used in the model of verbal STM introduced by Hartley and Houghton (1996). The vast majority of English syllables can be represented in this template by assigning particular phonemes to particular template slots.4 In GODIVA, the middle (fourth) position is always used to represent the syllable nucleus (vowel), preceding consonants are represented in preceding positional zones, and succeeding consonants in succeeding positional zones.5 Within a particular positional zone (corresponding to the long axis in Figure 4), an activity gradient across plan cells defines the serial order of the phonemic elements. For example, Figure 4 schematizes the representation of the planned utterance “go.di.və” (“go diva”) in the IFS phonological planning layer. Competitive interactions in the IFS map model are restricted to within-position interactions; in essence, therefore, this representation can be thought of as having multiple competitive queues, one for each syllable position. The model includes representations for 53 phonemes (30 consonants and 23 vowels) derived from the CELEX lexical database (Baayen, Piepenbrock, & Gulikers, 1995).
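The two-dimensional phoneme-by-position map can be sketched with a toy phoneme inventory (here “@” stands in for schwa, and the activation levels are arbitrary illustrative values). Each positional zone holds its own primacy gradient and behaves as an independent competitive queue:

```python
import numpy as np

# Toy phoneme inventory ("@" stands in for schwa) and the 7-slot template.
phonemes = ["g", "d", "v", "o", "i", "@"]
N_POS = 7
ifs_plan = np.zeros((len(phonemes), N_POS))  # rows: phonemes; cols: positions

def load_gradient(items, pos):
    """Load a primacy gradient for `items` into positional zone `pos`."""
    for rank, ph in enumerate(items):
        ifs_plan[phonemes.index(ph), pos] = 1.0 - 0.2 * rank

load_gradient(["g", "d", "v"], pos=2)  # onsets of go, di, v@ (model Position 3)
load_gradient(["o", "i", "@"], pos=3)  # nuclei (model Position 4)

# Each positional zone acts as an independent competitive queue:
onset_order = [phonemes[i] for i in np.argsort(-ifs_plan[:, 2])
               if ifs_plan[i, 2] > 0]
print(onset_order)  # -> ['g', 'd', 'v']
```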
In Equation 1, there are two primary sources of excitatory input. First, u^p_ij corresponds to a “word”6 representation that gives parallel excitation to the IFS plan layer. This input is assumed to arrive from one or more of three brain areas not explicitly treated in the model:
a higher level linguistic processing area involved in the morphosyntactic processing of an internally generated communicative concept, likely also in left prefrontal cortex;
a phonological processing region in the parietal cortex that can load the modeled phonological output system when the task is, for instance, reading or repetition; or
the inferior right-hemisphere cerebellum, which is hypothesized to assist in “fast-loading” of phonological content into this buffer (Rhodes & Bullock, 2002).
This transient input instantiates a gradient across IFS plan units that represents the ordered set of phonemes in the input “word.” The input is multiplicatively gated by a term α that can be used to ensure that the activity of cells receiving new inputs corresponding to words to be spoken later does not exceed the activity level of cells representing sounds to be spoken sooner, thus maintaining the correct order of planned speech sounds (e.g., Bradski, Carpenter, & Grossberg, 1994). The second excitatory input to cell p_ij is from itself. The constant θ_p is a noise threshold set to some small value and [·]+ indicates half-wave rectification, a function that returns the value of its argument (e.g., p_ij − θ_p) if positive, otherwise zero. Recurrent self-excitation allows this layer to maintain a loaded plan over a short duration even in the absence of external inputs.
Cell p_ij is inhibited by other plan cells coding different phonemes at the same syllable position. The inhibitory inputs are weighted by entries in the adjacency matrix W. In the simplest case (used in simulations here), entry W_ik is 1 for i ≠ k and 0 for i = k. This matrix can be modified to change the strength of phoneme–phoneme interactions, allowing for a partial explanation of phonemic similarity effects (see Discussion). Cell p_ij also receives strong inhibition from cell q_ij, its corresponding cell (in the same column) in the IFS choice layer. This input is thresholded by θ_q and amplified via a faster-than-linear activation function, y(x) = x² (Grossberg, 1973). This function can be thought of as a nonlinear neural response (e.g., spike rate varies nonlinearly with membrane potential) inherent to choice cells. The same activation function also guides self-excitatory activity among the choice cells in Equation 2.
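The form of the plan-cell dynamics described in the two paragraphs above can be sketched as follows. The constants (leak A, gate α, thresholds) are illustrative placeholders rather than the model's fitted parameter values, and the sketch covers one positional zone only:

```python
import numpy as np

def rect(x, theta=0.0):
    """Half-wave rectification [x - theta]^+."""
    return np.maximum(x - theta, 0.0)

def y(x):
    """Faster-than-linear choice-cell activation, y(x) = x^2."""
    return x ** 2

def dp_dt(p, q, u, alpha=1.0, A=1.0, theta_p=0.01, theta_q=0.01, W=None):
    """One positional zone of plan-cell dynamics: leak, gated external
    input, recurrent self-excitation, lateral inhibition weighted by W,
    and strong inhibition from the corresponding choice cells q."""
    n = len(p)
    if W is None:
        W = np.ones((n, n)) - np.eye(n)  # W_ik = 1 for i != k, 0 otherwise
    r = rect(p, theta_p)
    return -A * p + alpha * u + r - W @ r - y(rect(q, theta_q))

d = dp_dt(np.array([0.5, 0.3]), q=np.zeros(2), u=np.array([1.0, 0.5]))
```

With no external input and no choice-cell activity, the lateral inhibition and leak dominate and the gradient decays, consistent with the short-duration maintenance described above.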
Structural “Frame” Representations in Pre-SMA
Cells in the pre-SMA are hypothesized to serve as representations of structural frames that code abstract structure at a level above the phonemic content represented in the IFS. Although alternative representations are also plausible, in the current proposal, pre-SMA cells code for common syllable types and for their abstract “slots” or positions. For example, the model pre-SMA contains cells that code for the syllable type CVC as well as for C in onset position, V in nucleus position, and C in coda position. Acquisition of this set of representations is feasible because of linguistic regularities; most languages use a relatively small number of syllable types. An analysis of frequency of usage tables in the CELEX lexical database (Baayen et al., 1995) revealed that just eight syllable frames account for over 96% of syllable productions in English.
The pre-SMA frame representations are activated in parallel with the IFS phonological content representation. Like the IFS planning layer, multiple pre-SMA frame cells can be active simultaneously in the plan layer. The relative activation levels of pre-SMA plan cells encode the serial order of the forthcoming syllable frames, with more activity indicating that a frame will be used earlier. The model thus represents a speech plan in two parallel and complementary queues, one in the IFS and one in the pre-SMA. This division of labor helps to solve a combinatorial problem that would result if all possible combinations of frame and content required their own singular representation. Such a scheme would require tremendous neural resources in comparison to the method proposed, which separates the representational bases into two relatively small discrete sets. The syllable frames [CV], [CVC], [VC], [CVCC], [CCV], [CCVC], and [VCC], the most common in English according to the CELEX database, were implemented. To allow for repeating frame types in a forthcoming speech plan, the model included multiple “copies” of each syllable frame cell.
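The scale of the combinatorial saving argued for above can be illustrated with rough counts. The “combined” figure below is a hypothetical enumeration of one-cell-per-syllable coding for CVC syllables alone, using the model's inventory sizes:

```python
# Units needed under the dual scheme: one per (phoneme x position) plus frames.
phonemes, positions, frames = 53, 7, 8
dual = phonemes * positions + frames  # 53*7 + 8 = 379 representational units

# Versus dedicating one cell to every possible CVC combination alone
# (hypothetical enumeration: onset consonant x vowel x coda consonant):
combined = 30 * 23 * 30               # = 20,700 cells for CVC syllables only
print(dual, combined)
```

The gap widens further once every additional syllable shape (CV, CCVC, etc.) must also be enumerated under the combined scheme.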
The model pre-SMA contains not only cells that code for the entire abstract frame of a forthcoming syllable but also chains of cells that fire in rapid succession because they code for the individual abstract serial (phoneme-level) positions within the syllable frame. These two types of cells, one type that codes for an entire sequence (in this case a sequence of the constituent syllable positions within a syllable frame) and another type that codes for a specific serial position within that sequence, are similar to cells that have been identified in the pre-SMA in monkey studies (Shima & Tanji, 2000; Clower & Alexander, 1998). In the GODIVA model, the selection of a syllable frame cell (e.g., activation of a pre-SMA choice cell) also initiates the firing of the chain of cells coding its constituent structural positions (but not specific phonemes). The structure and the operation of the pre-SMA in the model are schematized in Figure 5.
For a single syllable, the temporal activity pattern in the pre-SMA proceeds as follows. First, a single choice cell is activated, corresponding to the most active syllable frame among a set of pre-SMA plan cells; upon the instantiation of this choice, the corresponding pre-SMA plan cell is suppressed. Next, the choice cell activates the first position cell in the positional chain corresponding to this syllable type. This cell and the subsequent cells in the positional chain give their inputs to zones in the caudate that have a one-to-one correspondence with positions in the syllable template and, equivalently, gateable zones in the IFS. These cortico-striatal projections form the inputs to the BG planning loop, which eventually enables the selection of the forthcoming syllable's constituent phonemes in the IFS choice field. When the positional chain has reached its completion, the last cell activates a corresponding cell in the SMA proper, which effectively signals to the motor portion of the circuit that the planning loop has prepared a new phonological syllable.
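The frame choice and positional chaining just described can be sketched as follows. The frame inventory and slot indices are illustrative (1-indexed model positions, with 3 = onset, 4 = nucleus, 5 = coda), not the model's full seven-slot template:

```python
# Illustrative frame inventory: each frame maps to its chain of template slots.
FRAME_POSITIONS = {"CV": [3, 4], "CVC": [3, 4, 5], "VC": [4, 5]}

def step_frame(frame_plan):
    """Choose the most active frame, suppress its plan cell, and return
    the positional chain (the caudate/IFS zones to enable, in order)."""
    frame = max(frame_plan, key=frame_plan.get)  # pre-SMA choice
    frame_plan[frame] = 0.0                      # suppress the chosen plan cell
    return frame, FRAME_POSITIONS[frame]

plan = {"CV": 0.9, "CVC": 0.6}                   # primacy gradient over frames
print(step_frame(plan))                          # -> ('CV', [3, 4])
```

In the model, the last cell of the returned chain would additionally activate the SMA proper, signaling that the planning loop has prepared a new phonological syllable.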
Cortico-striato-pallido-thalamo-cortical Planning Loop
Speech Sound Map
The SSM is a component of the DIVA model (Guenther et al., 1998, 2006; Guenther, 1995) that is hypothesized to contain cells that, when activated, “readout” motor programs and sensory expectations for well-learned speech sounds. In DIVA, tonic activation of an SSM cell (or ensemble of cells) is required to readout the stored sensory and motor programs throughout the production of the sound. To properly couple the system described herein with the DIVA model, GODIVA must provide this selective, sustained excitation to the appropriate SSM cells.
Response Release via the “Motor Loop”
The initiation or release of chosen speech motor programs for overt articulation is hypothesized to be controlled by a second loop through the BG, the motor loop. The proposal that two loops through the BG, one mediated by the head of the caudate nucleus and one mediated by the putamen, are important in cognitive and motor aspects of speech production, respectively, was supported by intraoperative stimulation results demonstrating dysarthria and articulatory deficits when stimulating the anterior putamen and higher level deficits including perseveration when stimulating the head of the caudate (Robles, Gatignol, Capelle, Mitchell, & Duffau, 2005). In GODIVA, the motor loop receives convergent input from the SMA and motor cortex and gates choice (or execution) cells in the motor cortex (Figure 2). In keeping with established views of distinct BG–thalamo-cortical loops, the motor loop receives inputs at the putamen, whereas the planning loop receives inputs from “higher level” prefrontal regions at the caudate nucleus (Alexander & Crutcher, 1990; Alexander et al., 1986). The motor loop gives output to the ventrolateral thalamus, as opposed to the ventral anterior thalamic targets of the model's planning loop.
Currently, this motor loop in the GODIVA model is not specified with the same level of detail as the previously discussed planning and selection mechanisms in the model. To achieve the same level of detail, it will be necessary to fully integrate the circuits described above with the existing DIVA model. For the sake of clarity and tractability, we leave such integration for future work while focusing on the new circuitry embodied by the GODIVA model. Nevertheless, a conceptual description of these mechanisms is possible and follows from the general architecture of the higher level portions of the model. Specifically, the activation of an SSM choice cell representing the forthcoming speech motor program is hypothesized to activate plan cells in the left motor cortex. These plan cells do not directly drive movement of the articulators, just as plan cell activity in other modules in GODIVA does not directly drive activity beyond that cortical region. Instead, overt articulation in the model requires the enabling of motor cortex choice cells via the BG-mediated motor loop. To “open the gate” and initiate articulation, the motor loop requires convergent inputs from motor cortex plan cells and from the SMA proper. This notion is based on three major findings from Bohland and Guenther (2006), which are consistent with other reports in the literature. These results are (i) that overt articulation involves specific additional engagement of the SMA-proper, (ii) that the putamen is particularly involved when speech production is overt, and (iii) that whereas the left hemisphere motor cortex may become active for covert speech or speech preparation, overt speech engages the motor cortex in both hemispheres.
Figure 8 provides a summary of the process by which the model produces a sequence of syllables.
Computer simulations were performed to verify the model's operation for a variety of speech plans. Figures 9 and 10 show the time courses of activity in several key model components during the planning and production of the syllable sequence “go.di.və” given two different assumptions about the model's initial performance repertoire. Figure 11 illustrates a typical phonological error made when noise is added to the model.
Performance of a Sequence of Well-learned Syllables
In the first simulation, the model's task is to produce this sequence assuming that each individual syllable (“go,” “di,” and “və”) has been learned by the speaker and thus a corresponding representation is stored in the model's SSM. Sensorimotor programs for these syllables must be acquired by the DIVA portion of the circuit; this learning process is described elsewhere (Guenther et al., 2006; Guenther, 1995). In this simulation, the 1,000 most frequent syllables from the CELEX database (which include the three syllables to be performed here) are represented in the SSM. The “input” to this simulation is a graded set of parallel pulses, applied at the time indicated by the first arrow in each panel of Figure 9. This input activates the two complementary gradients in the IFS plan layer (Figure 9A and B) and in the pre-SMA plan layer (not shown). This mimics the input signals that are hypothesized to arise from higher order linguistic areas. These inputs create an activity gradient across the /g/, /d/, and /v/ phoneme cells in Syllable Position 3 (onset consonant) and a gradient across the /o/, /i/, and /ə/ phoneme cells in Syllable Position 4 (vowel nucleus) in the IFS plan layer as well as a gradient across three “copies” of the [CV] frame cell in the pre-SMA. Figure 9A and B shows that the activation levels of the phonemes in these positional zones rise from the initial state of 0 and begin to equilibrate with each cell taking on a distinct activation level, thus creating the activity gradients that drive sequence performance.
After the first frame representation is activated in the pre-SMA choice layer, Positional Zones 3 and 4 are enabled in rapid succession in the IFS choice layer. This allows the most active phoneme in each IFS positional zone to become active. Figure 9C and D reveals this choice process, which results in the strong, sustained activation of the phonemes /g/ and /o/ in IFS Choice Zones 3 and 4, respectively, with the activation of Zone 4 occurring slightly later than activation in Zone 3. Immediately after the choice of /g/ and /o/ (Figure 9C and D), the representations for these phonemes in the IFS plan layer (Figure 9A and B) are rapidly suppressed. IFS plan layer activity then re-equilibrates, leaving only two active phonemes in each zone, now with a larger difference in their relative activation levels.
The cotemporal activation of cells coding /g/ and /o/ in the IFS choice layer (Figure 9C and D) drives activity in the model's SSM plan layer (Figure 9E). Multiple cells representing sensorimotor programs for syllables and phonemes become active, each partially matching the phonological sequence represented in the IFS choice layer. The most active of these SSM plan cells codes for the best matching syllable (in this case “go”). This most active syllable becomes active also in the SSM choice layer (Figure 9F). As soon as “go” becomes active in Figure 9F, its constituent phonemes in the IFS choice layer (Figure 9C and D) are suppressed. The resulting lack of activity in the IFS choice layer then enables the choice of the next CV syllable frame in the pre-SMA, allowing the model to begin preparing the syllable “di” (up to the stage of activating potential SSM matches in the SSM plan cells) while it is still producing the syllable “go” (compare Figure 9C–E with Figure 9F). The syllable “di” however, can only be chosen in the SSM choice layer (Figure 9F) upon the receipt of a nonspecific suppression signal from the articulatory control circuit. This inhibitory signal transiently quenches all activity in the SSM choice layer, which can be seen by the rapid decrease in activation of the cell coding for “go” in Figure 9F. Upon removal of this suppression signal, “di,” the most active SSM plan representation, is chosen in the SSM choice layer. This entire process iterates until there are no remaining active cells in the pre-SMA or IFS plan layers. It can be seen from Figure 9F that the syllable motor programs corresponding to the desired syllables receive sustained activation, one at a time, in the proper order. This is precisely what is required to interface GODIVA with the DIVA model, which can then be used to control a computer-simulated vocal tract to realize the desired acoustic output for each syllable.
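The iteration just described can be caricatured as a pair of positional queues consumed one CV frame at a time. This toy sketch (with “@” standing in for schwa) assumes every planned syllable has a stored whole-syllable SSM program, as in this first simulation:

```python
# Two positional queues in the IFS and a set of stored SSM syllable programs.
onsets = {"g": 0.9, "d": 0.7, "v": 0.5}   # zone 3 primacy gradient
nuclei = {"o": 0.9, "i": 0.7, "@": 0.5}   # zone 4 primacy gradient
ssm = {"go", "di", "v@"}                  # learned sensorimotor programs

def pop_max(queue):
    """Winner-take-all choice followed by suppression of the plan cell."""
    item = max(queue, key=queue.get)
    del queue[item]
    return item

produced = []
while onsets and nuclei:                  # one CV frame per iteration
    syllable = pop_max(onsets) + pop_max(nuclei)
    produced.append(syllable)             # assumes a whole-syllable SSM match
print(produced)  # -> ['go', 'di', 'v@']
```

What this caricature omits is the pipelining emphasized above: in the model, choice of the next syllable's phonemes proceeds while the previous program is still being articulated, gated by the suppression signal from the articulatory control circuit.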
Performance from Subsyllabic Targets
We have emphasized a speaker's ability to represent and to produce arbitrary syllable sequences that fall within the rules of her language. By planning in the phonological space encompassed by the IFS and pre-SMA categorical representations, the GODIVA model does not rely on having acquired phonetic or motor knowledge for every syllable it is capable of planning and/or producing. Instead, the model is capable of producing unfamiliar syllables by activating an appropriate sequence of subsyllabic phonetic programs. This point is addressed in a simulation that parallels the one described above but makes different assumptions about the initial state of the model's SSM.
Figure 10 shows the model again producing the syllable sequence “go.di.və,” but in this simulation the syllables “go” and “və” have each been removed from the model's SSM. The system thus no longer has the requisite motor programs for these syllables, and it must effect production using a sequence of smaller stored programs corresponding to the syllables' individual phonemes. Figure 10F shows that the model activates SSM choice cells coding the constituent phonemes, in the correct order, for the first and third syllables of the planned utterance. The SSM program associated with the second syllable, “di,” remains as a possible match in the SSM and is correctly chosen for production.
Figure 10C–E demonstrates the model's operation when individual syllables must be created from phonemic motor programs. A comparison of Figure 10C and D reveals that the IFS choice cell for the first phoneme (/g/ of the syllable “go”) is suppressed before suppression of the phoneme /o/. This is because the inhibition of IFS choice cells is dictated by which sensorimotor program is chosen in the SSM choice layer. Because no SSM cell matches “go” exactly, the best matching cell (as determined by the dot product of IFS choice layer activity with each SSM plan cell's stored synaptic weights; see Equation 13) codes for the phonetic representation of the phoneme /g/. This cell is activated in the SSM choice field (see Figure 10F) and inhibits only the phonological representation of /g/ in the IFS choice layer (Figure 10C). Because the phoneme /o/ remains active in IFS choice field zone 4 (Figure 10D), the preparation of the next syllable cannot yet begin. Instead, SSM plan cell activity (Figure 10E) is automatically regulated to code for degree of match to the remaining phonological representation in the IFS choice field (in this case the single phoneme /o/). Once the nonspecific quenching signal arrives at the SSM choice field to indicate impending completion of the motor program for /g/, the program for /o/ can be chosen. Finally, the entire IFS choice field (in both Zones 3 and 4; Figure 10C and D) is inactive, allowing the pre-SMA to choose the next syllable frame and continue the sequencing process.
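The dot-product matching step described above can be sketched with binary phoneme vectors. These are purely illustrative; in the model, the stored synaptic weights are set algorithmically rather than being binary:

```python
import numpy as np

def best_ssm_match(ifs_choice, ssm_weights):
    """Choose the sensorimotor program whose stored weight vector best
    matches the active IFS choice pattern (a dot product, as in Eq. 13)."""
    scores = {name: float(np.dot(ifs_choice, w))
              for name, w in ssm_weights.items()}
    return max(scores, key=scores.get)

# Toy phoneme basis [g, o, d, i]; the IFS choice layer holds /g/ and /o/.
ifs_choice = np.array([1.0, 1.0, 0.0, 0.0])
ssm_weights = {
    "go": np.array([1.0, 1.0, 0.0, 0.0]),  # whole-syllable program
    "g":  np.array([1.0, 0.0, 0.0, 0.0]),  # single-phoneme program
    "di": np.array([0.0, 0.0, 1.0, 1.0]),
}
print(best_ssm_match(ifs_choice, ssm_weights))  # -> 'go'
```

Deleting the “go” entry leaves the phoneme program “g” as the best match, which is exactly the fallback behavior shown in Figure 10: the syllable is then assembled from subsyllabic programs.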
Production of Phonological Errors
Figure 11 shows the time course of activity within Zone 4 of the IFS during another simulation of the intended sequence “go.di.və,” which results in a phonological error due to the addition of noise. In this example σ_f = 1.0 (see Equation 1), corresponding to a large Gaussian noise source giving independent input to each cell at each time step. Here, after the correct choice of the onset phoneme in the first syllable, noise is able to drive the IFS plan for /v/ (blue) to a higher activation level than the plan for /d/ (red) before its selection in the IFS choice layer. Despite the improper choice of the onset phoneme for the second syllable, the system continues to behave as usual, ultimately resulting in production of the syllable “vi” in place of the intended “di” (SSM cell activations not shown for brevity). Activation of /v/ in the IFS choice field causes the corresponding /v/ plan cell to be quenched in the IFS plan layer, leaving /d/ as the remaining significantly active onset plan. The /d/ plan is subsequently chosen and paired with the vowel /ə/, resulting in the completion of the syllable sequence “go.vi.də,” an example of a simple exchange error that obeys syllable position constraints as noted in previous studies (e.g., Shattuck-Hufnagel, 1979; MacKay, 1970). It should be noted that whether this error follows a word position constraint is determined merely by the placement of word boundaries (e.g., go diva vs. godi va). In the present treatment, we do not explicitly deal with word-level representations but suggest that substitutions across word boundaries will be produced, provided multiple word representations can be activated simultaneously in the IFS, which we strongly suggest is the case.
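The mechanism by which noise produces such order errors can be sketched with a noisy readout of a single queue; the gradient values and noise level below are arbitrary illustrative choices:

```python
import numpy as np

def noisy_readout(plan, sigma, rng):
    """CQ readout with zero-mean Gaussian noise added to the plan layer
    before each choice; the result may contain serial-order errors."""
    plan = np.array(plan, dtype=float)
    order = []
    while np.any(plan > 0):
        noisy = plan + sigma * rng.standard_normal(plan.shape)
        winner = int(np.argmax(np.where(plan > 0, noisy, -np.inf)))
        order.append(winner)
        plan[winner] = 0.0          # suppress the chosen plan cell
    return order

rng = np.random.default_rng(1)
trials = [noisy_readout([0.9, 0.7, 0.5], sigma=0.3, rng=rng)
          for _ in range(1000)]
error_fraction = sum(t != [0, 1, 2] for t in trials) / 1000
print(error_fraction)  # substantial once noise rivals the gradient spacing
```

As in the Figure 11 simulation, an early mis-selection does not halt readout: the remaining items are still produced, yielding exchange-type errors rather than outright failure.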
We have presented a neurobiologically based model that begins to describe how syllable sequences can be planned and produced by adult speakers. The above simulations demonstrate the model's performance for sequences of both well-learned and uncommon syllables. Although the model attempts to index syllable-sized performance units, its underlying planning representation consists of categorical phonemes and abstracted syllable frames. Although we have not focused on modeling the rich patterns in speech error data, these representations naturally account for the most typical slips of the tongue. GODIVA builds on much previous theoretical work, beginning with the seminal contributions of Lashley (1951). Lashley's ideas can be viewed as a precursor to CQ proposals (Bullock, 2004; Bullock & Rhodes, 2003; Houghton & Hartley, 1995; Houghton, 1990; Grossberg, 1978a, 1978b), which are used in multiple places within GODIVA. The encoding of serial order by a primacy gradient is a fundamental prediction of CQ-style models that has received experimental support (Averbeck et al., 2002, 2003). Such order-encoding activity gradients underlie the choice of the model's name (Gradient Order DIVA; GODIVA). The GODIVA modules specified here operate largely “above” the DIVA model in the speech production hierarchy; these modules act to select and activate the proper sensorimotor programs and to initiate the production of chosen speech sounds. Online motor control for the individual speech motor programs as well as their acquisition is the function of the DIVA model itself, which has been described in detail elsewhere (Guenther, 1994, 1995, 2006; Guenther et al., 1998, 2006). DIVA is also responsible for coarticulation, which can be absorbed into learned programs and can cross the boundaries of individual “chunks.”
That GODIVA was not developed in isolation, but rather as a continuation of an existing model of the neural circuits for speech and language production, is an important characteristic. Although future effort will be required to more fully integrate GODIVA with DIVA, here we have laid the groundwork for a comprehensive computational and biologically grounded treatment of speech sound planning and production. Each component of the GODIVA model, after previous efforts with DIVA (Guenther, 2006; Guenther et al., 2006), has hypothesized cortical and/or subcortical correlates. GODIVA thus appears to be the first treatment of the sequential organization and production of speech sounds that is described both formally and with detailed reference to known neuroanatomy and neurophysiology.
Representations of Serial Order
Although the CQ architecture plays a fundamental role in GODIVA, it is not the only ordinal representation used. The IFS representation combines elements of both CQ and positional models. Specifically, the minor axis of this two-dimensional map (Figure 4) is proposed to code for abstract serial position within a syllable. The inclusion of cells that code for both a particular phoneme and a particular syllable position may seem unappealing; the use of multiple “copies” of nodes coding for a single phoneme has often been criticized for failing to encapsulate any relationship between the nodes (e.g., Dell, 1986). In the proposed IFS representation that relationship is captured, to an extent, because “copies” of the same phoneme will always appear in close topographic proximity so long as the two-dimensional grid is mapped contiguously onto the cortical sheet. Additionally, and perhaps more importantly, this position-specific representation, which was motivated here and elsewhere by the strong syllable position constraint in speech errors, is computationally useful. Because IFS cells interact only within a positional zone, the IFS field can be thought to contain multiple queues. The capacity of a single queue (i.e., a planning layer) in CQ models is limited by noise; as additional elements are added, the difference between activation levels of any two elements to be performed successively is reduced. With the addition of zero-mean Gaussian noise, the probability of serial order errors at readout also becomes larger with additional elements. By dividing a phonological plan among multiple queues, the effect of noise is less destructive than it would be in a single queue, and the overall planning capacity is effectively increased. 
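The capacity argument above can be checked with a small Monte-Carlo sketch. The gradient range, noise level, and trial counts are arbitrary choices for illustration; the qualitative result is that dividing a fixed number of items between queues reduces order errors because the per-queue gradient spacing grows:

```python
import numpy as np

def order_error_rate(n_items, n_queues, sigma=0.15, trials=2000, seed=0):
    """Monte-Carlo serial-order error rate when n_items are divided
    evenly among n_queues, each a CQ gradient spanning a fixed
    activation range and perturbed by zero-mean Gaussian noise."""
    rng = np.random.default_rng(seed)
    per = n_items // n_queues
    levels = np.linspace(1.0, 0.5, per)   # spacing shrinks as more items
    errors = 0                            # share a single queue
    for _ in range(trials):
        for _q in range(n_queues):
            noisy = levels + sigma * rng.standard_normal(per)
            if list(np.argsort(-noisy)) != list(range(per)):
                errors += 1
                break                     # count at most one error per trial
    return errors / trials

single = order_error_rate(6, n_queues=1)
split = order_error_rate(6, n_queues=2)
print(single, split)  # splitting the plan across queues reduces errors
```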
Although we have implemented a system with effectively seven queues, based on a generic syllable template, we remain agnostic to the precise details of the actual phonological map but suggest that the general principles outlined here present a plausible representational scheme in view of the sparse existing evidence. The idea of serial position-specific representations, while useful and supported by behavioral data, is less appealing for modeling simple list memory, general movement planning, and many other sequential behaviors because the number of “slots” is often ambiguous and the number of possible items that must be available at any serial position can be quite large. The phonotactic constraints of a language, however, reduce the set of possible phonemes at any given position.
GODIVA also includes “serial chain” representations within the pre-SMA module. The inclusion of these specific chains does not, however, invite all of the same criticisms that pertain to associative chaining as an exhaustive theory of serial order. This is because the total number of sequences that are encoded in this manner is small, corresponding to the number of abstract structural syllable frames available to the speaker (as discussed above, just eight syllable types account for almost all productions), and the order of the elements within a particular frame type is fixed. This leads to some general guiding principles that appear useful in modeling hierarchical sequential behavior. When sequence production must be generative,7 associative chaining quickly becomes problematic, and the use of CQ-type activation gradients to encode order is preferred. When a small set of sequences becomes highly stereotyped, however, readout by serial or “synfire” chains (e.g., Pulvermüller, 1999, 2002; Abeles, 1991) can offer greater efficiency. The GODIVA model thus leverages these different representations when sequencing demands differ. Similarly, Dell (1986) speculated that a principal explanation for the appearance of speech errors in normal speech is the speaker's need for productivity or generativity. To produce novel linguistic sequences, it is necessary to “fill” slots in a sequence, and this allows the possibility of error due to difficulties with the “filling-in” mechanism(s). Dell argues that the set of possible phonemes is closed (after acquisition), whereas the set of possible phoneme combinations is open. CQ provides an efficient and physiologically plausible mechanism for representing this open set of combinations, which is problematic for other proposals. In this framework, it is intuitive that the units that slip during sequence production should be the units that form the bases in the CQ planning layer, in this case phonemes.
Development of the Speech and Language System
Our model is built around an assumed division of syllabic frames from phonemic content, but this system must be learned by speakers. MacNeilage (1998) has proposed that speech evolved the capability to program syllabic frames with phonological content elements and that every speaker learns to make use of this capacity during his or her own period of speech acquisition. Developing speakers follow a trajectory that may give rise to this factorization of frame and content. At approximately 7 months, children enter a canonical babbling stage in which they rhythmically alternate an open and a closed vocal tract configuration while phonating, resulting in repeated utterances such as “ba.ba.ba.ba.” MacNeilage and Davis (1990) have suggested that these productions represent “pure frames.” These reduplicated babbles dominate the early canonical babbling stage but are largely replaced by variegated babbling at around 10–13 months. This stage involves modifications of the consonant and vowel sounds in babbles, resulting in syllable strings such as “ba.gi.da.bu.” MacNeilage and Davis suggest that this stage may represent the earliest period of content development.
Locke (1997) presented a theory of neurolinguistic development involving four stages: (1) vocal learning, (2) utterance acquisition, (3) structure analysis and computation, and (4) integration and elaboration. Locke (p. 273) suggests that in Stage 2, “every utterance [children] know is an idiom, an irreducible and unalterable ‘figure of speech.’” This irreducibility was supported by the finding that very young children make far fewer slips of the tongue than adult speakers (Warren, 1986). It is only at the onset of Stage 3, around 18 to 20 months, that children gain the ability to “analyze” the structure of their utterances, recognizing, for example, recurring elements. This stage may provide the child with the representations needed for phonology, enabling generativity and the efficient storage of linguistic material. Importantly, at around 18 months of age, the rate of word acquisition in children may quadruple (Goldfield & Reznick, 1990). The timing of this explosion in a child's available vocabulary also roughly coincides with development in the perceptual system at approximately 19 months, at which time children can effectively discriminate the phonetic categories in their language (Werker & Pegg, 1992).
We take the position that the stages of speech acquisition up to and including variegated babbling are particularly important for tuning speech-motor mappings such as those described by the DIVA model of speech production (Guenther et al., 1998; Guenther, 1995). These stages also provide a “protosyllabary” of motor programs that are “purely motoric,” having little to no linguistic significance (Levelt et al., 1999). A later stage, perhaps Locke's (1997) Stage 3, leads to development of phonological representations that can become associated with the phonetic programs that realize those speech sounds. This allows the learning speaker to insert content items into common learned syllable frames, thus offering an explanation for the rapid increase in the vocabulary at this time. Furthermore, this representation of the common sound elements in a speaker's language should remain largely unchanged after learning and can be used by the adult speaker to interface both words and nonwords with a more plastic speech motor system. In a sense, this representation provides a basis for representing any utterance in the language. The GODIVA model describes the speech system after the development of this stage and leverages this basis to allow generative production of novel sound sequences.
Comparison with Other Computational Models
The WEAVER (and later WEAVER++) model (Levelt et al., 1999; Roelofs, 1997) is broadly a computer implementation of the Nijmegen model. In WEAVER, a selected morpheme activates nodes representing its constituent phonemes and a metrical structure, which specifies the number of syllables and stress pattern. The order of the activated phonemes is assumed to be encoded by links between the morpheme and the phoneme nodes; likewise, links between phoneme nodes and nodes that represent phonetic syllables (e.g., motor programs) are also “labeled” with positional information (indicating onset, nucleus, or coda). Although WEAVER(++) is an important formalization of an influential language production model and shares certain similarities with GODIVA, its focus is somewhat different. Whereas GODIVA makes specific proposals about representations for order and their instantiations in neural circuits, the WEAVER model's use of rule-based labeling of nodes and links is difficult to map onto potential brain mechanisms. The flow of information in the model is also not explicitly linked to regions and pathways in the cortex; thus, the ability to make inferences about neural function based on this model is limited. The GODIVA model is intended to bridge this gap between theoretical information processing and the neural substrates that implement such processes.
Dell's (1986) spreading activation model offers a formal explanation for various speech error data and represents the archetypal “frame-based” model. The proposal uses representations at several hierarchically organized linguistic levels such that nodes at one level are activated by nodes one level higher. Representations of the forthcoming utterance are built through a process of tagging the most active nodes at each level, largely in parallel. Nodes are then labeled with linguistic categories; in phonological encoding, for example, phonemes are labeled as onset, nucleus, or coda position in a syllable. A syllable frame, or ordered set of categories, is used to tag the most active nodes within the appropriate categories. The frame thus dictates not which elements are tagged but which are eligible to be tagged, much like in GODIVA. Dell's model formalized, within a network architecture, several theoretical proposals that had been advanced to explain speech errors, rendering the theories somewhat more biological but leaving possible anatomical substrates unspecified.
Hartley and Houghton (1996) described a CQ-based model that also exploits the frame-content division to explain learning and recall of unfamiliar nonwords in verbal STM. Individual syllables are represented in terms of their constituent phonemes and the “slots” that they use in a generic syllable template adapted from Fudge (1969). A pair of nodes is allocated for each syllable presented for recall, representing the syllable onset and rime, and temporary associative links are formed between these node pairs and both the appropriate syllable template slots and the phoneme content nodes for each syllable presented. During recall, an endogenous control signal imparts a gradient across syllable nodes, with the immediately forthcoming syllable receiving highest activation (see also Burgess & Hitch, 1992). The most active syllable pair is chosen for output and gives its learned input to the syllable template and phoneme nodes. As each syllable slot becomes activated (iteratively), phoneme nodes also become activated, with the most active nodes generally corresponding to phonemes from forthcoming syllables that occupy the same slot. The most active phoneme node is then chosen for “output,” its activity suppressed, and so on until the sequence is completed. The model advances earlier proposals and does not require multiple versions of each phoneme for different serial positions. This requirement is obviated by using a single-shot learning rule to make appropriate associations between position and phonemes; it is not clear, however, how such learning would be used in self-generated speech.
Vousden et al. (2000) presented a similarly motivated model that is derived from a previous proposal for serial recall (Brown et al., 2000). The model postulates the existence of a dynamic, semiperiodic control signal (the phonological context signal) that largely drives its operation. A major goal of Vousden et al. was to eliminate the necessity for syllable position-specific codes for phonemes. Although appealing, this “simplification” requires a rather complex control signal derived from a large set of oscillators. The signal is designed to have autocorrelation peaks at specific temporal delays, reflected by the pool of low-frequency oscillators. In the reported simulations, this periodicity occurs every three time steps, which allows each state in a single period to be associated with an onset, a nucleus, or a coda phoneme. Recall of a sequence depends on learning a large set of weight matrices that encode associations between the context signal and a matrix constituting the phoneme representation, which is potentially problematic for novel or self-generated sequences. At recall, the context vector is reset to its initial state and “played back,” resulting in a gradient of activations across phonemes for each successive contextual state. The typical CQ mechanisms are then used to allow sequence performance. Several concerns arise from the model's timing, association, and recall processes; see the critiques of this class of models in Lewandowsky, Brown, Wright, and Nimmo (2006) and Agam, Bullock, and Sekuler (2005).
One of the weaknesses of CQ theories concerns representing elements that repeat in a sequence. Because cells code for items and those cells' activity levels code for the items' serial order, it is problematic to represent the relative order of the same item occurring twice or more in the planned sequence. GODIVA uses perhaps the simplest strategy to handle repeating elements, by including multiple “copies” of each representative cell in the IFS and pre-SMA plan layers. With this addition, order is maintained simply by using a different copy of the requisite phoneme or frame cell for each occurrence of that phoneme or frame in the sequence. The sequence “pa.ta.ka” would thus require three different copies of the /a/ phoneme cell in Positional Zone 4 of the IFS. The implementation of this strategy requires some additional ad hoc machinery. Specifically, the model's external input, when targeting a phoneme cell in the IFS or frame cell in the pre-SMA, must activate a cell of that type that is currently inactive.
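The copy-allocation strategy amounts to simple bookkeeping, which the following sketch makes concrete. The code is hypothetical and not the model's actual input machinery: each incoming phoneme token within a positional zone is routed to the next currently inactive copy of its cell, so that repeated phonemes occupy distinct nodes whose activation gradient can still encode their relative order.

```python
from collections import defaultdict

def assign_copies(phonemes):
    """Assign each phoneme token to the next inactive 'copy' of its cell.

    Returns a list of (phoneme, copy_index) pairs; repeated tokens of the
    same phoneme receive successive copy indices, preserving their order.
    """
    next_copy = defaultdict(int)  # phoneme -> index of next free copy
    assignment = []
    for p in phonemes:
        assignment.append((p, next_copy[p]))
        next_copy[p] += 1
    return assignment

# The vowel slots of "pa.ta.ka": three tokens of /a/ in Positional Zone 4
# are assigned to three different copies of the /a/ cell.
vowel_plan = assign_copies(["a", "a", "a"])
# Distinct onsets, as in "pa.ta.ka", each use copy 0 of their own cell.
onset_plan = assign_copies(["p", "t", "k"])
```

Because each token claims a distinct cell, the usual CQ gradient-plus-suppression readout applies unchanged even when the planned sequence contains repeats.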
When entire syllables (performance units), on the other hand, are to be repeated by the model (e.g., “ta.ta.ta”), a different assumption is made. On the basis of RT data from Schönle, Hong, Benecke, and Conrad (1986) as well as fMRI observations described by Bohland and Guenther (2006), it appears that producing the same syllable N times is fundamentally different from producing N different syllables. We therefore assumed that planning a sequence such as “ta.ta.ta” only requires the phonological syllable “ta” to be represented in the complementary IFS and pre-SMA planning layers once. An additional mechanism is proposed to iterate the production portion of the circuit N times without the need to specify the phonological representation again each time.
A General Framework
Although the current proposal does not model higher level aspects of language production, the general architecture appears to have potential for reuse throughout the language system. The organization of BG circuits into largely parallel loops (Alexander & Crutcher, 1990; Alexander et al., 1986) offers one possible substrate for the cascaded processing stages that enable linguistic selections from competing alternatives; these selections (cf. choice layer activations) can then activate lower level representations through cortico-cortical pathways (as IFS choice cells, for example, activate SSM plan cells). Such loops may be nested to account for various levels of language production (e.g., Ward, 1994; Garrett, 1975). The GODIVA model architecture also offers an account for how learned structural patterns can be combined with an alphabet of “content” items in a biologically realistic circuit. In the same way that an abstract CV structure combines with representative phoneme units, syntactical structure might, for instance, combine with word units from different grammatical categories (cf. different positional zones). There is evidence that BG loops might indeed take part in selection mechanisms for higher level aspects of language. Damage to portions of the caudate can give rise to semantic paraphasia (Kreisler et al., 2000), a condition marked by the wrongful selection of words, in which the selected word has meaning related to the target word. Crinion et al. (2006) have also suggested that the caudate may subserve selection of words from a bilingual lexicon.
Relevance in the Study of Communication Disorders
Many authors have stressed the usefulness of comprehensive models in the study of communication disorders (e.g., McNeil et al., 2004; Ziegler, 2002; Van der Merwe, 1997). Current speech production models have largely failed to shed light on disorders such as apraxia of speech (AOS) because “theories of AOS encounter a dilemma in that they begin where the most powerful models of movement control end and end where most cognitive neurolinguistic models begin” (Ziegler, 2002). The GODIVA model is the first step in an attempt to bring the DIVA model (the “model of movement control”) into a broader neurolinguistic setting. In so doing, the hope is that communication disorders such as AOS and stuttering can be better understood in terms of pathological mechanisms within the model that can be localized to brain regions through experimentation. As an example, in GODIVA, the symptoms of AOS, particularly groping and difficulty reaching appropriate articulations, might be explained by at least two mechanistic accounts. The first possibility is that the motor programs for desired sounds are themselves damaged. In the model, this amounts to damage to the SSM (lateral premotor cortex/BA44) or its projections to the motor cortex. An alternative explanation could be that these sensorimotor plans are intact, but the mechanism for selecting the appropriate plan is defective. This would occur in the model with damage to connections between the IFS choice layer and the SSM. A major focus of future research within this modeling framework should be the consideration of speech disorders.
Expected Effects of Model “Lesions”
One of the major reasons for hypothesizing specific neural correlates for model components (cf. existing psycholinguistic models without such specificity) is to make predictions about the effects that focal lesions might have on normal speech function. Although detailed simulations will need to be presented in future work, we can offer some preliminary predictions here. First, specific lesions to the left lateral pFC (in the area of IFS) will likely impact phonological encoding at the phoneme level. This may result in phonemic paraphasias, including substitutions, anticipations, and perseverations, which are observed in some Broca's aphasics. Because choice of syllable frames in the pre-SMA “starts” the production process, damage here could result in reductions in self-initiated speech (Jonas, 1981, 1987) but also may result in “frame deficiencies”—perhaps taking the form of reducing complex frame types to simpler ones. Damage to the BG planning loop may impact selection and notably the timing of selection processes in phonological encoding, which is consistent with some observations (Kreisler et al., 2000; Pickett et al., 1998). Finally, damage to the SSM or the interface between the IFS choice field and the SSM (e.g., damage to cortico-cortical projections) should lead to problems in realizing a phonetic plan or diminished ability to choose a phonetic plan; these deficits would be in line with observations in patients with AOS (Hillis et al., 2004).
Other Experimental Predictions
Any model of a system as complex as that considered here will eventually be found to have significant flaws. One of the most useful aspects of any model that can be simulated under various conditions is to generate predictions that can be tested experimentally. Through the generation of testable predictions, the model may be proven invalid, but new proposals will arise from this knowledge that further our understanding of the system. The GODIVA model makes many such predictions. For example, GODIVA predicts that the set of IFS choice layer to SSM plan layer connections implements a selection process whereby the strength of input to an SSM plan cell depends on how strongly the speech sound corresponding to that cell matches the currently planned syllable in IFS. This leads to the prediction that when many cells in the SSM code for sounds that partially match the syllable planned in IFS, the overall activation of the SSM will be larger than when there are few partial matches. More broadly speaking, planning and producing syllables with dense phonological neighborhoods are predicted to result in greater activation of the SSM than planning and producing syllables with sparse neighborhoods. This type of prediction is readily testable using fMRI or PET. A continued program of model development combined with targeted experimentation will be critical to better understanding the speech system.
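The predicted neighborhood effect can be made concrete with a small sketch. Here match strength is reduced, purely for illustration, to position-wise phoneme overlap, and the total SSM plan-layer input is the sum of match strengths across a toy syllable inventory; the inventory and the scoring rule are assumptions for this example, not the model's actual learned weights.

```python
def match_strength(planned, stored):
    """Fraction of positions at which a stored sensorimotor program's
    phonemes match the planned syllable (both as equal-length tuples)."""
    return sum(a == b for a, b in zip(planned, stored)) / len(planned)

def ssm_activation(planned, inventory):
    # Total plan-layer input: each stored program receives input in
    # proportion to how strongly it matches the planned syllable.
    return sum(match_strength(planned, s) for s in inventory)

# Toy inventory of stored CV syllable programs.
inventory = [("b", "a"), ("b", "i"), ("d", "a"), ("d", "i"), ("g", "u")]

dense = ssm_activation(("b", "a"), inventory)   # many partial matches
sparse = ssm_activation(("g", "u"), inventory)  # few partial matches
```

Under this scoring rule, a planned syllable from a dense phonological neighborhood recruits partial activation across many SSM cells, yielding greater total SSM activation than a syllable from a sparse neighborhood, the pattern the model predicts should be detectable with fMRI or PET.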
The model and the conceptual framework we have presented here should be viewed as a preliminary proposal that will require considerable expansion to fully treat the myriad issues involved in planning and producing fluent speech. These include expansion of the model to address processing at the level of words and higher, which we believe can be incorporated gracefully in the existing framework. In future work, we plan to more fully address the rich patterns of observed speaking errors in normal and aphasic speakers, which may require further examination of the proposed syllable “template” and set of available syllable frames. Further, it is of interest to examine more closely how the treatment of speech sequences (at the level of syllables or multisyllabic words) changes as they progress from completely novel, to familiar, to highly automatized, while at every stage remaining open to modulation by the speaker in terms of rate, emphasis, or intonation. We plan to explore the probable role of the cerebellum in phonological and phonetic processes (e.g., Ackermann, 2008), including a role in on-line sequencing, for example, by fast parallel loading of phonological memory buffers (cf. Rhodes & Bullock, 2002) and in the coordination and regulation of precise temporal articulation patterns. The cerebellum may also be involved in the generation of prosody (Spencer & Slocomb, 2007), along with other structures, and future instantiations of the GODIVA model should strive to explain how prosody and stress can be encoded at the phonological and phonetic levels.
The authors thank Jason Tourville, Satrajit Ghosh, and Oren Civier for valuable comments. Support was provided by NIH R01 DC007683 and NIH R01 DC002852 (F. H. Guenther, PI) and NSF SBE-0354378 (S. Grossberg, PI).
Reprint requests should be sent to Jason W. Bohland, Boston University, Sargent College of Health & Rehabilitation Sciences, Department of Health Sciences, 635 Commonwealth Avenue, Boston, MA 02215, or via e-mail: firstname.lastname@example.org.
In DIVA, a speech sound can be a phoneme, a syllable, an entire word, etc. For the purposes of the current model, we made the simplifying assumption that speech sounds comprise syllables and individual phonemes.
An analogous division of a linguistic plan into frames and content can easily be envisaged at higher linguistic levels, but treatment of such issues is beyond the scope of this article.
This notation is used throughout to indicate syllable type. C indicates consonant, V vowel. For example, a CCCV syllable (e.g., “stra”) is composed of three consonants followed by a vowel.
The current model treats successive consonants in an onset or coda cluster as independent; however, error data support the notion that consonant clusters may be additionally bound to one another (though not completely). Future work will elaborate the syllable frame specification to account for such data.
Due to the phonotactic rules of English, not all phonemes are eligible at all positions. For simplicity, this notion was not explicitly incorporated in the model, but its implications suggest future work.
The term word is used loosely to indicate a portion of a planned utterance that is at least as large as a syllable. This could represent a real word, a morpheme, or a pseudoword, for example.
Here, generative is used to mean that, for the behavior in question, the generation of novel and perhaps arbitrary sequences is crucial. In speech, combining words or syllables into novel sequences is commonplace.