Recurrent Neural Networks in Linguistic Theory: Revisiting Pinker and Prince (1988) and the Past Tense Debate

Can advances in NLP help advance cognitive modeling? We examine the role of artificial neural networks, the current state of the art in many common NLP tasks, by returning to a classic case study. In 1986, Rumelhart and McClelland famously introduced a neural architecture that learned to transduce English verb stems to their past tense forms. Shortly thereafter in 1988, Pinker and Prince presented a comprehensive rebuttal of many of Rumelhart and McClelland’s claims. Much of the force of their attack centered on the empirical inadequacy of the Rumelhart and McClelland model. Today, however, that model is severely outmoded. We show that the Encoder-Decoder network architectures used in modern NLP systems obviate most of Pinker and Prince’s criticisms without requiring any simplification of the past tense mapping problem. We suggest that the empirical performance of modern networks warrants a reexamination of their utility in linguistic and cognitive modeling.


Introduction
In their famous 1986 opus, Rumelhart and McClelland (R&M) describe a neural network capable of transducing English verb stems to their past tense. The strong cognitive claims in the article fomented a veritable brouhaha in the linguistics community and eventually led to the highly influential rebuttal of Pinker and Prince (1988) (P&P). P&P highlighted the extremely poor empirical performance of the R&M model, and pointed out a number of theoretical issues with the model, which they suggested would apply to any neural network approach, branded connectionist at the time. Their critique was so successful that many linguists and cognitive scientists to this day do not consider neural networks a viable approach to modeling linguistic data and human cognition.
In the field of natural language processing (NLP), however, neural networks have experienced a renaissance. With novel architectures, large new data sets available for training, and access to extensive computational resources, neural networks now constitute the state of the art in many NLP tasks. However, NLP as a discipline has a distinct practical bent and more often concerns itself with the large-scale engineering applications of language technologies. As such, the field's findings are not always considered relevant to the scientific study of language (i.e., the field of linguistics). Recent work, however, has indicated that this perception is changing, with researchers, for example, probing the ability of neural networks to learn syntactic dependencies like subject-verb agreement (Linzen et al., 2016).
Moreover, in the domains of morphology and phonology, both NLP practitioners and linguists have considered virtually identical problems, seemingly unbeknownst to each other. For example, both computational and theoretical morphologists are concerned with how different inflected forms in the lexicon are related and how one can learn to generate such inflections from data. Indeed, the original R&M network focuses on such a generation task, namely, generating English past tense forms from their stems. R&M's network, however, was severely limited and did not generalize correctly to held-out data. In contrast, state-of-the-art morphological generation networks used in NLP, built from the modern evolution of recurrent neural networks (RNNs) explored by Elman (1990) and others, solve the same problem almost perfectly (Cotterell et al., 2016). This level of performance on a cognitively relevant problem suggests that it is time to consider further incorporating network modeling into the study of linguistics and cognitive science.
Crucially, we wish to sidestep one of the issues that framed the original debate between P&P and R&M: whether or not neural models learn and use "rules." From our perspective, any system that picks up systematic, predictable patterns in data may be referred to as rule-governed. We focus instead on an empirical assessment of the ability of a modern state-of-the-art neural architecture to learn linguistic patterns, asking the following questions: (i) Does the learner induce the full set of correct generalizations about the data? Given a range of novel inputs, to what extent does it apply the correct transformations to them? (ii) Does the behavior of the learner mimic humans? Are the errors human-like?
Transactions of the Association for Computational Linguistics, vol. 6, pp. 651-666, 2018. Action Editor: Sebastian Padó. Submission batch: 2/2018; Revision batch: 5/2018; Published 12/2018. © 2018 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
In this work, we run new experiments examining the ability of the Encoder-Decoder architecture developed for machine translation (Bahdanau et al., 2014; Sutskever et al., 2014) to learn the English past tense. The results suggest that modern nets absolutely meet the first criterion, and often meet the second. Furthermore, they do this given limited prior knowledge of linguistic structure: The networks we consider do not have phonological features built into them and must instead learn their own representations for input phonemes. The design and performance of these networks invalidate many of the criticisms in Pinker and Prince (1988). We contend that, given the gains displayed in this case study, which is characteristic of problems in the morpho-phonological domain, researchers across linguistics and cognitive science should consider evaluating modern neural architectures as part of their modeling toolbox.
This paper is structured as follows. Section 2 describes the problem under consideration, the English past tense. Section 3 lays out the original Rumelhart and McClelland model from 1986 in modern machine-learning parlance, and compares it to a state-of-the-art Encoder-Decoder architecture. A historical perspective on alternative approaches to modeling, both neural and non-neural, is provided in Section 4. The empirical performance of the Encoder-Decoder architecture is evaluated in Section 5. Section 6 provides a summary of which of Pinker and Prince's original criticisms have effectively been resolved, and which ones still require further consideration. Concluding remarks follow.

The English Past Tense
Many languages mark words with syntactico-semantic distinctions. For instance, English marks the distinction between present and past tense verbs, for example, walk and walked. English verbal morphology is relatively impoverished, distinguishing maximally five forms for the copula to be and only three forms for most verbs. In this work, we consider learning to conjugate the English verb forms, rendered as phonological strings. As it is the focus of the original R&M study, we focus primarily on English past tense formation. Both regular and irregular patterning exist in English. Orthographically, the canonical regular suffix is -ed, which, phonologically, may be rendered as one of three phonological strings: [-ɪd], [-d], or [-t]. The choice among the three is deterministic, depending only on the phonological properties of the previous segment. English selects [-ɪd] where the previous phoneme is a [t] or [d]; otherwise the suffix agrees in voicing with the previous segment, surfacing as [-d] after a voiced segment and as [-t] after a voiceless one. Irregular verbs are either suppletive (e.g., go → went) or exist in sub-regular islands defined by processes like ablaut (e.g., sing → sang) that may contain several verbs (Nelson, 2010); see Table 1.
Single vs. Dual Route. A frequently discussed cognitive aspect of past tense processing concerns whether or not irregular forms have their own processing pipeline in the brain. Pinker and Prince (1988) proposed separate modules for regular and irregular verbs; regular verbs go through a general, rule-governed transduction mechanism, and exceptional irregulars are produced via simple memory look-up.¹ While some studies (e.g., Marslen-Wilson and Tyler, 1997; Ullman et al., 1997) provide corroborating evidence from speakers with selective impairments to regular or irregular verb production, others have called these results into doubt (Stockall and Marantz, 2006). From the perspective of this paper, a complete model of the English past tense should cover both regular and irregular transformations. The neural network approaches we advocate for achieve this goal, but do not clearly fall into either the single- or dual-route category: the internal computations performed by each network remain opaque, so we cannot at present say whether two separable computation paths are present.
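The deterministic choice among the three regular suffixes can be sketched in a few lines. This is only an illustration: the phoneme inventory below is a simplified assumption (not the inventory used in the paper), stems are assumed to be lists of phoneme symbols, and the function names are hypothetical.

```python
# Simplified, illustrative phoneme classes (an assumption, not the paper's inventory).
VOICELESS = {"p", "t", "k", "f", "θ", "s", "ʃ", "tʃ", "h"}
ALVEOLAR_STOPS = {"t", "d"}


def regular_past_suffix(stem):
    """Return the regular past-tense suffix for a stem given as a list of phonemes."""
    if not stem:
        raise ValueError("stem must contain at least one phoneme")
    last = stem[-1]
    if last in ALVEOLAR_STOPS:
        return ["ɪ", "d"]  # stem ends in [t] or [d], e.g. wait -> waited
    if last in VOICELESS:
        return ["t"]       # voiceless final segment, e.g. walk -> walked
    return ["d"]           # voiced consonant or vowel, e.g. wug -> wugged


def regular_past(stem):
    """Concatenate the stem with its regular suffix."""
    return list(stem) + regular_past_suffix(stem)
```

Because the rule inspects only the final segment, it says nothing about irregular verbs, which is exactly the part of the mapping that makes the learning problem interesting.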

Acquisition of the Past Tense
The English past tense is of considerable theoretical interest because of the now well-studied acquisition patterns of children. As first shown by Berko (1958) in the so-called wug-test, knowledge of English morphology cannot be attributed solely to memorization. Indeed, both adults and children are fully capable of generalizing the patterns to novel words (e.g., [wʌg] → [wʌgd] (wug → wugged)). During acquisition, only a few types of errors are common; children rarely blend regular and irregular forms. For example, the past tense of come is either produced as comed or came, but rarely camed (Pinker, 1999).
Acquisition Patterns for Irregular Verbs. It is widely claimed that children learning the past tense forms of irregular verbs exhibit a "U-shaped" learning curve. At first, they correctly conjugate irregular forms (e.g., come → came), then they regress during a period of overregularization, producing the past tense as comed as they acquire the general past tense formation. Finally, they learn to produce both the regular and irregular forms. Plunkett and Marchman, however, observed a more nuanced form of this behavior. Rather than a macro U-shaped learning process that applies globally and uniformly to all irregulars, they noted that many irregulars oscillate between correct and overregularized productions (Marchman, 1988). These oscillations, which Plunkett and Marchman refer to as a micro U-shape, further apply at different rates for different verbs (Plunkett and Marchman, 1991). Interestingly, although the exact pattern of irregular acquisition may be disputed, children rarely overirregularize, that is, misconjugate a regular verb as if it were irregular, such as ping → pang.

1986 vs. Today
In this section, we compare the original R&M architecture from 1986 to today's state-of-the-art neural architecture for morphological transduction, the Encoder-Decoder model.
Rumelhart and McClelland (1986)
For many linguists, the face of neural networks to this day remains the work of R&M. Here, we describe in detail their original architecture, using modern machine learning parlance whenever possible. Fundamentally, R&M were interested in designing a sequence-to-sequence network for variable-length input using a small feed-forward network. From an NLP perspective, this work constitutes one of the first attempts to design a network for a task reminiscent of popular NLP tasks today that require variable-length input (e.g., part-of-speech tagging, parsing, and generation).
Wickelphones and Wickelfeatures. A Wickelphone is a phoneme together with its immediate left and right neighbors, that is, a phoneme trigram; a Wickelfeature is the analogous triple of phonological features. We can describe R&M's representations using the modern linear-algebraic notation standard among researchers in neural networks. First, we assume that the language under consideration contains a fixed set of phonemes $\Sigma$, plus an edge symbol # marking the beginning and end of words. Then, we construct the set of all Wickelphones $\Phi$ and the set of all Wickelfeatures $F$ by enumeration. The first layer of the R&M neural network consists of two deterministic functions: (i) $\phi : \Sigma^* \to B^{|\Phi|}$ and (ii) $f : B^{|\Phi|} \to B^{|F|}$, where we define $B = \{-1, 1\}$. The first function $\phi$ maps a phoneme string to the set of Wickelphones that fire, as it were, on that string. The output subset of $\Phi$ may be represented by a binary vector of length $|\Phi|$, where a 1 means that the Wickelphone appears in the string and a −1 that it does not.² The second function $f$ maps a set of Wickelphones to its corresponding set of Wickelfeatures.
Pattern Associator Network. Here we define the complete network of R&M. We denote strings of phonemes as $x \in \Sigma^*$, where $x_i$ is the $i$th phoneme in a string. Given source and target phoneme strings $x^{(i)}, y^{(i)} \in \Sigma^*$, R&M optimize the following objective, a sum over the individual losses for each of the $i = 1, \ldots, N$ training items:
$$\sum_{i=1}^{N} \left\| \max\left\{0,\; -\pi(y^{(i)}) \odot \left(W \pi(x^{(i)}) + b\right)\right\} \right\|_1$$
where the max is taken point-wise, $\odot$ is point-wise multiplication, $W \in \mathbb{R}^{|F| \times |F|}$ is a weight matrix, $b \in \mathbb{R}^{|F|}$ is a bias term, and $\pi = f \circ \phi$ is the composition of the Wickelphone and Wickelfeature encoding functions, that is, $\pi(x) = f(\phi(x))$. Using modern terminology, the architecture is a linear model for a multi-label classification problem (Tsoumakas and Katakis, 2006): The goal is to predict the set of Wickelfeatures in the target form $y^{(i)}$ given the input form $x^{(i)}$ using a point-wise perceptron loss (hinge loss without a margin); that is, a binary perceptron predicts each feature independently, but there is one set of parameters $\{W, b\}$. The total loss incurred is the sum of the per-feature losses, hence the use of the $L_1$ norm. The model is trained with stochastic sub-gradient descent (the perceptron update rule) (Rosenblatt, 1958; Bertsekas, 2015) with a fixed learning rate.³ Later work augmented the architecture with multiple layers and nonlinearities (Marcus, 2001, Table 3.3).
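As a rough illustration of the objective just described, the following sketch computes a point-wise perceptron loss and applies one perceptron update to a tiny linear model. It is a toy under stated assumptions: the ±1 vectors stand in for $\pi(x)$ and $\pi(y)$, the function names are hypothetical, and nothing here reproduces R&M's actual implementation.

```python
def perceptron_loss(W, b, x_feats, y_feats):
    """Sum of per-feature perceptron losses max(0, -y_j * (W x + b)_j).

    x_feats and y_feats are +1/-1 vectors standing in for pi(x) and pi(y);
    W is a list of weight rows and b a list of biases (one per output feature).
    """
    total = 0.0
    for row, bias, target in zip(W, b, y_feats):
        score = sum(w * x for w, x in zip(row, x_feats)) + bias
        total += max(0.0, -target * score)
    return total


def perceptron_step(W, b, x_feats, y_feats, lr=1.0):
    """One in-place perceptron update with a fixed learning rate.

    Only output features whose score has the wrong sign (or is exactly zero)
    are updated; every output feature is treated as an independent perceptron.
    """
    for j, target in enumerate(y_feats):
        score = sum(w * x for w, x in zip(W[j], x_feats)) + b[j]
        if target * score <= 0:
            W[j] = [w + lr * target * x for w, x in zip(W[j], x_feats)]
            b[j] += lr * target
```

Summing the per-feature terms is what the $L_1$ norm in the objective expresses, since every term is non-negative.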
Decoding. Decoding the R&M network necessitates solving a tricky optimization problem. Given an input phoneme string $x^{(i)}$, we must find the corresponding $y \in \Sigma^*$ that minimizes
$$\left\| \pi(y) - \mathrm{threshold}\left(W \pi(x^{(i)}) + b\right) \right\|_0$$
where threshold is a step function that maps all non-positive reals to −1 and all positive reals to 1. In other words, we seek the phoneme string $y$ that shares the most features with the maximum a posteriori decoded binary vector. This problem is intractable, and so R&M provide an approximation. For each test stem, they hand-selected a set of likely past-tense candidate forms (for example, good candidates for the past tense of break would be {break, broke, brake, braked}) and chose the form with Wickelfeatures closest to the network's output. This manual approximate decoding procedure is not intended to be cognitively plausible.
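The candidate-based approximation can be sketched as a nearest-neighbor search under Hamming distance. This is a minimal sketch, assuming each hand-selected candidate has already been mapped to a ±1 feature vector; the function names and any vectors passed in are hypothetical, not R&M's Wickelfeatures.

```python
def threshold(scores):
    """Step function: non-positive scores map to -1, positive scores to +1."""
    return [1 if s > 0 else -1 for s in scores]


def hamming(u, v):
    """Number of positions at which two +1/-1 vectors disagree."""
    return sum(1 for a, b in zip(u, v) if a != b)


def decode(scores, candidates):
    """Pick the candidate whose features are closest to the thresholded output.

    `candidates` maps each candidate form to its +1/-1 feature vector,
    standing in for pi(y); `scores` stands in for W pi(x) + b.
    """
    target = threshold(scores)
    return min(candidates, key=lambda form: hamming(candidates[form], target))
```

The search is only as good as the candidate set, which is why the procedure depends on manual selection rather than on the network itself.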
Architectural Limitations. R&M used Wickelphones and Wickelfeatures in order to help with generalization and limit their network to a tractable size. However, this came at a significant cost to the network's ability to represent unique strings. The encoding is lossy: Two words may have the same set of Wickelphones or features. The easiest way to see this shortcoming is to consider morphological reduplication, which is common in many of the world's languages. P&P provide an example from the Australian language Oykangand, which distinguishes between algal 'straight' and algalgal 'ramrod straight'; both of these strings have the identical Wickelphone set {[#al], [alg], [lga], [gal], [al#]}. Conversely, phonologically related words such as [slɪt] and [sɪlt] have disjoint Wickelphone sets, {[#sl], [slɪ], [lɪt], [ɪt#]} and {[#sɪ], [sɪl], [ɪlt], [lt#]}, respectively. These two words differ only by an instance of metathesis, or swapping the order of nearby sounds. The use of Wickelphone representations imposes the strong claim that they have nothing in common phonologically, despite sharing all phonemes. P&P suggest this is unlikely to be the case. As one point of evidence, metathesis of the kind that differentiates [slɪt] and [sɪlt] is a common diachronic change. In English, for example, [horse] evolved from [hross], and [bird] from [brid] (Jespersen, 1942).
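Both limitations are easy to reproduce. The sketch below enumerates Wickelphones as edge-padded phoneme trigrams; treating each letter of algal as one phoneme is a simplifying assumption, and the function name is hypothetical.

```python
def wickelphones(phonemes, edge="#"):
    """Return the set of Wickelphones (phoneme trigrams) of an edge-padded string."""
    padded = [edge] + list(phonemes) + [edge]
    return {tuple(padded[i:i + 3]) for i in range(len(padded) - 2)}


# Reduplication: two different strings collapse onto one Wickelphone set.
same = wickelphones("algal") == wickelphones("algalgal")

# Metathesis: the same phonemes in a different order share no Wickelphones.
slit = wickelphones(["s", "l", "ɪ", "t"])
silt = wickelphones(["s", "ɪ", "l", "t"])
overlap = slit & silt
```

Here `same` evaluates to `True` and `overlap` is the empty set, so the encoding is simultaneously too coarse (it conflates distinct words) and too fine (it hides obvious similarity).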

Encoder-Decoder Architectures
The NLP community has recently developed an analogue to the past-tense generation task originally considered by R&M: morphological paradigm completion (Durrett and DeNero, 2013; Ahlberg et al., 2015; Cotterell et al., 2015; Nicolai et al., 2015; Faruqui et al., 2016). The goal is to train a model capable of mapping the lemma (stem in the case of English) to each form in the paradigm. In the case of English, the goal would be to map a lemma, for example, walk, to its past-tense form walked as well as its gerund and third person present singular, walking and walks, respectively. This task generalizes the R&M setting in that it requires learning more mappings than simply lemma to past tense.
By definition, any system that solves the more general morphological paradigm completion task must also be able to solve the original R&M task. As we wish to highlight the strongest currently available alternative to R&M, we focus on the state of the art in morphological paradigm completion: the Encoder-Decoder network architecture (ED) (Cotterell et al., 2016). This architecture consists of two RNNs coupled together by an attention mechanism. The encoder RNN reads each symbol in the input string one at a time, first assigning it a unique embedding, then processing that embedding to produce a representation of the phoneme given the rest of the phonemes in the string. The decoder RNN produces a sequence of output phonemes one at a time, using the attention mechanism to peek back at the encoder states as needed. Decoding ends when a halt symbol is output. Formally, the ED architecture encodes the probability distribution over forms
$$p(y \mid x) = \prod_{i=1}^{N} p(y_i \mid y_1, \ldots, y_{i-1}, c_i) = \prod_{i=1}^{N} g(y_{i-1}, s_i, c_i)$$
where $g$ is a non-linear function (in our case, a multi-layer perceptron), $s_i$ is the hidden state of the decoder RNN, $y = (y_1, \ldots, y_N)$ is the output sequence (a sequence of $N = |y|$ characters), and $c_i$ is an attention-weighted sum of the encoder RNN hidden states $h_k$, using the attention weights $\alpha_k(s_{i-1})$ that are computed based on the previous decoder hidden state:
$$c_i = \sum_{k=1}^{|x|} \alpha_k(s_{i-1})\, h_k$$
In contrast to the R&M network, the ED network optimizes the log-likelihood of the training data, that is, $\sum_{i=1}^{M} \log p(y^{(i)} \mid x^{(i)})$ for $i = 1, \ldots, M$ training items. We refer the reader to Bahdanau et al. (2014) for the complete architectural specification of the specific ED model we apply in this paper.
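The attention-weighted sum over encoder states can be illustrated with plain Python lists. This is a sketch under a loud assumption: the weights are obtained from dot-product scores purely for brevity, whereas Bahdanau et al. (2014) score each encoder state with a small feed-forward network. The function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(prev_decoder_state, encoder_states):
    """Return (weights, context) for one decoding step.

    Assumption: scores are dot products between the previous decoder state
    and each encoder state; the paper's model uses a learned scoring network.
    """
    scores = [sum(s * h for s, h in zip(prev_decoder_state, h_k))
              for h_k in encoder_states]
    weights = softmax(scores)  # one non-negative weight per input position
    dim = len(encoder_states[0])
    context = [sum(w * h_k[d] for w, h_k in zip(weights, encoder_states))
               for d in range(dim)]
    return weights, context
```

Because the weights are recomputed at every output position, the decoder can attend to different input phonemes while emitting the stem and while emitting the suffix.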
Theoretical Improvements. Although there are a number of possible variants of the ED architecture (Luong et al., 2015),⁴ they all share several critical features that make up for many of the theoretical shortcomings of the feed-forward R&M model. The encoder reads in each phoneme sequentially, preserving identity and order and allowing any string of arbitrary length to receive a unique representation. Despite this encoding, a flexible notion of string similarity is also maintained, as the ED model learns a fixed embedding for each phoneme that forms part of the representation of all strings that share the phoneme. When the network encodes [sɪlt] and [slɪt], it uses the same phoneme embeddings; only the order changes. Finally, the decoder permits sampling and scoring arbitrary-length, fully formed strings in polynomial time (forward sampling), so there is no need to determine which string a non-unique set of Wickelfeatures represents. However, we note that decoding the 1-best string from a sequence-to-sequence model is likely NP-hard (1-best string decoding is even hard for weighted finite-state transducers [Goodman, 1998]).
Multi-Task Capability. A single ED model is easily adapted to multi-task learning (Caruana, 1997; Collobert et al., 2011), where each task is a single transduction (e.g., stem → past). Note that R&M would need a separate network for each transduction (e.g., stem → gerund and stem → past participle). In fact, the current state of the art in NLP is to learn all morphological transductions in a paradigm jointly. The key insight is to construct a single network $p(y \mid x, t)$ to predict all inflections, marking the transformation in the input string; that is, we feed the network the string "w a l k <gerund>" as input, augmenting the alphabet $\Sigma$ to include morphological descriptors. We refer the reader to Kann and Schütze (2016) for details.
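The input augmentation amounts to appending one extra symbol to the phoneme sequence and giving that symbol its own vocabulary entry. The sketch below illustrates the idea; the helper names and the tag spelling beyond `<gerund>` are assumptions made for this example.

```python
def encode_input(phonemes, tag):
    """Append a morphological tag symbol to the phoneme sequence."""
    return list(phonemes) + ["<" + tag + ">"]


def build_vocab(examples):
    """Map every input symbol (phonemes and tags alike) to an integer id."""
    vocab = {}
    for symbols in examples:
        for sym in symbols:
            vocab.setdefault(sym, len(vocab))
    return vocab


# One stem, two tasks: the network sees different inputs for each transduction.
gerund_input = encode_input(["w", "a", "l", "k"], "gerund")
past_input = encode_input(["w", "a", "l", "k"], "past")
vocab = build_vocab([gerund_input, past_input])
```

Since tags live in the same alphabet as phonemes, a single set of encoder and decoder parameters is shared across all transductions.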

Related Work
In this section, we first describe direct follow-ups to the original 1986 R&M model, using various neural architectures. Then we review competing non-neural systems of context-sensitive rewrite rules in the style of the Sound Pattern of English (SPE) (Halle and Chomsky, 1968), as favored by Pinker and Prince.
Follow-ups to Rumelhart and McClelland (1986) Over the Years
Following R&M, a cottage industry devoted to cognitively plausible connectionist models of inflection learning sprouted in the linguistics and cognitive science literature. We provide a summary listing of the various proposals, along with broad-brush comparisons, in Table 2. Although many of the approaches apply more modern feed-forward architectures than R&M, introducing multiple layers connected by nonlinear transformations, most continue to use feed-forward architectures with limited ability to deal with variable-length inputs and outputs and remain unable to produce and assign probability to arbitrary output strings.
MacWhinney and Leinbach (1991), Plunkett and Marchman (1991, 1993), and Plunkett and Juola (1999) map phonological strings to phonological strings using feed-forward networks, but rather than turning to Wickelphones to imprecisely represent strings of any length, they use fixed-size input and output templates, with units representing each possible symbol at each input and output position. For example, Plunkett and Marchman (1991, 1993) simplify the past-tense mapping problem by only considering a language of artificially generated words of exactly three syllables and a limited set of constructed past-tense formation patterns. MacWhinney and Leinbach (1991) and Plunkett and Juola (1999) additionally modify the input template to include extra units marking particular transformations (e.g., past or gerund), enabling their network to learn multiple mappings.
Some proposals simplify the problem even further, mapping fixed-size inputs into a small finite set of categories, solving a classification problem rather than a transduction problem. Nakisa and Hahn (1996) and Hahn and Nakisa (2000) classify German noun stems into their appropriate plural inflection classes. Plunkett and Nakisa (1997) do the same for Arabic stems. Hoeffner (1992), Hare and Elman (1995), and Cottrell and Plunkett (1994) also solve an alternative problem: mapping semantic representations (usually one-hot vectors with one unit per possible word type, and one unit per possible inflection) to phonological outputs. As these networks use unstructured semantic inputs to represent words, they must act as memories, in that the phonological content of any word must be memorized. This precludes generalization to word types that were not seen during training.
Of the proposals that map semantics to phonology, the architecture in Hoeffner (1992) is unique in that it uses an attractor network rather than a feed-forward network, with the main difference being training using Hebbian learning rather than the standard backpropagation algorithm. Cottrell and Plunkett (1994) present an early use of a simple recurrent network (Elman, 1990) to decode output strings, making their model capable of variable-length output. Bullinaria (1997) includes one of the few models proposed that can deal with variable-length inputs. The model is a derivative of the NETtalk pronunciation model (Sejnowski and Rosenberg, 1987) that would today be considered a convolutional network. Each input phoneme in a stem is read independently along with its left and right context phonemes within a limited context window (i.e., a convolutional kernel). Each kernel is then mapped to zero or more output phonemes within a fixed template. Because each output fragment is independently generated, the architecture is limited to learning only local constraints on output string structure. Similarly, the use of a fixed context window also means that inflectional patterns that depend on long-distance dependencies between input phonemes cannot be captured.
Finally, the model of Westermann and Goebel (1995) is arguably the most similar to a modern ED architecture, relying on simple recurrent networks to both encode input strings and decode output strings. However, the model was intended to explicitly instantiate a dual-route mechanism and contains an additional explicit memory component to memorize irregulars. Despite the addition of this memory, the model was unable to fully learn the mapping from German verb stems to their participle forms, failing to capture the correct form for strong training verbs, including the copular sein → gewesen. As the authors note, this may be due to the difficulty of training simple recurrent networks, which tend to converge to poor local minima. Modern RNN varieties, such as long short-term memory (LSTM) networks in the ED model, were specifically designed to overcome these training limitations (Hochreiter and Schmidhuber, 1997).

Non-neural Learners
P&P describe several basic ideas that underlie a traditional, symbolic rule learner. Such a learner produces SPE-style rewrite rules that may be applied to deterministically transform the input string into the target. Rule candidates are found by comparing the stem and the inflected form, treating the portion that changes as the rule that governs the transformation. This is typically a set of non-copy edit operations. If multiple stem-past pairs share similar changes, these may be collapsed into a single rule by generalizing over the shared phonological features involved in the changes. For example, if multiple stems are converted to the past tense by the addition of the suffix [-d], the learner may create a collapsed rule that adds the suffix to all stems that end in a [+voice] sound. Different rules may be assigned weights (e.g., probabilities) derived from how many stem-past pairs exemplify the rules. These weights decide which rules to apply to produce the past tense.
Several systems that follow this rule-based template have been developed in NLP. Although the SPE itself does not impose detailed restrictions on rule structure, these systems generate rules that can be compiled into finite-state transducers (Kaplan and Kay, 1994; Ahlberg et al., 2015). These systems generalize well, but even the most successful variants have been shown to perform significantly worse than state-of-the-art neural networks at morphological inflection, often with a differential of more than 10 percentage points in accuracy on held-out data (Cotterell et al., 2016).
In the linguistics literature, the most straightforward, direct, machine-implemented instantiation of the P&P proposal is, arguably, the Minimal Generalization Learner (MGL) of Albright and Hayes (2003) (cf. Allen and Becker, 2015; Taatgen and Anderson, 2002). This model takes a mapping of phonemes to phonological features and makes feature-level generalizations like the post-voice [-d] rule described earlier. For a detailed technical description, see Albright and Hayes (2002). We treat the MGL as a baseline in §5.
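The first step of such a learner, isolating the portion of a stem-past pair that changes, can be sketched by stripping the longest shared prefix and suffix. This is an illustrative toy, not the procedure of any particular system: the function names are hypothetical, the transcriptions in the usage note are illustrative, and feature-based generalization over the extracted rules is omitted.

```python
from collections import Counter


def extract_rule(stem, past):
    """Split a (stem, past) pair into a change "from" -> "to" plus its context."""
    # Longest common prefix.
    p = 0
    while p < min(len(stem), len(past)) and stem[p] == past[p]:
        p += 1
    # Longest common suffix that does not overlap the prefix.
    s = 0
    while (s < min(len(stem), len(past)) - p
           and stem[len(stem) - 1 - s] == past[len(past) - 1 - s]):
        s += 1
    return {
        "from": stem[p:len(stem) - s],
        "to": past[p:len(past) - s],
        "left": stem[:p],
        "right": stem[len(stem) - s:],
    }


def count_changes(pairs):
    """Count how many stem/past pairs exemplify each change."""
    rules = (extract_rule(stem, past) for stem, past in pairs)
    return Counter((r["from"], r["to"]) for r in rules)
```

For instance, `extract_rule("sɪŋ", "sæŋ")` isolates the vowel change ɪ → æ between s and ŋ, while suffixing pairs yield an empty "from" side; the counts returned by `count_changes` are the raw material for rule weights.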
Unlike Taatgen and Anderson (2002), who explicitly account for dual route processing by including both memory retrieval and rule application submodules, Albright and Hayes (2003) and Allen and Becker (2015) rely on discovering and correctly weighting (using weights learned by log-linear regression) highly stem-specific rules to account for irregular transformations.
Within the context of rule-based systems, several proposals focus on the question of rule generalization, rather than rule synthesis. That is, given a set of predefined rules, the systems implement metrics to decide whether rules should generalize to novel forms, depending on the number of exceptions in the data set. Yang (2016) defines the 'tolerance principle,' a threshold for exceptionality beyond which a rule will fail to generalize. O'Donnell (2011) treats the question of whether a rule will generalize as one of optimal Bayesian inference.
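As a worked illustration of an exception threshold, Yang's tolerance principle is usually stated as follows: a rule applying to N items generalizes only if its exceptions number at most N / ln N. The sketch below encodes that formula; the function names are hypothetical, and applying the threshold to whole-lexicon counts is a simplification made for illustration.

```python
import math


def tolerance_threshold(n_items):
    """Maximum number of exceptions a rule over n_items can tolerate: N / ln N."""
    if n_items < 2:
        raise ValueError("threshold is only defined for N >= 2")
    return n_items / math.log(n_items)


def rule_generalizes(n_items, n_exceptions):
    """True if the number of exceptions does not exceed the tolerance threshold."""
    return n_exceptions <= tolerance_threshold(n_items)
```

Under this simplification, a lexicon of 4,039 verb types with 168 irregulars (the CELEX counts reported for Experiment 1) falls well below the threshold of roughly 486 exceptions, so the regular rule would be predicted to generalize.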

Evaluation of the ED Learner
We evaluate the performance of the ED architecture in light of the criticisms P&P levied against the original R&M model. We show that, in most cases, these criticisms no longer apply.⁵ The most potent line of attack P&P use against the R&M model is that it simply does not learn the English past tense very well. Although the non-deterministic, manual, and imprecise decoding procedure used by R&M makes it difficult to obtain exact accuracy numbers, P&P estimate that the model only prefers the correct past tense form for about 67% of English verb stems. Furthermore, many of the errors made by the R&M network are unattested in human performance. For example, the model produces blends of regular and irregular past-tense formation (e.g., eat → ated) that children do not produce unless they mistake ate for a present stem (Pinker, 1999). Furthermore, the R&M model frequently produces irregular past tense forms when a regular formation is expected (e.g., ping → pang). Humans are more likely to overregularize. These behaviors suggest that the R&M model learns the wrong kind of generalizations. As shown subsequently, the ED architecture seems to avoid these pitfalls, while outperforming a P&P-style non-neural baseline.

Experiment 1: Learning the Past Tense
In the first experiment, we seek to show: (i) the ED model successfully learns to conjugate both regular and irregular verbs in the training data, and generalizes to held-out data at convergence and (ii) the pattern of errors the model exhibits is compatible with attested speech errors.
CELEX Data Set. Our base data set consists of 4,039 verb types in the CELEX database (Baayen et al., 1993). Each verb is associated with a present tense form (stem) and past tense form, both in IPA. Each verb is also marked as regular or irregular (Albright and Hayes, 2003). A total of 168 of the 4,039 verb types were marked as irregular. We assigned verbs to train, development, and test sets according to a random 80-10-10 split. Each verb appears in exactly one of these sets once. This corresponds to a uniform distribution over types because every verb has an effective frequency of 1.
In contrast, the original R&M model was trained and tested (data was not held out) on a set of 506 stem/past pairs derived from Kučera and Francis (1967). A total of 98 of the 506 verb types were marked as irregular.
Types vs. Tokens. In real human communication, words follow a Zipfian distribution, with many irregular verbs being exponentially more common than regular verbs. Although this condition is more true to the external environment of language learning, it may not accurately represent the psychological reality of how that environment is processed. A body of psycholinguistic evidence (Bybee, 1995, 2001; Pierrehumbert, 2001) suggests that human learners generalize phonological patterns based on the count of word types they appear in, ignoring the token frequency of those types. Thus, we chose to weight all verb types equally for training, effecting a uniform distribution over types as described above.
Hyperparameters and Other Details. Our architecture is nearly identical to that used in Bahdanau et al. (2014), with hyperparameters set following Kann and Schütze (2016, §4.1.1). Each input character has an embedding size of 300 units. The encoder consists of a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) with two layers. There is a dropout value of 0.3 between the layers. The decoder is a unidirectional LSTM with two layers. Both the encoder and decoder have 100 hidden units. Training was done using the Adadelta procedure (Zeiler, 2012) with a learning rate of 1.0 and a minibatch size of 20. We train for 100 epochs to ensure that all verb forms in the training data are adequately learned. We decode the model with beam search (k = 12). The code for our experiments is derived from the OpenNMT package (Klein et al., 2017). We use accuracy as our metric of performance. We train the MGL as a non-neural baseline, using the code distributed with Albright and Hayes (2003) with default settings.
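A random 80-10-10 split over verb types can be sketched as follows. The seed, the rounding of set sizes, and the function name are assumptions of this sketch; the paper does not specify them.

```python
import random


def split_types(verbs, seed=0, train=0.8, dev=0.1):
    """Shuffle verb types and cut them into train/dev/test lists.

    Each type lands in exactly one set; the seed and the rounding are
    assumptions made for reproducibility of this sketch.
    """
    items = list(verbs)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * train)
    n_dev = int(len(items) * dev)
    return (items[:n_train],
            items[n_train:n_train + n_dev],
            items[n_train + n_dev:])
```

Splitting over types rather than tokens is what gives every verb an effective frequency of 1.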
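For reference, the settings listed above can be collected in a single configuration object. The values come from the paragraph; the class and field names are hypothetical and do not correspond to OpenNMT option names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EDConfig:
    """Hyperparameters as reported in the text (field names are illustrative)."""
    embedding_size: int = 300           # units per input character embedding
    encoder_layers: int = 2             # bidirectional LSTM encoder
    encoder_bidirectional: bool = True
    decoder_layers: int = 2             # unidirectional LSTM decoder
    hidden_size: int = 100              # hidden units in encoder and decoder
    dropout: float = 0.3                # applied between layers
    optimizer: str = "adadelta"
    learning_rate: float = 1.0
    batch_size: int = 20
    epochs: int = 100
    beam_size: int = 12                 # beam search width k at decoding time


config = EDConfig()
```

Freezing the dataclass keeps the reported settings from being modified accidentally between runs.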
Results and Discussion.The non-neural MGL baseline unsurprisingly learns the regular pasttense pattern nearly perfectly, given that it is imbued with knowledge of phonological features as well as a list of phonologically illegal phoneme sequences to avoid in its output.However, in our testing of the MGL, the preferred past-tense output for all verbs was never an irregular formulation.This was true even for irregular verbs that were observed by the learner in the training set.One might say that the MGL is only intended to account for the regular route of a dual route system.However, the intended scope of the MGL seems to be wider.The model is billed as accurately learning "islands of subregularity" within the past tense system, and Albright and Hayes use the model to make predictions about which irregular forms of novel verb stems are preferable to human speakers (see the subsequent discussion of wugs).Single-Task (MGL) 96.0 96.0 94.5 99.9 100.0 100.0 0.0 0.0 0.0 Single-Task (Type) 99.8 † 97.4 95.1 99.9 99.2 98.9 97.6 † 53.3 † 28.6 † Multi-Task (Type) 100.0 † 96.9 95.1 100.0 99.5 99.7 99.2 † 33.3 † 28.6 † Table 3: Results on held-out data in English past tense prediction for single-and multi-task scenarios.The MGL achieves perfect accuracy on regular verbs, and 0 accuracy on irregular verbs.† indicates that a neural model's performance was found to be significantly (p < 0.05) from the MGL.
In contrast, the ED model, despite no built-in knowledge of phonology, successfully learns to conjugate nearly all the verbs in the training data, including irregulars; no reduction in scope is needed. This capacity to account for specific exceptions to the regular rule does not result in overfitting. We note similarly high accuracy on held-out regular data: 98.9% to 99.2% at convergence depending on the condition. We report the full accuracy in all conditions in Table 3. The † indicates when a neural model's performance was found to be significantly different (p < 0.05) from the MGL according to a χ² test. The ED model achieves near-perfect accuracy on regular verbs and on irregular verbs seen during training, as well as substantial accuracy on irregular verbs in the dev and test sets. This behavior jointly results in better overall performance for the ED model when all verbs are considered. Figure 1 shows learning curves for regular and irregular verb types in different conditions.
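The significance comparison can be illustrated with a χ² test of independence on a 2 × 2 table of correct and incorrect counts for two models. The sketch below shows one plausible set-up and is not necessarily the exact procedure used here: the function name is hypothetical, the counts in the usage lines are invented for illustration, and no continuity correction is applied.

```python
import math


def chi_square_2x2(correct_a, wrong_a, correct_b, wrong_b):
    """Pearson chi-square test of independence for a 2x2 table (1 degree of freedom).

    Rows are the two models; columns are correct / incorrect counts.
    Returns (statistic, p_value).
    """
    observed = [[correct_a, wrong_a], [correct_b, wrong_b]]
    row_totals = [sum(row) for row in observed]
    col_totals = [observed[0][j] + observed[1][j] for j in range(2)]
    total = sum(row_totals)
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs at least one observation")

    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom, P(X > s) = erfc(sqrt(s / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Invented counts: model A gets 0 of 30 items right, model B gets 16 of 30.
statistic, p_value = chi_square_2x2(0, 30, 16, 14)
print(f"chi2 = {statistic:.3f}, p = {p_value:.2g}, significant = {p_value < 0.05}")
```

With small held-out sets, as for the irregular dev and test verbs, expected cell counts can be low, in which case an exact test may be the more cautious choice.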
An error analysis of held-out data shows that the errors made by this network do not show any of the problems of the R&M architecture. There are no blend errors of the eat → ated variety. Indeed, the only error the network makes on irregulars is overregularization (e.g., throw → throwed). In fact, the lower accuracy that we observe for irregular verbs in development and test, which is caused by overregularization, is expected and desirable; it matches the human tendency to treat novel words as regular, lacking knowledge of irregularity (Albright and Hayes, 2003).
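The error categories discussed above (correct output, overregularization, and regular/irregular blends) can be separated mechanically. The following is a minimal sketch under loud assumptions: it works on orthographic strings rather than the phonetic transcriptions used in the experiments, it uses a deliberately simplified suffixation rule (add "d" after a final "e", otherwise "ed"), and all function names are hypothetical.

```python
def add_regular_suffix(form):
    """Simplified orthographic -ed suffixation (no consonant doubling, no y -> ied)."""
    return form + "d" if form.endswith("e") else form + "ed"


def classify_output(stem, gold_past, predicted):
    """Label a predicted past tense form relative to the stem and the gold form."""
    if predicted == gold_past:
        return "correct"
    if predicted == add_regular_suffix(stem):
        # Regular suffix attached to the bare stem, e.g. throw -> throwed.
        return "overregularization"
    if predicted == add_regular_suffix(gold_past):
        # Regular suffix attached to the irregular past, e.g. eat -> ated.
        return "blend"
    return "other"


examples = [
    ("throw", "threw", "throwed"),
    ("eat", "ate", "ated"),
    ("take", "took", "took"),
    ("thin", "thinned", "thun"),
]
for stem, gold, predicted in examples:
    print(stem, "->", predicted, ":", classify_output(stem, gold, predicted))
```

The "other" label is a catch-all for outputs such as the single-vowel changes reported for regular verbs; a finer-grained analysis would need phonological representations.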
Although most held-out irregulars are regularized, as expected, the ED model does, perhaps surprisingly, correctly conjugate a handful of irregular forms it has not seen during training: five in the test set. However, three of these are prefixed versions of irregulars that exist in the training set (retell → retold, partake → partook, withdraw → withdrew). One (sling → slung) is an analogy to similar training words (fling, cling). The final conjugation, forsake → forsook, is an interesting combination, with the prefix "for," but an unattested base form "sake" that is similar to "take." 6
From the training data, the only regular verb with an error is compartmentalized, whose past tense is predicted to be "compartmentalized," with a spurious vowel change that would likely be ironed out with additional training. Among the regular verbs in the development and test sets, the errors also consisted of single vowel changes (the full set of these errors was "thin" → "thun," "try" → "traud," "institutionalize" → "instititionalized," and "drawl" → "drooled").
Overall then, the ED model performs extremely well, a far cry from the ≈67% accuracy of the R&M model. It exceeds any reasonable standard of empirical adequacy, and shows human-like error behavior.
Acquisition Patterns. R&M made several claims that their architecture modeled the detailed acquisition of the English past tense by children. The core claim was that their model exhibited a macro U-shaped learning curve as in §2. Irregulars were initially produced correctly, followed by a period of overregularization preceding a final correct stage. However, P&P point out that R&M only achieve this pattern by manipulating the input distribution fed into their network. They trained only on irregulars for a number of epochs, before flooding the network with regular verb forms. R&M justify this by claiming that young children's vocabulary consists disproportionately of irregular verbs early on, but P&P cite contrary evidence. A survey of child-directed speech shows that the ratio of regular to irregular verbs a child hears is constant while they are learning their language (Slobin, 1971). Furthermore, psycholinguistic results suggest that there is no early skew towards irregular verbs in the vocabulary children understand or use (Brown, 1973).
Although we do not wish to make a strong claim that the ED architecture accurately mirrors children's acquisition, only that it ultimately learns the correct generalizations, we wanted to see if it would display a child-like learning pattern without changing the training inputs fed into the network over time; that is, in all of our experiments, the data sets remained fixed for all epochs, unlike in R&M. We do not clearly see a macro U-shape, but we do observe Plunkett and Marchman's predicted oscillations for irregular learning, the so-called micro U-shaped pattern. As shown in Table 4, individual verbs oscillate between correct production and overregularization before they are fully mastered.
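The micro U-shaped pattern can be made operational by recording, for each verb, the form the model produces after every epoch and then locating the epochs at which that form changes. The sketch below is a minimal illustration of that bookkeeping: the function names are hypothetical, and the example history for cling is invented for illustration rather than taken from Table 4.

```python
def change_points(history):
    """Return the (epoch, form) entries at which the produced form changes.

    history: list of (epoch, produced_form) pairs in epoch order.
    """
    points = []
    previous_form = None
    for epoch, form in history:
        if form != previous_form:
            points.append((epoch, form))
            previous_form = form
    return points


def count_regressions(history, gold_form):
    """Count transitions from a correct form back to an incorrect one."""
    regressions = 0
    was_correct = False
    for _, form in history:
        is_correct = form == gold_form
        if was_correct and not is_correct:
            regressions += 1
        was_correct = is_correct
    return regressions


# Invented per-epoch outputs for the stem "cling" (gold past tense: "clung").
cling_history = [(5, "clinged"), (6, "clung"), (7, "clung"),
                 (8, "clinged"), (9, "clung"), (10, "clung")]

print(change_points(cling_history))
print(count_regressions(cling_history, "clung"))
```

A verb with at least one regression shows the oscillation between correct production and overregularization described above; a verb with none is simply learned once and retained.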
Wug Testing. As a further test of the MGL as a cognitive model, Albright and Hayes created a set of 74 nonce English verb stems with varying levels of similarity to both regular and irregular verbs. For each stem (e.g., rife), they picked one regular output form (rifed), and one irregular output form (rofe). They used these stems and potential past-tense variants to perform a wug test with human participants. For each stem, they had 24 participants freely attempt to produce a past tense form. They then counted the percentage of participants who produced the pre-chosen regular and irregular forms (production probability). The production probabilities for each pre-chosen regular and irregular form could then be correlated with the predicted scores derived from the MGL.

English                         Network  MG
Regular (rife ∼ rifed, n=58)    0.48     0.35
Irregular (rife ∼ rofe, n=74)   0.45     0.36
In Table 5, we compare the correlations obtained with their model's scores to the correlations between the human scores and the output probabilities given by an ED model. As the wug data provided with Albright and Hayes (2003) use a different phonetic transcription than the one we used, we trained a separate ED model for this comparison. Model architecture, training verbs, and hyperparameters remained the same. Only the transcription used to represent input and output strings was changed to match Albright and Hayes (2003). Following the original paper, we correlate the probabilities for regular and irregular transformations separately.
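We apply Spearman's rank correlation, as we don't necessarily expect a linear relationship. We see that the ED model probabilities are slightly more correlated with the human production probabilities than the MGL's scores are (ρ = 0.48 vs. 0.35 for regular forms; ρ = 0.45 vs. 0.36 for irregular forms).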
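Spearman's ρ is the Pearson correlation computed on ranks, which is why it does not presuppose a linear relationship between model scores and human production probabilities. The following is a minimal pure-Python sketch with average ranks for ties; the function names are hypothetical and the numbers in the usage lines are invented, not the wug data.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(xs, ys):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    if len(xs) != len(ys):
        raise ValueError("inputs must have the same length")
    return pearson(average_ranks(xs), average_ranks(ys))


# Invented scores: human production probabilities vs. model probabilities.
human = [0.90, 0.10, 0.55, 0.30, 0.75]
model = [0.80, 0.05, 0.20, 0.40, 0.60]
print(round(spearman_rho(human, model), 3))
```

Because only the ordering matters, any monotone rescaling of the model's probabilities (for example, taking logs) leaves ρ unchanged.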

Experiment 2: Joint Multi-Task Learning
Another objection leveled by P&P is R&M's focus on learning a single morphological transduction: stem to past tense. Many phonological patterns in a language, however, are not restricted to a single transduction; they make up a core part of the phonological system and take part in many different processes. For instance, the voicing assimilation patterns found in the past tense also apply to the third person singular: we see the affix -s rendered as [-s] after voiceless consonants and [-z] after voiced consonants and vowels. P&P argue that the R&M model would not be able to take advantage of these shared generalizations. Assuming a different network would need to be trained for each transduction (e.g., stem to gerund and stem to past participle), it would be impossible to learn that they have any patterns in common. However, as discussed in §3.2, a single ED model can learn multiple types of mapping, simply by tagging each input-output pair in the training set with the transduction it represents. A network trained in such a way shares the same weights and phoneme embeddings across tasks, and thus has the capacity to generalize patterns across all transductions, naturally capturing the overall phonology of the language. Because different transductions mutually constrain each other (e.g., English in general does not allow sequences of identical vowels), we actually expect faster learning of each individual pattern, which we test in the following experiment.
We trained a model with an architecture identical to that used in Experiment 1, but this time to jointly predict four mappings associated with English verbs (past, gerund, past participle, third-person singular).
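The shared generalization that P&P describe can be made concrete with a small rule-based sketch in which the past tense and third-person singular suffixes reuse the same voicing check. This is an illustration of the linguistic pattern only, not of how the ED model represents it. The phoneme sets are the standard English inventories in IPA; the function and constant names are hypothetical, and the epenthetic [ɪd] and [ɪz] cases, which the discussion above does not cover, are included only so that the sketch is complete.

```python
# Final phonemes are given as IPA strings; affricates are single symbols here.
VOICELESS = {"p", "t", "k", "f", "θ", "s", "ʃ", "tʃ", "h"}
SIBILANTS = {"s", "z", "ʃ", "ʒ", "tʃ", "dʒ"}
ALVEOLAR_STOPS = {"t", "d"}


def suffix_allomorph(final_phoneme, transduction):
    """Pick the surface suffix for a stem ending in final_phoneme."""
    if transduction == "past":
        if final_phoneme in ALVEOLAR_STOPS:
            return "ɪd"                      # e.g. want -> wanted
        voiceless_form, voiced_form = "t", "d"
    elif transduction == "3sg":
        if final_phoneme in SIBILANTS:
            return "ɪz"                      # e.g. kiss -> kisses
        voiceless_form, voiced_form = "s", "z"
    else:
        raise ValueError(f"unknown transduction: {transduction}")

    # Shared step: the suffix agrees in voicing with the stem-final segment.
    return voiceless_form if final_phoneme in VOICELESS else voiced_form


for final in ["k", "g", "t", "s"]:
    print(final, suffix_allomorph(final, "past"), suffix_allomorph(final, "3sg"))
```

The final line of the function is the part both transductions have in common, which is the kind of regularity a jointly trained network has the opportunity to share across tasks.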
Data. For each of the verb types in our base training set from Experiment 1, we added the three remaining mappings. The gerund, past-participle, and third-person singular forms were identified in CELEX according to their labels in Wiktionary (Sylak-Glassman et al., 2015). The network was trained on all individual stem → inflection pairs in the new training set, with each input string modified with additional characters representing the current transduction (Kann and Schütze, 2016): take <PST> → took, but take <PTCP> → taken. 7
Results. Table 3 and Figure 1 show the results. Overall, accuracy is >99% after convergence on train. Although the difference in final performance is never statistically significant compared to single-task learning, the learning curves are much steeper, so this level of performance is achieved much more quickly. This provides evidence for our intuition that cross-task generalization facilitates individual task learning due to shared phonological patterning (i.e., jointly generating the gerund hastens past-tense learning).
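The input tagging described under Data can be sketched as a small preprocessing step that appends one transduction symbol to the character sequence of each stem. In the sketch below, only <PST> and <PTCP> are attested in the text; the gerund and third-person singular tags, the dictionary keys, and the function names are hypothetical, and orthographic strings stand in for the phonetic transcriptions used in the experiments.

```python
TRANSDUCTION_TAGS = {
    "past": "<PST>",
    "past_participle": "<PTCP>",
    "gerund": "<GER>",            # hypothetical tag, not given in the text
    "third_singular": "<3SG>",    # hypothetical tag, not given in the text
}


def make_example(stem, transduction, target):
    """Build one (source symbols, target symbols) training pair."""
    source = list(stem) + [TRANSDUCTION_TAGS[transduction]]
    return source, list(target)


def expand_paradigm(stem, forms):
    """Turn one stem and its inflected forms into tagged training pairs."""
    return [make_example(stem, name, form) for name, form in forms.items()]


pairs = expand_paradigm("take", {
    "past": "took",
    "past_participle": "taken",
    "gerund": "taking",
    "third_singular": "takes",
})
for source, target in pairs:
    print(" ".join(source), "->", " ".join(target))
```

Treating the tag as a single symbol rather than as a run of characters keeps it distinct from the phoneme or letter vocabulary, so the same encoder and embeddings serve all four mappings.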

Summary of Resolved and Outstanding Criticisms
In this paper, we have argued that the Encoder-Decoder architecture obviates many of the criticisms P&P leveled against R&M. Most importantly, the empirical performance of neural models is no longer an issue. The past tense transformation is learned nearly perfectly, compared to an approximate accuracy of 67% for R&M. Furthermore, the ED architecture solves the problem in a fully general setting. A single network can easily be trained on multiple mappings at once (and appears to generalize knowledge across them). No representational kludges such as Wickelphones are required; ED networks can map arbitrary length strings to arbitrary length strings. This permits training and evaluating the ED model on realistic data, including the ability to assign an exact probability to any arbitrary output string, rather than "representative" data designed to fit in a fixed-size neural architecture (e.g., fixed input and output templates). Evaluation shows that the ED model does not appear to display any of the degenerate error-types P&P note in the output of R&M (e.g., regular/irregular blends of the eat → ated variety).
7 Without input annotation to mark the different mappings the network must learn, it would treat all input/output pairs as belonging to the same mapping, with each inflected form of a single stem as an equally likely output variant associated with that mapping. It is not within the scope of this network architecture to solve problems other than morphological transduction, such as discovering the range of morphological paradigm slots.
Despite this litany of successes, some outstanding criticisms of R&M still remain to be addressed. On the trivial end, P&P correctly point out that the R&M model does not handle homophones: write → wrote, but right → righted. This is because it only takes the phonological make-up of the input string into account, without concern for its lexical identity. This issue affects the ED models we discuss in this paper as well; lexical disambiguation is outside of their intended scope. However, even the rule learner that P&P propose does not have such functionality. Furthermore, if lexical markings were available, we could incorporate them into the model just as with different transductions in the multi-task set-up (i.e., by adding the disambiguating markings to the input).
More importantly, we need to limit any claims regarding treating ED models as proxies for child language learners. P&P criticized such claims from R&M because they manipulated the input data distribution given to their network over time to effect a U-shaped learning curve, despite no evidence that the manipulation reflected children's perception or production capabilities. We avoid this criticism in our experiments, keeping the input distribution constant. We even show that the ED model captures at least one observed pattern of child language development: Plunkett and Marchman's predicted oscillations for irregular learning, the micro U-shaped pattern. However, we did not observe a macro U-shape, nor was the micro effect consistent across all irregular verbs. More study is needed to determine the ways in which ED architectures do or do not reflect children's behavior. Even if nets do not match the development patterns of any individual, they may still be useful if they ultimately achieve a knowledge state that is comparable to that of an adult or, possibly, the aggregate usage statistics of a population of adults.
In this vein, P&P note that the R&M model is able to learn highly unnatural patterns that do not exist in any language. For example, it is trivial to map each Wickelphone to its reverse, effectively creating a mirror image of the input (e.g., brag → garb). Although an ED model could likely learn linguistically unattested patterns as well, some patterns may be more difficult to learn than others; they might, for instance, require increased time-to-convergence. It remains an open question for future research to determine which patterns RNNs prefer, and which changes are needed to account for over- and underfitting. Indeed, any sufficiently complex learning system (including rule-based learners) would have learning biases that require further study.
There are promising directions from which to approach this study. Networks are in a way analogous to animal models (McCloskey, 1991), in that they share interesting properties with human learners, as shown empirically, but are much easier and less costly to train and manipulate across multiple experiments. Initial experiments could focus on default architectures, as we do in this paper, effectively treating them as inductive baselines (Gildea and Jurafsky, 1996) and measuring their performance given limited domain knowledge. Our ED networks, for example, have no built-in knowledge of phonology or morphology. Failures of these baselines would then point the way towards the biases required to learn human language, and models modified to incorporate these biases could be tested.

Conclusion
We have shown that the application of the ED architecture to the problem of learning the English past tense obviates many, though not all, of the objections leveled by P&P against the first neural network proposed for the task, suggesting that the criticisms do not extend to all neural models, as P&P imply. Compared with a non-neural baseline, the ED model accounts for both regular and irregular past tense formation in observed training data and generalizes to held-out verbs, all without built-in knowledge of phonology. Although not necessarily intended to act as a proxy for a child learner, the ED model also shows one of the development patterns that has been observed in children, namely, a micro U-shaped (oscillating) learning curve for irregular verbs. The accurate and substantially human-like performance of the ED model warrants consideration of its use as a research tool in theoretical linguistics and cognitive science.

Downloaded from http://www.mitpressjournals.org/doi/pdf/10.1162/tacl_a_00247 by guest on 01 May 2024
Table 3 column headers: all (train, dev, test), regular (train, dev, test), irregular (train, dev, test).

Figure 1: Single-task vs. multi-task. Learning curves for the English past tense. The x-axis is the number of epochs (one complete pass over the training data) and the y-axis is the accuracy on the training data (not the metric of optimization).

Table 1: Examples of inflected English verbs.

Table 2: A curated list of related work, categorized by aspects of the technique. Based on a similar list found in Marcus (2001, page 82).

Table 4: Here we illustrate the oscillating development of single words in our corpus. For each stem, e.g., CLING, we show the past form produced at change points to show the diversity of alternation. Beyond the last epoch displayed, each verb was produced correctly.

Table 5: Spearman's ρ of human wug production probabilities with MG scores and ED probability estimates.