Abstract

The Calibrated Error-Driven Ranking Algorithm (CEDRA; Magri 2012) is shown to fail on two test cases of phonologically conditioned variation from Boersma and Hayes 2001. The failure of the CEDRA raises a serious unsolved challenge for learnability research in stochastic Optimality Theory, because the CEDRA itself was proposed to repair a learnability problem (Pater 2008) encountered by the original Gradual Learning Algorithm. This result is supported by both simulation results and a detailed analysis whereby a few constraints and a few candidates at a time are recursively “peeled off” until we are left with a “core” small enough that the behavior of the learner is easy to interpret.

1 Introduction

This article offers a method of diagnosis applicable to proposed learning algorithms for stochastic Optimality Theory (OT; Prince and Smolensky 2004, Boersma 1997, 1998). The core intuition (which stems from Tesar and Smolensky’s (1998) seminal learnability analysis) is that we can tackle a complex test case with many constraints and many candidates because the structure of (deterministic or stochastic) OT allows us to recursively “peel off” a few constraints and a few candidates at a time, until we are left with a “core” that is small enough to immediately reveal the behavior of the learner.

Using this method, we scrutinize four different ranking algorithms. They differ along two dimensions: the amount of constraint promotion performed (none, small, or large), and which loser-preferring constraints are demoted (all of them or just the undominated ones). We apply these four algorithms to two test cases from Boersma and Hayes (henceforth BH) 2001: Ilokano segmental phonology and Finnish genitive plural allomorphy. We find that of the four algorithms, only the original Gradual Learning Algorithm (GLA; Boersma 1997, 1998) as employed by BH can learn the relevant patterns. Among the rival algorithms that fail is the Calibrated Error-Driven Ranking Algorithm (CEDRA) proposed by Magri (2012). The failure of the CEDRA raises a serious unsolved challenge for learnability research in stochastic OT, because the CEDRA itself was proposed to repair a learnability problem (Pater 2008) encountered by the original GLA. Thus, at present there is no algorithm for stochastic OT that works in all cases.

The article is organized as follows. Section 2 briefly reviews the various implementations of OT stochastic error-driven learning considered in the article. Section 3 illustrates our technique for the analysis of stochastic error-driven learners on the Ilokano metathesis test case. The analysis leads to a proof that the GLA always succeeds on this test case, without the need to run any simulations. Section 4 looks at the Finnish test case and shows that the glitch in the GLA’s performance in BH’s simulations is due to a failure of the grammatical analysis, not to a shortcoming of the learner. The proposed analyses straightforwardly explain why the GLA’s variants considered here, and in particular the CEDRA, fail on these test cases. Section 5 concludes by discussing the implications for the theory of error-driven learning in stochastic OT. The presentation is kept informal, with details relegated to an appendix available online (https://www.mitpressjournals.org/doi/suppl/10.1162/ling_a_00328).

2 OT Stochastic Error-Driven Learning

Boersma (1997, 1998) introduces the following stochastic variant of OT. We are given a set of candidate pairs of underlying and surface forms together with a constraint set consisting of n constraints C1, . . . , Cn. Each constraint Ck is assigned a ranking value θk. These ranking values are collected together into a ranking vector θ = (θ1, . . . , θn). Let ε1, . . . , εn be n numbers sampled independently from each other according to the same underlying continuous distribution Ɗ. These numbers are collected together into a stochastic vector ε = (ε1, . . . , εn). The sum θk + εk of each ranking value θk with the corresponding stochastic value εk is called a stochastic ranking value. These sums are collected together into the stochastic ranking vector θ + ε = (θ1 + ε1, . . . , θn + εn). Since the numbers εh, εk are sampled according to a distribution Ɗ that is continuous, the probability that two stochastic ranking values θh + εh and θk + εk are identical is equal to zero. The stochastic ranking vector θ + ε thus represents the unique constraint ranking that ranks a constraint Ch above another constraint Ck if and only if the stochastic ranking value θh + εh of the former is larger than the stochastic ranking value θk + εk of the latter. The stochastic OT grammar corresponding to a ranking vector θ and a continuous distribution Ɗ (given a candidate set and a constraint set) is the function from underlying to surface forms defined as follows: whenever it is called on an underlying form /x/, it samples the components of ε independently from each other according to the distribution Ɗ and it returns the surface form [y] such that (the unique constraint ranking represented by) the stochastic ranking vector θ + ε prefers the candidate pair (/x/, [y]) to any other candidate (/x/, [z]) that pairs that underlying form /x/ with a different surface form [z].

Boersma assumes Ɗ to be a Gaussian distribution with zero mean and small variance. Since the tails of the Gaussian distribution decrease exponentially fast, the stochastic value εk is bounded between −Δ and +Δ with high probability (which of course depends on the threshold Δ). Thus, whenever the distance between two ranking values θh and θk is large (namely, larger than 2Δ), the condition θh > θk is equivalent to the condition θh + εh > θk + εk with high probability. In other words, the original ranking vector θ and the corresponding stochastic ranking vector θ + ε agree on the relative ranking of the two constraints Ch and Ck. From the analytical perspective adopted in this article, it is nonetheless convenient to make each stochastic value εk deterministically bounded between −Δ and +Δ, rather than bounded with high probability.1 The analyses developed in this article extend with high probability to the Gaussian case.
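As a concrete illustration, here is a minimal Python sketch of this evaluation procedure. The uniform noise bounded in [−Δ, +Δ] implements the deterministically bounded stochastic values just described; the constraint names, violation profiles, and the value of Δ are illustrative assumptions rather than part of Boersma's implementation.

```python
import random

DELTA = 2.0  # illustrative bound on the stochastic values

def sample_stochastic_ranking(ranking_values, delta=DELTA):
    """Add an independent stochastic value to each ranking value and return the
    constraints sorted from highest to lowest stochastic ranking value."""
    noisy = {c: v + random.uniform(-delta, delta) for c, v in ranking_values.items()}
    return sorted(noisy, key=noisy.get, reverse=True)

def stochastic_grammar(ranking_values, candidates):
    """candidates maps each candidate surface form to its violation profile
    (a dict from constraint names to violation counts); the candidate whose
    violation profile is smallest under the sampled ranking wins."""
    order = sample_stochastic_ranking(ranking_values)
    return min(candidates, key=lambda form: [candidates[form].get(c, 0) for c in order])

# Two constraints with identical ranking values yield roughly 50/50 variation
# between two candidates that each violate one of them.
values = {"LINEARITY": 100.0, "*ʔ]": 100.0}
cands = {"[taw.ʔen]": {"LINEARITY": 1}, "[taʔ.wen]": {"*ʔ]": 1}}
runs = [stochastic_grammar(values, cands) for _ in range(10000)]
print(runs.count("[taw.ʔen]") / len(runs))  # close to .5
```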

Error-driven learning within this framework takes the form of the stochastic Error-Driven Ranking Algorithm (EDRA) in (1). This learner maintains a current stochastic OT grammar, represented through a current ranking vector θ. Following BH, these current ranking values are all initialized to the same value—say, 100 for concreteness. These initial ranking values are then updated by looping through the four steps (1a–d).

(1)

a. The learner receives a piece of data: an underlying form /x/ paired with its surface realization [y] according to the target grammar.

b. The learner computes the surface form [z] predicted for /x/ by the stochastic OT grammar corresponding to the current ranking vector θ.

c. If the predicted winner [z] coincides with the intended winner [y], the learner loops back to step (1a).

d. If they differ, the learner updates the current ranking vector θ according to the reranking rule and loops back to step (1a).

At step (1a), the algorithm receives a piece of data consisting of an underlying form /x/ together with the corresponding surface realization [y] according to the target grammar. At step (1b), the algorithm computes the candidate [z] predicted to be the winner for the underlying form /x/ by the stochastic OT grammar corresponding to the current ranking vector θ (and a certain distribution Ɗ used to sample the stochastic values). If the predicted winner [z] coincides with the intended winner [y], the current ranking vector has performed impeccably on the current piece of data. The EDRA thus has nothing to learn from the current piece of data, loops back to step (1a), and waits for more data. If instead the predicted winner [z] differs from the intended winner [y], the current ranking vector is updated at step (1d).

The learner focuses on the comparison between the intended winner [y] and the incorrectly predicted winner [z]. The latter is usually referred to as the current loser form. The failure of the current stochastic ranking vector suggests that the constraints that prefer the current loser [z] over the intended winner [y] (namely, the constraints that assign fewer violations to the former than to the latter) are currently ranked too high, while the constraints that prefer the intended winner [y] over the current loser [z] are ranked too low. The reranking rule used by the EDRA at step (1d) tries to remedy these shortcomings: it promotes winner-preferring constraints by a certain promotion amount p; and it demotes loser-preferring constraints by a certain demotion amount d. What matters for the behavior of the algorithm is the ratio between p and d (not their individual size). The demotion amount d can thus be set equal to 1 for concreteness.2
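To make step (1d) concrete, here is a minimal sketch of the update in Python. The generic parameters p and d and the split between promoted and demoted constraints follow the description above; the function and variable names are our own.

```python
def reranking_update(ranking_values, winner_preferrers, demoted_losers, p, d=1.0):
    """One error-driven update: promote every winner-preferring constraint by the
    promotion amount p and demote the selected loser-preferring constraints by
    the demotion amount d (set to 1 for concreteness, as in the text)."""
    updated = dict(ranking_values)
    for c in winner_preferrers:
        updated[c] += p
    for c in demoted_losers:
        updated[c] -= d
    return updated

# A GLA-style update (p = d = 1) on an ERC whose winner-preferrer is LINEARITY
# and whose only loser-preferrer is MAXIO(ʔ).
values = {"LINEARITY": 100.0, "*ʔ]": 100.0, "MAXIO(ʔ)": 100.0}
print(reranking_update(values, ["LINEARITY"], ["MAXIO(ʔ)"], p=1.0))
```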

Different reranking rules differ with respect to two choice points. The first choice point concerns the promotion component of the reranking rule: how should we choose the promotion amount p? Various options have been explored in the literature, summarized in the second column of table 1. The gradual EDCD reranking rule (Error-Driven Constraint Demotion; Tesar and Smolensky 1998) assumes a null promotion amount p = 0.³ The GLA reranking rule (Gradual Learning Algorithm; Boersma 1997, 1998) assumes that the promotion amount equals the demotion amount—namely, p = d = 1. Finally, the CEDRA (Calibrated EDRA; Magri 2012) assumes a promotion amount calibrated on the number w of winner-preferring constraints promoted through the identity p = 1/(w + 1), whereby the promotion amount p ends up always being smaller than the demotion amount d = 1.

Table 1

Four reranking rules for the stochastic EDRA

Reranking rule | Promotion amount | Which constraints are demoted
(Gradual) Error-Driven Constraint Demotion (EDCD; Tesar and Smolensky 1998) | p = 0 | Only the undominated loser-preferrers
Gradual Learning Algorithm (GLA; Boersma 1997, 1998) | p = 1 | All loser-preferrers
Minimal Gradual Learning Algorithm (minGLA; Boersma 1997, 1998) | p = 1 | Only the highest loser-preferrer
Calibrated Error-Driven Ranking Algorithm (CEDRA; Magri 2012) | p = 1/(w + 1) | Only the undominated loser-preferrers

The second choice point concerns the demotion component: which loser-preferring constraints are demoted? (All proposals in the literature agree on treating all the winner-preferring constraints on a par.) Various options have been explored in the literature, summarized in the third column of table 1. The GLA demotes all loser-preferring constraints. EDCD and the CEDRA instead demote only those loser-preferring constraints that need to be demoted—namely, those that are currently undominated, in the sense that no winner-preferring constraint has a larger stochastic ranking value. Finally, the minGLA demotes only the unique undominated loser-preferring constraint that is currently ranked highest—namely, the one that has the largest stochastic ranking value.4
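The two choice points can be summarized as two small functions, sketched below. The CEDRA formula p = 1/(w + 1) and the reading of "undominated" as "not outranked by any winner-preferrer" are our reconstructions of the definitions discussed above, and the function names are our own.

```python
def promotion_amount(rule, w):
    """Promotion amount as a function of the number w of winner-preferring
    constraints, for the four reranking rules of table 1."""
    if rule == "EDCD":
        return 0.0
    if rule in ("GLA", "minGLA"):
        return 1.0
    if rule == "CEDRA":
        return 1.0 / (w + 1)  # calibrated so that total promotion stays below total demotion
    raise ValueError(rule)

def demoted_losers(rule, loser_preferrers, winner_preferrers, stochastic_values):
    """Which loser-preferring constraints are demoted under each rule."""
    if rule == "GLA":
        return list(loser_preferrers)
    if rule == "minGLA":
        return [max(loser_preferrers, key=lambda c: stochastic_values[c])]
    # EDCD and CEDRA: only the undominated loser-preferrers, i.e., those whose
    # stochastic ranking value is not exceeded by any winner-preferrer.
    top_winner = max(stochastic_values[c] for c in winner_preferrers)
    return [c for c in loser_preferrers if stochastic_values[c] >= top_winner]

print(promotion_amount("CEDRA", w=1))  # 0.5
```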

3 The Ilokano Metathesis Test Case

This section analyzes the performance of the stochastic EDRA on BH’s Ilokano metathesis test case (based on data and analysis from Hayes and Abad 1989).

3.1 Description of the Test Case

The Ilokano metathesis test case is summarized in (2) and (3). It features four underlying forms listed in (2a), which are assumed to have the same frequency. The corresponding candidates are listed in (2b) together with their probabilities. The three underlying forms /paʔlak/, /ʔajo-en/, and /basa-en/ admit a unique optimal candidate surface form (with probability 1). The underlying form /taʔo-en/ instead displays variation between the two surface forms [taʔ.wen] and [taw.ʔen], which BH assume to be equally frequent.

(2)

a. Underlying forms (all equally frequent): /paʔlak/, /ʔajo-en/, /basa-en/, /taʔo-en/

b. Candidate surface forms and their probabilities: /paʔlak/ → [pa.lak] (1); /ʔajo-en/ → [ʔaj.wen] (1); /basa-en/ → [ba.sa.ʔen] (1); /taʔo-en/ → [taʔ.wen] (.5), [taw.ʔen] (.5); all remaining candidates (listed in table 3) have probability 0

BH assume the constraints listed in (3) and show that the data are properly accounted for by the ranking conditions displayed. The solid lines connect constraints whose ranking values need to be separated by a large distance, so as not to be swappable by the stochastic component. The dotted line instead indicates that the ranking values of the constraints LINEARITY and *ʔ] must be equal in order to model the free variation displayed by /taʔo-en/.

(3)

[Ranking diagram. In particular, the dotted line connects LINEARITY and *ʔ], whose ranking values must coincide, while MAXIO(ʔ) must end up well below both.]

3.2 Simulation Results

We have run the stochastic EDRA on the Ilokano metathesis test case (2)–(3) with the four reranking rules in table 1.⁵ Table 2 reports the final ranking vector learned for each of the four reranking rules.⁶ The quality of these final ranking vectors is evaluated in table 3. The first two columns list all pairs of an underlying form and a corresponding candidate together with the actual probability of the corresponding mapping. The four remaining columns provide the frequency of each mapping predicted by the four final ranking vectors listed in table 2.⁷

Table 2
Ranking values learned by the stochastic EDRA with the four reranking rules listed in table 1 on the Ilokano metathesis test case described in (2)–(3)
GLA | minGLA | EDCD | CEDRA
IDENTIO[low] 142.0 IDENTIO[low] 154.0 ONSET 100.0 ONSET 113.0 
MAXIO(V) 140.0 *LOWGLIDE 152.0 MAXOO(ʔ) 100.0 IDENTIO[low] 111.3 
ONSET 138.0 ONSET 152.0 MAXIO(V) 100.0 MAXIO(V) 111.0 
*LOWGLIDE 138.0 MAXIO(V) 150.0 IDENTIO[low] 100.0 *LOWGLIDE 108.7 
*[ʔC 114.0 MAXOO(ʔ) 120.0 *[ʔC 100.0 *[ʔC 103.0 
MAXOO(ʔ) 110.0 *[ʔC 120.0 *LOWGLIDE 100.0 MAXOO(ʔ) 100.0 
DEPIO(ʔ) 98.0 DEPIO(ʔ) 106.0 DEPIO(ʔ) 50.0 DEPIO(ʔ) 64.0 
LINEARITY 67.0 LINEARITY 81.1 IDENTIO[syl] 10.0 IDENTIO[syl] 22.0 
*ʔ] 67.0 *ʔ] 80.9 *ʔ] −897.9 *ʔ] −304.3 
MAXIO(ʔ) 24.0 IDENTIO[syl] 72.0 LINEARITY −898.0 LINEARITY −304.6 
IDENTIO[syl] 24.0 MAXIO(ʔ) 38.0 MAXIO(ʔ) −900.8 MAXIO(ʔ) −309.1 
Table 3
Probabilities of each underlying/surface form mapping predicted by the four final ranking vectors reported in table 2 
Actual | GLA | minGLA | EDCD | CEDRA
(/paʔlak/, [pa.lak]) .75 .91 
(/paʔlak/, [paʔ.lak]) .12 .04 
(/paʔlak/, [pal.ʔak]) .12 .05 
(/paʔlak/, [pa.ʔlak]) 
(/ʔajo-en/, [ʔaj.wen]) 
(/ʔajo-en/, [ʔa.jen]) 
(/ʔajo-en/, [ʔa.jo.ʔen]) 
(/ʔajo-en/, [ʔa.jo.en]) 
(/basa-en/, [ba.sa.ʔen]) 
(/basa-en/, [bas.
graphic
en]) 
(/basa-en/, [ba.sen]) 
(/basa-en/, [ba.sa.en]) 
(/basa-en/, [bas.wen]) 
(/taʔo-en/, [taʔ.wen]) .5 .49 .54 .49 .45 
(/taʔo-en/, [taw.ʔen]) .5 .50 .46 .51 .54 
(/taʔo-en/, [ta.ʔo.en]) 
(/taʔo-en/, [ta.ʔen]) 
(/taʔo-en/, [ta.wen]) 
(/taʔo-en/, [ta.ʔwen]) 
(/taʔo-en/, [ta.ʔo.ʔen]) 

As table 3 shows, all four algorithms manage to learn the stochastic behavior of the underlying form /taʔo-en/. Furthermore, all four algorithms manage to learn the deterministic behavior of the underlying forms /ʔajo-en/ and /basa-en/. The critical test case is the underlying form /paʔlak/: both the GLA and the minGLA succeed, while the CEDRA comes up short and EDCD fails. What makes /paʔlak/ hard to learn? How do the GLA and the minGLA actually manage to succeed? Why is it that EDCD and the CEDRA instead fail?

3.3 Restating the Test Case in ERC Notation

To start, we describe the Ilokano metathesis test case (2)–(3) in ERC (Elementary Ranking Condition; Prince 2002) notation, as in table 4. In the leftmost column, we list all possible triplets (/x/, [y], [z]) of an underlying form /x/, a corresponding winner [y] (namely, any candidate for that underlying form that has a nonnull probability), and any other candidate [z] different from that winner, which therefore counts as a loser in the current comparison. For instance, the first triplet (/paʔlak/, [pa.lak], [pa.ʔlak]) consists of the underlying form /paʔlak/, its winner candidate [pa.lak], and one of its loser candidates, in this case [pa.ʔlak]. (We adopt the convention of striking out the loser in each triplet, in order to distinguish it from the winner.) The remaining triplets in the first block are obtained by considering all possible loser candidates for that underlying form /paʔlak/. The next two blocks, corresponding to the underlying forms /ʔajo-en/ and /basa-en/, are constructed analogously. Finally, the underlying form /taʔo-en/ comes with two winners, [taw.ʔen] and [taʔ.wen]. It thus yields two blocks of triplets, each corresponding to this underlying form, one of the two winners, and all remaining loser candidates.

Table 4

ERC description of the Ilokano metathesis test case (2)–(3)

(/x/, [y], [z]) | ONSET | MAXIO(V) | *LOWGLIDE | IDENTIO[low] | *[ʔC | MAXOO(ʔ) | DEPIO(ʔ) | IDENTIO[syl] | LINEARITY | *ʔ] | MAXIO(ʔ)
(/paʔlak/, [pa.lak], [pa.ʔlak]    W      L 
(/paʔlak/, [pa.lak], [pal.ʔak]        W  L 
(/paʔlak/, [pa.lak], [paʔ.lak]         W L 
(/ʔajo-en/, [ʔaj.wen], [ʔa.jen] W      L    
(/ʔajo-en/, [ʔaj.wen], [ʔa.jo.en]W       L    
(/ʔajo-en/, [ʔaj.wen], [ʔa.jo.ʔen]      W L    
(/basa-en/, [ba.sa.ʔen], [bas.
graphic
en]) 
  W    L W    
(/basa-en/, [ba.sa.ʔen], [bas.wen]   W   L W    
(/basa-en/, [ba.sa.ʔen], [ba.sen] W     L     
(/basa-en/, [ba.sa.ʔen], [ba.sa.en]W      L     
(/taʔo-en/, [taw.ʔen], [taʔ.wen]        L W  
(/taʔo-en/, [taw.ʔen], [ta.wen]     W   L  W 
(/taʔo-en/, [taw.ʔen], [ta.ʔen] W      L L   
(/taʔo-en/, [taw.ʔen], [ta.ʔo.en]W       L L   
(/taʔo-en/, [taw.ʔen], [ta.ʔo.ʔen]      W L L   
(/taʔo-en/, [taw.ʔen], [ta.ʔwen]    W    L   
(/taʔo-en/, [taʔ.wen], [taw.ʔen]        W L  
(/taʔo-en/, [taʔ.wen], [ta.wen]     W    L W 
(/taʔo-en/, [taʔ.wen], [ta.ʔen] W      L  L  
(/taʔo-en/, [taʔ.wen], [ta.ʔo.en]W       L  L  
(/taʔo-en/, [taʔ.wen], [ta.ʔo.ʔen]      W L  L  
(/taʔo-en/, [taʔ.wen], [ta.ʔwen]    W     L  

Each underlying/winner/loser form triplet (/x/, [y], [z]) sorts the constraints into winner-preferring (i.e., those that assign fewer violations to the winner [y] than to the loser [z]), loser-preferring (i.e., those that assign fewer violations to the loser [z] than to the winner [y]), and even. We write a W (or an L) when the constraint corresponding to the column considered is winner-preferring (or loser-preferring) relative to the triplet corresponding to the row considered (while the entries corresponding to even constraints are left empty for readability). To illustrate, the entry corresponding to the first ERC (/paʔlak/, [pa.lak], [pa.ʔlak]) and the markedness constraint *[ʔC is a W because that constraint is winner-preferring, as it is violated by the loser [pa.ʔlak] but not by the winner [pa.lak]. The ERC matrix thus obtained summarizes the actions available to the EDRA: each update is triggered by one of these ERCs, and it can be described as promoting the constraints with a W and demoting (some of) the constraints with an L.
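An ERC row can be computed mechanically from the violation profiles of the winner and the loser. The sketch below does this for the first row of table 4; the violation counts are illustrative assumptions chosen to reproduce the W/L pattern reported there.

```python
def erc_row(winner_violations, loser_violations, constraints):
    """Sort every constraint into winner-preferring ('W'), loser-preferring ('L'),
    or even ('') by comparing its violations of winner and loser."""
    def entry(c):
        w, l = winner_violations.get(c, 0), loser_violations.get(c, 0)
        return "W" if w < l else "L" if w > l else ""
    return {c: entry(c) for c in constraints}

# First row of table 4: winner [pa.lak] versus loser [pa.ʔlak] for /paʔlak/.
winner = {"MAXIO(ʔ)": 1}   # [pa.lak] deletes the glottal stop
loser = {"*[ʔC": 1}        # [pa.ʔlak] violates *[ʔC instead
print(erc_row(winner, loser, ["*[ʔC", "MAXIO(ʔ)"]))
# {'*[ʔC': 'W', 'MAXIO(ʔ)': 'L'}
```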

3.4 First Round of Simplifications

The six leftmost columns in table 4 have W’s but no L’s. The corresponding constraints are therefore never demoted; that is, they never drop below their initial ranking value. We focus on the triplets that have a W corresponding to one of these six constraints. The following fact 1 guarantees that these triplets can trigger only “few” updates in any run of the stochastic EDRA. The online appendix provides a more explicit formulation of this fact (with an explicit bound on the number of updates that these ERCs can trigger).

Fact 1 Consider an arbitrary run of the stochastic EDRA on the Ilokano metathesis test case (2)–(3) with any of the four reranking rules listed in table 1. Assume that the stochastic values are sampled between −Δ and +Δ. Each of the triplets in table 4 that has a W corresponding to one of the six leftmost constraints can trigger only a “small” number of updates.

Fact 1 is empirically confirmed by a closer look at the simulations. The first column of table 5 lists all triplets. The columns headed “#U” (“number of updates”) report the total number of updates triggered by that triplet in one of our simulations. For instance, the entry 2 corresponding to the first triplet and the GLA reranking rule says that that triplet has triggered only 2 updates in a run of the GLA. For each simulation, we have also maintained an incremental counter of the number of updates: it starts at zero and is increased by 1 whenever the algorithm performs an update. The columns headed “LU” (“last update”) report the value of the counter when the corresponding triplet has triggered its last update in the simulation considered. For instance, the entry 26 corresponding to the first triplet and the GLA reranking rule says that that triplet has triggered the 26th update and has not triggered any additional updates afterward. Finally, we have also maintained an incremental counter of the number of iterations: it starts at zero and is increased by 1 at every iteration, no matter whether an update is performed at that iteration or not. The columns headed “LI” (“last iteration”) report the value of the counter when the corresponding triplet has triggered its last update in the simulation considered. For instance, the entry 43 corresponding to the first triplet and the GLA reranking rule says that that triplet has triggered an update at the 43rd iteration and has not triggered any additional updates afterward. Overall, table 5 thus provides information on the number of updates triggered by each triplet and on how late into the run each triplet has remained active. The triplets that have a W corresponding to one of the six leftmost constraints in table 4 are highlighted in dark gray in table 5. They are shown to trigger only a small number of updates and only at the very beginning of the run, as indeed stated by fact 1.

Table 5
Contribution of each underlying/winner/loser form triplet (/x/, [y], [z]) in table 4 to the learning dynamics: number of updates triggered by that triplet (#U); number of updates overall performed before that triplet has triggered its last update (LU); iteration at which that triplet has triggered its last update (LI)
GLA | minGLA | EDCD | CEDRA
(/x/, [y], [z]) | #U LU LI | #U LU LI | #U LU LI | #U LU LI
(/paʔlak/, [pa.lak], [pa.ʔlak]26 43 121 294 15 20 
(/paʔlak/ , [pa.lak], [pal.ʔak]19 822 5,398 17 1,021 6,380 630 3,832 20,994 223 3,096 20,965 
(/paʔlak/, [pa.lak], [paʔ.lak]21 899 6,059 19 621 3,413 589 3,829 20,988 205 3,092 20,938 
(/ʔajo-en/ , [ʔaj.wen.], [ʔa.jen]10 
(/ʔajo-en/, [ʔaj.wen], [ʔa.jo.en]55 104 15 20 46 81 
(/ʔajo-en/, [ʔaj.wen], [ʔa.jo.ʔen]10 236 975 32 1,080 6,940 20 1,357 6,997 20 723 4,143 
(/basa-en/, [ba.sa.ʔen], [bas.
graphic
en]
18 584 3,611 26 1,081 6,946 891 4,494 13 407 2,054 
(/basa-en/, [ba.sa.ʔen], [bas.wen]20 875 5,894 27 920 5,632 200 735 16 944 5,626 
(/basa-en/, [ba.sa.ʔen], [ba.sen]16 824 5,410 20 978 6,027 656 3,288 235 951 
(/basa-en/, [ba.sa.ʔen], [ba.sa.en]15 997 6,881 19 1,069 6,890 1,110 5,687 911 5,436 
(/taʔo-en/, [taw.ʔen], [taʔ.wen]1,275 2,730 20,968 1,285 2,839 20,995 1,253 3,826 20,967 1,269 3,095 20,951 
(/taʔo-en/, [taw.ʔen], [ta.wen]50 101 829 4,907 34 61 
(/taʔo-en/, [taw.ʔen], [ta.ʔen]13 19 57 114 29 51 
(/taʔo-en/, [taw.ʔen], [ta.ʔo.en]17 27 53 99 52 109 
(/taʔo-en /, [taw.ʔen], [ta.ʔo.ʔen]24 782 5,091 35 1,082 6,947 21 1,339 6,904 16 686 3,879 
(/taʔo-en/, [taw.ʔen], [ta.ʔwen]159 546 1,084 6,957 28 52 
(/taʔo-en/, [taʔ.wen], [taw.ʔen]1,262 2,733 21,000 1,300 2,840 21,000 1,274 3,830 20,989 1,288 3,100 20,996 
(/taʔo-en/, [taʔ.wen], [ta.wen]78 207 946 5,821 
(/taʔo-en/, [taʔ.wen], [ta.ʔen]42 79 26 50 14 19 
(/taʔo-en/, [taʔ.wen], [ta.ʔo.en]51 102 82 182 45 90 
(/taʔo-en /, [taʔ.wen], [ta.ʔo.ʔen]32 965 6,609 28 806 4,756 575 2,773 21 936 5,586 
(/taʔo-en/, [taʔ.wen], [ta.ʔwen]559 3,388 271 1,015 42 80 

Fact 1 holds trivially in the case of EDCD, which performs constraint demotion but no constraint promotion. For concreteness, let us focus on the first underlying/winner/loser form triplet (/paʔlak/, [pa.lak], [pa.ʔlak]). Its corresponding ERC has a W corresponding to the constraint *[ʔC, which is indeed among the six leftmost constraints with no L’s. Thus, *[ʔC is never demoted during the run and always sits at its initial ranking value, as represented in (4).

(4)

[Diagram: the winner-preferring constraint *[ʔC remains at its initial ranking value of 100, while the loser-preferring constraint MAXIO(ʔ) is demoted further and further below it.]

After this triplet has triggered 2Δ updates, its loser-preferring constraint MAXIO(ʔ) has been demoted at least 2Δ times. As this constraint is never promoted (because EDCD performs no constraint promotion), the ranking value of MAXIO(ʔ) has decreased by at least 2Δ⁸ (possibly more, in case MAXIO(ʔ) has been demoted by some other triplets as well). The winner-preferring constraint *[ʔC is thus ranked above the loser-preferring constraint MAXIO(ʔ) with a separation of at least 2Δ. If the stochastic values are sampled between −Δ and +Δ, they will never be able to swap the two constraints. This triplet will therefore never be able to trigger any further update, as stated by fact 1. Online appendix A formalizes this intuitive reasoning. The proof of fact 1 for the GLA, the minGLA, and the CEDRA is based on the same intuition as for EDCD, but it is more involved and therefore relegated to online appendix B. The additional difficulty is that, after the triplet considered has triggered 2Δ updates in a run of these three algorithms, we cannot straightforwardly conclude that the ranking value of its loser-preferring constraint MAXIO(ʔ) has decreased by at least 2Δ because that constraint could also have been promoted by some other triplets in the meantime. In the case of EDRAs that also perform constraint promotion, the connection between the number of updates triggered by an ERC and the ranking values of its loser-preferrers is harder to establish and requires a more subtle analysis.
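This argument for EDCD can be checked with a few lines of simulation: once the winner-preferrer *[ʔC leads the loser-preferrer MAXIO(ʔ) by more than 2Δ, stochastic values bounded in [−Δ, +Δ] can no longer swap them, so the triplet stops triggering updates. The value of Δ and the initial ranking values below are illustrative assumptions.

```python
import random

DELTA = 2.0
values = {"*[ʔC": 100.0, "MAXIO(ʔ)": 100.0}
updates = 0

for _ in range(10000):
    stochastic = {c: v + random.uniform(-DELTA, DELTA) for c, v in values.items()}
    # The triplet triggers an update only if the loser-preferrer MAXIO(ʔ) ends up
    # with a larger stochastic ranking value than the winner-preferrer *[ʔC.
    if stochastic["MAXIO(ʔ)"] > stochastic["*[ʔC"]:
        values["MAXIO(ʔ)"] -= 1.0  # EDCD demotes the undominated loser-preferrer
        updates += 1

print(updates, values["*[ʔC"] - values["MAXIO(ʔ)"])
# At most a handful of updates: once the separation reaches 2Δ = 4,
# no sampled stochastic values can reverse the two constraints.
```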

Let us take stock. The six leftmost constraints in table 4 are never loser-preferring and are therefore never demoted. We focus on the ERCs that have a W corresponding to one of these six constraints. Fact 1 guarantees that after a few updates, the current ranking vector always satisfies these ERCs. The learning dynamics is therefore governed in the long run by the remaining ERCs, which are collected for convenience in the simplified table 6.

Table 6

ERC description of the Ilokano metathesis test case after the first round of simplifications

(/x/, [y], [z]) | ONSET | MAXIO(V) | *LOWGLIDE | IDENTIO[low] | *[ʔC | MAXOO(ʔ) | DEPIO(ʔ) | IDENTIO[syl] | LINEARITY | *ʔ] | MAXIO(ʔ)
(/paʔlak/, [pa.lak], [pal.ʔak]        W  L 
(/paʔlak/, [pa.lak], [paʔ.lak]         W L 
(/ʔajo-en/, [ʔaj.wen], [ʔa.jo.ʔen]      W L    
(/taʔo-en/, [taw.ʔen], [taʔ.wen]        L W  
(/taʔo-en/, [taw.ʔen], [ta.ʔo.ʔen]      W L L   
(/taʔo-en/, [taʔ.wen], [taw.ʔen]        W L  
(/taʔo-en/, [taʔ.wen], [ta.ʔo.ʔen]      W L  L  

3.5 Second Round of Simplifications

We can now repeat the same reasoning. Constraint DEPIO(ʔ) in table 6 is winner-preferring but never loser-preferring, and it is therefore never demoted. Fact 2 thus guarantees that those triplets in table 6 that have a W corresponding to this constraint can only trigger “few” updates. Indeed, these triplets are shown in table 5 (where they have been highlighted in light gray) to trigger only “few” updates and only toward the beginning of the run. A more precise formulation of fact 2 is provided in the online appendix, together with a proof.

Fact 2 Consider an arbitrary run of the stochastic EDRA on the Ilokano metathesis test case (2)–(3) with any of the four reranking rules listed in table 1. Assume that the stochastic values are sampled between −Δ and +Δ. Each of the three triplets in table 6 that has a W corresponding to the constraint DEPIO(ʔ) can trigger only a “small” number of updates.

Fact 2 guarantees that after a few updates, the current ranking vector always satisfies the ERCs corresponding to those triplets in table 6 that have a W corresponding to the constraint DEPIO(ʔ) and an L corresponding to the constraint IDENTIO[syl]. The learning dynamics is therefore governed in the long run by the remaining triplets, which are collected for convenience in the further simplified table 7. The kernel of the Ilokano metathesis test case thus consists of learning the ranking conditions highlighted in the light gray box in (3): the two constraints LINEARITY and *ʔ] must be assigned the same ranking value while the constraint MAXIO(ʔ) must slide underneath them, settling on a smaller ranking value, so that it cannot be swapped easily with the other two constraints.

Table 7

ERC description of the kernel of the Ilokano metathesis test case after the second round of simplifications

(/x/, [y], [z]) | ONSET | MAXIO(V) | *LOWGLIDE | IDENTIO[low] | *[ʔC | MAXOO(ʔ) | DEPIO(ʔ) | IDENTIO[syl] | LINEARITY | *ʔ] | MAXIO(ʔ)
ERC 1=(/paʔlak/, [pa.lak], [pal.ʔak]        W  L 
ERC 2=(/paʔlak/, [pa.lak], [paʔ.lak]         W L 
ERC 3=(/taʔo-en/, [taw.ʔen], [taʔ.wen]        W L  
ERC 4=(/taʔo-en/, [taʔ.wen], [taw.ʔen]        L W  

3.6 Why the GLA and the minGLA Succeed

We focus on the kernel of the Ilokano metathesis test case described in table 7. The dynamics of the ranking values in a run of the GLA in this case is plotted in table 8a. The horizontal axis plots the number of iterations and the vertical axis plots the ranking values of the three active constraints. The two constraints LINEARITY and *ʔ] quickly rise to their final ranking values (115.88 and 116.12, respectively) and then just keep oscillating without moving away from that position.9 The constraint MAXIO(ʔ) quickly drops to its final ranking value (68.0), well separated underneath the other two constraints, and then stays there.

To gain some intuition into this ranking dynamics, suppose the GLA is trained only on the underlying form /paʔlak/. The two corresponding ERCs 1 and 2 in table 7 promote their winner-preferring constraints LINEARITY and *ʔ] and both demote their shared loser-preferring constraint MAXIO(ʔ). These ERCs trigger updates until the two winner-preferring constraints are separated from the loser-preferring constraint by a distance large enough that they cannot be swapped. As soon as that ranking configuration is achieved after a few updates, the GLA stops performing any further updates and learning effectively ceases, as plotted in table 8b.

The ranking dynamics in table 8b corresponding to a run of the GLA on only the underlying form /paʔlak/ and the dynamics in table 8a corresponding to both underlying forms /paʔlak/ and /taʔo-en/ have the same shape (ignoring the oscillations in the upper branch). The reason is easy to grasp. Suppose that the GLA is trained only on the underlying form /taʔo-en/. The two corresponding triplets (/taʔo-en/, [taw.ʔen], [taʔ.wen]) and (/taʔo-en/, [taʔ.wen], [taw.ʔen]) differ because the winner and loser forms are swapped. Thus, their two corresponding ERCs 3 and 4 in table 7 are opposites: one has a W where the other has an L. Crucially, the GLA promotes and demotes winner- and loser-preferring constraints by exactly the same amount. The updates by these two ERCs 3 and 4 thus cancel each other out: one of them displaces the two constraints LINEARITY and *ʔ] up or down and the other shifts them back to their original position. If the two constraints start with the same initial ranking value, ERCs 3 and 4 do not displace them but just keep them oscillating, as in table 8c.
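The kernel dynamics just described can be reproduced with a small simulation, sketched below under simplifying assumptions: only the three active constraints and the kernel candidates are represented, the two underlying forms are sampled with equal probability, and the stochastic values are uniform in [−Δ, +Δ]. The violation profiles are our own reconstruction of the relevant candidate comparisons, not BH's tableaux.

```python
import random

DELTA = 2.0
CANDIDATES = {
    "/paʔlak/": {"[pa.lak]": {"MAXIO(ʔ)": 1},
                 "[pal.ʔak]": {"LINEARITY": 1},
                 "[paʔ.lak]": {"*ʔ]": 1}},
    "/taʔo-en/": {"[taʔ.wen]": {"*ʔ]": 1},
                  "[taw.ʔen]": {"LINEARITY": 1}},
}
WINNERS = {"/paʔlak/": ["[pa.lak]"], "/taʔo-en/": ["[taʔ.wen]", "[taw.ʔen]"]}
values = {"LINEARITY": 100.0, "*ʔ]": 100.0, "MAXIO(ʔ)": 100.0}

def predicted(ur, stochastic):
    order = sorted(stochastic, key=stochastic.get, reverse=True)
    return min(CANDIDATES[ur],
               key=lambda f: [CANDIDATES[ur][f].get(c, 0) for c in order])

for _ in range(20000):
    ur = random.choice(list(CANDIDATES))
    intended = random.choice(WINNERS[ur])
    stochastic = {c: v + random.uniform(-DELTA, DELTA) for c, v in values.items()}
    loser = predicted(ur, stochastic)
    if loser != intended:
        for c in values:  # GLA reranking rule: p = d = 1
            w, l = CANDIDATES[ur][intended].get(c, 0), CANDIDATES[ur][loser].get(c, 0)
            if w < l:
                values[c] += 1.0
            elif w > l:
                values[c] -= 1.0

print({c: round(v, 1) for c, v in values.items()})
# LINEARITY and *ʔ] end up close together, well above MAXIO(ʔ), as in table 8a.
```

Swapping in the other promotion amounts and demotion sets of table 1 should reproduce the contrast with EDCD and the CEDRA discussed in section 3.7.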

Table 8
Ranking dynamics of the three constraints LINEARITY, *ʔ], and MAXIO(ʔ) when the GLA, EDCD, and the CEDRA are trained on (a) both underlying forms /paʔlak/ and /taʔo-en/ in table 7, (b) only the underlying form /paʔlak/, (c) only the underlying form /taʔo-en/.
[Plots omitted: one column each for the GLA, EDCD, and the CEDRA; rows (a)–(c).]

As all ERCs in table 7 have a single L, the reranking rules of the GLA and minGLA coincide in this case and the preceding considerations extend to the minGLA, leading to fact 3. A more precise formulation is provided in online appendix C, together with a proof.

Fact 3 Consider an arbitrary run of the stochastic EDRA on the Ilokano metathesis test case (2)–(3) with the GLA or the minGLA reranking rule (described in table 1). Assume that the stochastic values are sampled between −Δ and +Δ. The two ERCs 1 and 2 corresponding to the underlying form /paʔlak/ can trigger only a “small” number of updates.

Facts 1, 2, and 3 together guarantee that all ERCs of the Ilokano metathesis test case listed in the original table 4 can trigger only a few updates, except the two ERCs corresponding to (/taʔo-en/, [taw.ʔen], [taʔ.wen]) and (/taʔo-en/, [taʔ.wen], [taw.ʔen]). In other words, from a certain moment on in any run of the GLA and the minGLA, only these two triplets (/taʔo-en/, [taw.ʔen], [taʔ.wen]) and (/taʔo-en/, [taʔ.wen], [taw.ʔen]) will trigger updates. Since they differ because the winner and loser forms are swapped, their corresponding ERCs are opposites. Since the GLA and the minGLA promote and demote by exactly the same amount, these two ERCs thus effectively maintain their two active constraints LINEARITY and *ʔ] close to each other as required to model variation. We conclude that facts 1, 2, and 3 prove that the GLA and the minGLA always succeed on the Ilokano metathesis test case.

3.7 Why EDCD and the CEDRA Fail

We focus again on the kernel of the Ilokano metathesis test case described in table 7. The dynamics of the ranking values in a run of EDCD in this case is plotted in table 8a. The ranking values of the three active constraints LINEARITY, *ʔ], and MAXIO(ʔ) start out all together and are never separated (the final ranking values are −1884.38, −1884.64, and −1886.98, respectively). EDCD has thus failed to learn that the constraint MAXIO(ʔ) needs to be ranked at a safe distance underneath both constraints LINEARITY and *ʔ]. As shown by the ERC matrix in table 7, the latter ranking condition is needed to account for the deterministic mapping of /paʔlak/ to [pa.lak], thus explaining the failure of EDCD on this mapping diagnosed in table 3.

What explains the difference in behavior between EDCD and the GLA? To start, suppose that EDCD is trained on the underlying form /paʔlak/ only. Since EDCD performs no constraint promotion, the two corresponding ERCs 1 and 2 do not rerank the two winner-preferring constraints LINEARITY and *ʔ]. But they both demote the loser-preferring constraint MAXIO(ʔ). After a few updates, this loser-preferring constraint has dropped to a safe distance. EDCD thus stops performing any updates and learning effectively ceases, as plotted in table 8b. EDCD’s dynamics is analogous (in scale and overall shape) to the GLA’s dynamics in table 8b: when trained on /paʔlak/ only, the two algorithms behave roughly in the same way.

The crucial difference shows up when the GLA and EDCD are trained on the underlying form /taʔo-en/, which displays variation. As noted above, the two corresponding ERCs 3 and 4 are opposites: LINEARITY and *ʔ] are winner-preferring in one of the two ERCs and loser-preferring in the other. Crucially, EDCD performs constraint demotion but no constraint promotion. When trained on these two ERCs 3 and 4, EDCD thus forces LINEARITY and *ʔ] into free fall, as shown by the ranking dynamics plotted in table 8c.¹⁰ This dynamics is completely different from the GLA’s dynamics in table 8c: LINEARITY and *ʔ] keep oscillating in place because the GLA’s promotions compensate for the demotions.

As ERCs 3 and 4 force LINEARITY and *ʔ] into free fall in the case of EDCD, ERCs 1 and 2 need to keep triggering updates in order to try to slide the constraint MAXIO(ʔ) underneath them, without ever managing to achieve the needed separation. Indeed, table 5 for EDCD shows that ERCs 1 and 2 (corresponding to the top white rows in the table) trigger many updates (roughly 600) and remain active until the end of the simulation (close to the last, 21,000th iteration). In the case of the GLA, ERCs 3 and 4 instead do not displace LINEARITY and *ʔ] and ERCs 1 and 2 thus easily slide the constraint MAXIO(ʔ) underneath them. Indeed, table 5 for the GLA shows that ERCs 1 and 2 trigger few updates (roughly 20) and only toward the beginning of the simulation (they become inactive after the 7,000th iteration).

The case of the CEDRA is analogous to the case of EDCD. The two learners differ because the latter performs no constraint promotion while the former performs little promotion: it crucially promotes less than it demotes. For instance, ERC 3 promotes LINEARITY by 0.5 (that is, p = 1/(w + 1) = 0.5, where w = 1 is the number of winner-preferring constraints relative to ERC 3), and ERC 4 demotes it by 1.¹¹ Overall, the ranking value of LINEARITY has thus decreased by .5, while it would have decreased by 1 in the case of EDCD. In the case of the CEDRA, we thus expect the same free fall of the three constraints LINEARITY, *ʔ], and MAXIO(ʔ) as in the case of EDCD, only slower. That is indeed what happens: the shape of the ranking dynamics reported in table 8 for the CEDRA is identical to the shape of the dynamics reported in table 8 for EDCD, only with a smaller scale because of a promotion amount equal to .5 rather than just 0.
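The net drift can be read off directly from the promotion and demotion amounts: across one ERC 3 update and one ERC 4 update, LINEARITY (and likewise *ʔ]) is promoted once and demoted once. A minimal arithmetic check:

```python
def net_drift(p, d=1.0):
    # One update as a winner-preferrer (+p) plus one update as a loser-preferrer (-d).
    return p - d

print(net_drift(p=1.0))  # GLA:   0.0, the two updates cancel out
print(net_drift(p=0.5))  # CEDRA: -0.5, slow downward drift
print(net_drift(p=0.0))  # EDCD:  -1.0, fastest free fall
```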

This analysis shows that the failure of EDCD and the CEDRA does not depend on the choice of the initial ranking values. Indeed, set the initial ranking values of EDCD or the CEDRA equal to the final ranking values learned by the GLA, as reported in the first column of table 2. These ranking values already account for all the attested frequencies. A learner starting from those ranking values thus effectively has nothing to learn. Yet, EDCD and the CEDRA fail also with such a favorable choice of the initial ranking values: the underlying form /taʔo-en/, which displays variation, keeps triggering updates forever and thus forces the constraints into free fall because nothing cancels out the demotions it triggers.

4 The Finnish Genitive Plurals Test Case

This section analyzes the performance of the stochastic EDRA on BH’s Finnish genitive plurals test case (based on data and analysis from Anttila 1997a,b).

4.1 Description of the Test Case

This test case consists of twenty-two underlying forms paired with two candidates each, corresponding to the two genitive plural suffixes /-jen/ and /-iden/. Fourteen of those underlying forms select their genitive form deterministically. They therefore give rise to a unique underlying/winner/loser form triplet each, listed in rows a–n of table 9. The remaining eight underlying forms display variation in the choice of the genitive plural suffix. They therefore give rise to two underlying/winner/loser form triplets each, listed in rows o–v of table 9. The second and third columns of the table provide the frequency of each underlying form and the frequency of the two variants (conditioned on the underlying form). BH propose an analysis of this phonological pattern based on ten constraints described in ERC notation in the rest of table 9. The two constraints WEIGHT-TO-STRESS (WTS; no unstressed heavy syllables) and *LAPSE (no consecutive unstressed syllables) are familiar from the OT literature on stress (Prince 1983). The two constraints *H.H and *L.L prohibit consecutive heavy and consecutive light syllables. Finally, the three constraints *Í, *Ó , and *Á and the three constraints *Ĭ, *Ŏ , and *Ă penalize surface stressed and unstressed syllables with underlying high/mid/low vowels.

Table 9

ERC description of Boersma and Hayes’s (2001) Finnish genitive plurals test case

(/x/, [y], [z]) | p(x) | p(y|x) | WTS | *Í | *L.L | *H.H | *Ŏ | *LAPSE | *Ó | *Á | *Ĭ | *Ă
a. (/sosialisti/, [só.si.a.lìs.ti.en], [só.si.a.lìs.tei.den].01737  W    W      
b. (/margariini/, [már.ga.rìi.ni.en], [már.ga.rìi.nei.den].1292  W    W      
c. (/edustusto/, [é.dus.tùs.to.jen], [é.dus.tùs.toi.den].01474  W    W      
d. (/italiaanno/, [í.ta.li.àa.no.jen], [í.ta.li.àa.noi.den].0001  W    W      
e. (/luonnehdinta/, [lúon.neh.dìn.to.jen], [lúon.neh.dìn.toi.den].0001  W    W      
f. (/evankelista/, [é.van.ke.lìs.to.jen], [é.van.ke.lìs.toi.den].0003  W    W      
g. (/kala/, [ká.lo.jen], [ká.loi.den].0877  W  L  W      
h. (/lasi/, [lá.si.en], [lá.sei.den].0877  W  L  W      
i. (/luettelo/, [lú.et.te.lòi.den], [lú.et.te.lo.jen].0044    W  L W W L   
j. (/televisio/, [té.le.vi.si.òi.den], [té.le.vi.si.o.jen].0072    W  L W W L   
k. (/kamera/, [ká.me.ròi.den], [ká.me.ro.jen].1264    W W L  W  L  
l. (/ajattelija/, [á.jat.te.li.jòi.den], [á.jat.te.li.jo.jen].0177    W W L  W  L  
m. (/taiteilija/, [tái.tei.li.jòi.den], [tái.tei.li.jo.jen].0484    W W L  W  L  
n. (/avantgardisti/, [á.vant.gàr.dis.ti.en], [á.vant.gàr.dis.tèi.den].0003   W   W  L   L 
o. (/korjaamo/, [kór.jaa.mo.jen], [kór.jaa.mòi.den].0748 .82     W L L W   
(/korjaamo/, [kór.jaa.mòi.den], [kór.jaa.mo.jen].18     L W W L   
p. (/koordinaatisto/, [kóor.di.nàa.tis.to.jen], [kóor.di .nàa.tis.tòi.den].0017 .8     W L L W   
(/koordinaatisto/, [kóor.di .nàa.tis.tòi.den], [kóor.di.nàa.tis.to.jen].2     L W W L   
q. (/aleksanteri/, [á.lek.sàn.te.ri.en], [á.lek.sàn.te.rèi.den].0029 .88  W L  W  L   L 
(/aleksanteri/, [á.lek.sàn.te.rèi.den], [á.lek.sàn.te.ri.en].12  L W  L  W   W 
r. (/ministeri/, [mí.nis.te.ri.en], [mí.nis.te.rèi.den].0479 .86  W L  W  L   L 
(/ministeri/, [mí.nis.te.rèi.den], [mí.nis.te.ri.en].14  L W  L  W   W 
s. (/naapuri /, [náa.pu.rèi.den], [náa.pu.ri.en].1023 .37  L W  L  W   W 
(/naapuri /, [náa.pu.ri.en], [náa.pu.rèi.den].63  W L  W  L   L 
t. (/poliisi/, [pó.lii.sèi.den], [pó.lii.si.en].1437 .02  L   L  W   W 
(/poliisi/, [pó.lii.si.en], [pó.lii.sèi.den].98  W   W  L   L 
u. (/hetero/, [hé.te.ròi.den], [hé.te.ro.jen].0686 .99   W  L W W L   
(/hetero/, [hé.te.ro.jen], [hé.te.ròi.den].01   L  W L L W   
v. (/maailma/, [máa.il.mo.jen], [máa.il.mòi.den].0159 .5    L W  L  W  
(/maailma/, [máa.il.mòi.den], [máa.il.mo.jen].5    W L  W  L  

4.2 Simulation Results

BH report that the final ranking vector learned by the GLA in their simulations closely matches the attested frequencies of the suffixes /-iden/ and /-jen/ except for the three underlying forms /ministeri/ (the actual probabilities are .143 and .857 while the probabilities predicted by the ranking vector learned by the GLA are .3051 and .6949), /aleksanteri/ (the actual probabilities are .118 and .882 while the predicted probabilities are .3049 and .6951), and /naapuri/ (the actual probabilities are .369 and .631 while the predicted probabilities are .3049 and .6951). The other three stochastic error-driven learners (the minGLA, EDCD, and the CEDRA) instead massively fail on this test case (detailed simulation results are provided in section 4.7). How does the GLA manage to largely succeed? Why does it fail on those three problematic underlying forms? Why do other implementations of stochastic error-driven learning (especially the minGLA, which is so similar to the GLA) instead fail?

4.3 Simplifying the Candidate Set: First Round

The ERC matrix in table 9 can be substantially simplified into the reduced ERC matrix in table 10.12 A first round of simplifications involves underlying forms that display a categorical behavior. The six underlying forms /sosialisti/, /margariini/, /edustusto/, /italiaanno/, /luonnehdinta/, and /evankelista/ in rows a–f of table 9 yield ERCs that have no loser-preferring constraints. They can thus be dropped because they impose no ranking conditions and never trigger any update. The two underlying forms /kala/ and /lasi/ in rows g–h yield identical ERCs. One of them (say, /lasi/ for concreteness) can thus be dropped and its probability mass added to the mass of the other one. Thus, the probability mass reported in table 10 for /kala/ (.1755) is the sum of the two masses reported in table 9 for /kala/ and /lasi/ (which are .0877 each). The two underlying forms /luettelo/ and /televisio/ in rows i–j also yield identical ERCs. One of them (say, /televisio/ for concreteness) can thus be dropped and its probability mass added to the mass of the other one. Finally, the three underlying forms /kamera/, /ajattelija/, and /taiteilija/ in rows k–m also yield three identical ERCs. Two of them (say, /ajattelija/ and /taiteilija/ for concreteness) can be dropped and their mass added to the mass of the remaining one.
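The pooling step used throughout this section is mechanical: rows of the ERC matrix with identical W/L patterns are merged and their probability masses added. A minimal sketch, with the ERC pattern represented as a tuple (the specific tuples below are schematic placeholders, not the actual columns of table 9):

```python
from collections import defaultdict

def pool_identical_ercs(rows):
    """rows is a list of (probability mass, erc pattern) pairs; rows with the
    same pattern are merged and their masses added."""
    pooled = defaultdict(float)
    for mass, pattern in rows:
        pooled[pattern] += mass
    return dict(pooled)

# The /kala/ and /lasi/ rows of table 9 yield identical ERCs with mass .0877 each.
rows = [(.0877, ("W", "L", "W")), (.0877, ("W", "L", "W"))]
print(pool_identical_ercs(rows))
# {('W', 'L', 'W'): 0.1754} (table 10 reports .1755, presumably from unrounded frequencies)
```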

Table 10

ERC description of Boersma and Hayes’s (2001) Finnish genitive plurals test case after the simplifications of the candidate set described in sections 4.3, 4.4, and 4.5

(/x/, [y], [z]) | p(x) | p(y|x) | WTS | *Í | *L.L | *H.H | *Ŏ | *LAPSE | *Ó | *Á | *Ĭ | *Ă
(/kala/, [ká.lo.jen], [ká.loi.den].1755  W  L  W      
(/luettelo/, [lú.et.te.lòi.den], [lú.et.te.lo.jen].0802    W  L W W L   
(/kamera/, [ká.me.ròi.den], [ká.me.ro.jen].1925    W W L  W  L  
(/avantgardisti/, [á.vant.gàr.dis.ti.en], [á.vant.gàr.dis.tèi.den].1440   W   W  L   L 
(/korjaamo/, [kór.jaa.mo.jen], [kór.jaa.mòi.den].0765 .82     W L L W   
(/korjaamo/, [kór.jaa.mòi.den], [kór.jaa.mo.jen].18     L W W L   
(/naapuri /, [náa.pu.rèi.den], [náa.pu.ri.en].1532 .29  L W  L  W   W 
(/naapuri /, [náa.pu.ri.en], [náa.pu.rèi.den].71  W L  W  L   L 
(/maailma/, [máa.il.mo.jen], [máa.il.mòi.den].0159 .5    L W  L  W  
(/maailma/, [máa.il.mòi.den], [máa.il.mo.jen].5    W L  W  L  

4.4 Simplifying the Candidate Set: Second Round

A second round of simplifications of BH’s Finnish genitive plurals test case involves underlying forms that display variation. The two underlying forms /korjaamo/ and /koordinaatisto/ in rows o–p of table 9 each yield variation between the same two ERCs. The probabilities of these two ERCs conditioned on the two underlying forms are almost identical. One of the two underlying forms (say, /koordinaatisto/ for concreteness) can thus be dropped and its probability mass added to the mass of the other one. The three underlying forms /aleksanteri/, /ministeri/, and /naapuri/ in rows q–s also each yield variation between the same two ERCs. The probabilities of these two ERCs conditioned on the first two underlying forms, /aleksanteri/ and /ministeri/, are almost identical, as reported in the third column. Yet the probabilities of these two ERCs conditioned on the last underlying form, /naapuri/, are very different. BH’s constraint set is therefore not sufficiently fine-grained to capture the difference between the two patterns of variation corresponding to /naapuri/ on the one hand and to /aleksanteri/ and /ministeri/ on the other hand. The failure of the GLA on these three underlying forms reported in section 4.2 is thus due not to the learning procedure but to an insufficiency of the constraint set.

The incorrect frequencies predicted by the GLA for these three underlying forms can be explained as follows. The frequencies of the three underlying forms /aleksanteri/, /ministeri/, and /naapuri/ are q(aleksanteri) = .0029, q(ministeri) = .0479, and q(naapuri) = .1023, respectively. As they do not add up to 1, we normalize them as Q(aleksanteri) = .0194, Q(ministeri) = .3127, and Q(naapuri) = .6678.¹³ Let p1(aleksanteri) = .88 and p2(aleksanteri) = .12 be the probabilities of the two shared ERCs conditioned on the underlying form /aleksanteri/; let p1(ministeri) = .86 and p2(ministeri) = .14 be the probabilities of those two shared ERCs conditioned on the underlying form /ministeri/; finally, let p1(naapuri) = .63 and p2(naapuri) = .37 be the probabilities of those two shared ERCs conditioned on the underlying form /naapuri/. We define the probabilities P1, P2 of these two shared ERCs as in (5): namely, as the average of the three original distributions for those two ERCs, weighted by the (normalized) frequencies of the three corresponding underlying forms.

(5)

  • P1 = Q(aleksanteri) · p1(aleksanteri) + Q(ministeri) · p1(ministeri) + Q(naapuri) · p1(naapuri) = .7067

  • P2 = Q(aleksanteri) · p2(aleksanteri) + Q(ministeri) · p2(ministeri) + Q(naapuri) · p2(naapuri) = .2932

The probabilities P1 and P2 just computed are indeed the probabilities learned by the GLA in BH’s simulations for the three underlying forms /aleksanteri/, /ministeri/, and /naapuri/ as reported in section 4.2. These are also the probabilities assumed in table 10.
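The weighted averages in (5) are easy to verify; a minimal recomputation:

```python
Q = {"aleksanteri": .0194, "ministeri": .3127, "naapuri": .6678}
p1 = {"aleksanteri": .88, "ministeri": .86, "naapuri": .63}
p2 = {"aleksanteri": .12, "ministeri": .14, "naapuri": .37}

P1 = sum(Q[x] * p1[x] for x in Q)  # frequency-weighted average of the first shared ERC
P2 = sum(Q[x] * p2[x] for x in Q)  # frequency-weighted average of the second shared ERC
print(round(P1, 4), round(P2, 4))  # 0.7067 0.2932
```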

4.5 Simplifying the Candidate Set: Third Round

A third round of simplifications of BH’s Finnish genitive test case involves interactions between underlying forms that display a categorical behavior and underlying forms that display variation. The underlying form /poliisi/ in row t of table 9 yields variation between the triplet (/poliisi/, [pó.lii.sèi.den], [pó .lii.si.en]), which has a very small conditional probability (.02), and the triplet (/poliisi/, [pó.lii.si.en], [pó .lii.sèi.den]), which instead has a very large probability (.98) and furthermore yields the same ERC as the categorical triplet (/avantgardisti/, [á.vant.gàr.dis.ti.en], [á.vant.gàr.dis.tèi.den]). We can thus drop the underlying form /poliisi/ and add its probability mass to the mass of the categorical underlying form /avantgardisti/. Analogous considerations hold for the underlying form /hetero/ in row u of table 9. The corresponding triplet (/hetero/, [hé.te.ròi.den], [hé.te.ro.jen]) has very large conditional probability (.99) and yields the same ERC as the categorical triplet (/luettelo/, [lú.et.te.lòi.den], [lú.et.te.lo.jen]). Once again, we can drop the underlying form /hetero/ and add its probability mass to that of the categorical underlying form /luettelo/. In the end, BH’s original Finnish genitive plurals test case can be substantially simplified as in table 10.14

4.6 Simplifying the Constraint Set

The constraint set used by BH and listed in tables 9 and 10 is a subset of Anttila’s (1997a,b) original constraint set. BH motivate the restriction to this subset through the observation that “[w]e found that we could derive the corpus frequencies accurately using only a subset of [Anttila’s] constraints” (2001:68). It turns out that the constraint set can be substantially further simplified: the four constraints *LAPSE, *Ó, *Á, and *Ĭ can be dropped without affecting the ability of stochastic OT to model the corpus frequencies accurately.15 The underlying/winner/loser form triplet (/avantgardisti/, [á.vant.gàr.dis.ti.en], [á.vant.gàr.dis.tèi.den]) can then also be dropped, because it has no loser-preferring constraints once *LAPSE and *Ĭ are dropped.16 In conclusion, the ERC matrix in table 10 is further simplified to table 11.
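
Footnote 15 suggests a mechanical way of spotting such redundancies: a constraint is a candidate for removal when its column in the ERC matrix is the exact opposite of another constraint’s column (a W wherever the other has an L, and vice versa), at least over the ERCs that matter. The sketch below implements just this opposite-column check on a toy ERC matrix; it is a diagnostic heuristic under that assumption, not BH’s or Anttila’s own procedure.

```python
# Toy ERC matrix: one dict per ERC, mapping constraint names to "W", "L", or "" (blank).
ercs = [
    {"C1": "W", "C2": "L", "C3": "",  "C4": ""},
    {"C1": "",  "C2": "W", "C3": "L", "C4": "W"},
    {"C1": "W", "C2": "",  "C3": "W", "C4": "L"},
]

def opposite(a: str, b: str) -> bool:
    """True if the two cells are mirror images (W vs. L, L vs. W, or both blank)."""
    return (a, b) in {("W", "L"), ("L", "W"), ("", "")}

constraints = sorted(ercs[0])
redundant_pairs = [
    (c, d)
    for i, c in enumerate(constraints)
    for d in constraints[i + 1:]
    if all(opposite(erc[c], erc[d]) for erc in ercs)
]
print(redundant_pairs)  # [('C3', 'C4')]: one of the two constraints can be dropped
```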

Table 11

ERC description of Boersma and Hayes’s (2001) Finnish genitive plurals test case after the simplifications of the constraint set described in section 4.6

(/x/, [y], [z]) | p(x) | p(y|x) | WTS | *L.L | *H.H | *Í | *Ŏ | *Ă
(/kala/, [ká.lo.jen], [ká.loi.den]) | .25 | | W | L | W | | |
(/luettelo/, [lú.et.te.lòi.den], [lú.et.te.lo.jen]) | .12 | | | W | L | | W |
(/kamera/, [ká.me.ròi.den], [ká.me.ro.jen]) | .28 | | | W | L | | | W
(/korjaamo/, [kór.jaa.mo.jen], [kór.jaa.mòi.den]) | .11 | .82 | | | W | | L |
(/korjaamo/, [kór.jaa.mòi.den], [kór.jaa.mo.jen]) | | .18 | | | L | | W |
(/maailma/, [máa.il.mo.jen], [máa.il.mòi.den]) | .02 | .5 | | | W | | | L
(/maailma/, [máa.il.mòi.den], [máa.il.mo.jen]) | | .5 | | | L | | | W
(/naapuri/, [náa.pu.rèi.den], [náa.pu.ri.en]) | .22 | .29 | | W | L | L | |
(/naapuri/, [náa.pu.ri.en], [náa.pu.rèi.den]) | | .71 | | L | W | W | |

The ranking conditions in (6) required by the Finnish test case are easily determined from table 11. The underlying form /korjaamo/, which displays variation, requires *H.H and *Ŏ to stay close; since *Ŏ is thereby tied to *H.H, the categorical underlying form /luettelo/ requires *L.L to be ranked above *H.H. Analogously, the underlying form /maailma/ requires *H.H and *Ă to stay close, and the underlying form /kamera/ therefore requires *L.L to be ranked above *H.H. The two blocks are completely analogous, except that it is *Ŏ that is active in the one case and *Ă in the other. The underlying form /kala/ requires WTS to be ranked well above *L.L. Finally, since *L.L is deterministically ranked above *H.H, the underlying form /naapuri/ requires *Í and *L.L to stay close.

(6)

  • WTS is ranked well above *L.L.
  • *L.L is ranked above *H.H.
  • *Ŏ and *Ă each stay close to *H.H.
  • *Í stays close to *L.L.

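To see how the closeness requirements in (6) translate into rates of variation, one can sample the stochastic grammar defined by a ranking vector and count how often one constraint ends up above another. The sketch below does this for the *H.H/*Ŏ pair that governs /korjaamo/, using the GLA’s final ranking values from table 13; the scale of the evaluation noise is treated here as a free parameter, since its exact setting follows BH’s schedule (see footnotes 5, 7, and 17).

```python
import random

# GLA final ranking values from table 13, restricted to the two constraints
# that are active in the /korjaamo/ ERCs.
theta = {"*H.H": 231.5, "*Ŏ": 228.8}
noise_scale = 2.0  # assumed scale of the Gaussian evaluation noise

def prob_outranks(higher, lower, samples=100_000):
    """Estimate how often `higher` receives a larger stochastic ranking value than `lower`."""
    wins = 0
    for _ in range(samples):
        a = theta[higher] + random.gauss(0.0, noise_scale)
        b = theta[lower] + random.gauss(0.0, noise_scale)
        wins += a > b
    return wins / samples

# Because the two ranking values are close, both orders occur with substantial
# probability, which is exactly what the variation displayed by /korjaamo/ requires.
print(prob_outranks("*H.H", "*Ŏ"))
```
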
4.7 Explaining the Simulation Results

We ran the stochastic EDRA on the simplified Finnish test case described in table 11 with the four reranking rules in table 1.17 Table 12 plots the ranking dynamics for each reranking rule. Table 13 provides the final ranking values, and table 14 compares the frequencies of each underlying/surface form mapping predicted by those ranking values to the actual frequencies.
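
For readers who want to reproduce the simulations, here is a minimal sketch of the stochastic error-driven loop for the GLA reranking rule (promote every winner-preferring constraint and demote every loser-preferring constraint by the plasticity). The ERC patterns and probabilities are those of table 11; detecting an error directly on the sampled ERC is equivalent here because each underlying form has only two candidates. The staged plasticity and noise schedule of footnote 17 is collapsed into a single stage, and the initial ranking value of 100 is an arbitrary choice, so this is an illustration of the dynamics rather than an exact replication.

```python
import random

CONSTRAINTS = ["WTS", "*L.L", "*H.H", "*Í", "*Ŏ", "*Ă"]

# Training ERCs from table 11 as (joint probability, W/L pattern);
# the joint probability of each ERC is p(x) * p(y|x), with p(y|x) = 1 for categorical forms.
DATA = [
    (0.25,   {"WTS": "W", "*L.L": "L", "*H.H": "W"}),   # /kala/
    (0.12,   {"*L.L": "W", "*H.H": "L", "*Ŏ": "W"}),    # /luettelo/
    (0.28,   {"*L.L": "W", "*H.H": "L", "*Ă": "W"}),    # /kamera/
    (0.0902, {"*H.H": "W", "*Ŏ": "L"}),                 # /korjaamo/ -> [kór.jaa.mo.jen]
    (0.0198, {"*H.H": "L", "*Ŏ": "W"}),                 # /korjaamo/ -> [kór.jaa.mòi.den]
    (0.01,   {"*H.H": "W", "*Ă": "L"}),                 # /maailma/ -> [máa.il.mo.jen]
    (0.01,   {"*H.H": "L", "*Ă": "W"}),                 # /maailma/ -> [máa.il.mòi.den]
    (0.0638, {"*L.L": "W", "*H.H": "L", "*Í": "L"}),    # /naapuri/ -> [náa.pu.rèi.den]
    (0.1562, {"*L.L": "L", "*H.H": "W", "*Í": "W"}),    # /naapuri/ -> [náa.pu.ri.en]
]

theta = {c: 100.0 for c in CONSTRAINTS}  # initial ranking values (arbitrary)
plasticity, noise = 1.0, 2.0             # single stage, for simplicity (cf. footnote 17)

for _ in range(100_000):
    # Sample an ERC from the training distribution and a stochastic ranking vector.
    erc = random.choices(DATA, weights=[w for w, _ in DATA])[0][1]
    stochastic = {c: theta[c] + random.gauss(0.0, noise) for c in CONSTRAINTS}
    # The ERC is satisfied iff the highest stochastically ranked active constraint is a W.
    top = max(erc, key=lambda c: stochastic[c])
    if erc[top] == "L":  # error: the current stochastic ranking prefers the loser
        for c, mark in erc.items():  # GLA update: promote W's and demote L's by the same amount
            theta[c] += plasticity if mark == "W" else -plasticity

print(sorted(theta.items(), key=lambda item: -item[1]))
```

The other three reranking rules differ from this sketch only in the update step: the minGLA demotes only the loser-preferring constraint with the largest stochastic ranking value, while EDCD and the CEDRA demote the undominated loser-preferring constraints and perform little or no promotion.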

Table 12
Ranking dynamics predicted by (a) the GLA, (b) the minGLA, (c) EDCD, and (d) the CEDRA trained on the simplified Finnish genitive plurals test case described in table 11 
Table 13
Ranking values learned by the stochastic EDRA with the four reranking rules listed in table 1 on the simplified Finnish genitive plurals test case described in table 11 
GLA | minGLA | EDCD | CEDRA
WTS 314.0 | WTS 2,364.5 | WTS 100.0 | WTS 108.7
*Í 277.7 | *Ă 2,361.8 | *Ă −977.5 | *Ă −998.8
*L.L 276.3 | *Í 2,361.5 | *Í −3,958.6 | *Í −1,864.0
*Ă 231.5 | *L.L 2,360.8 | *L.L −3,960.0 | *L.L −1,864.7
*H.H 231.5 | *H.H 2,359.4 | *H.H −3,962.1 | *H.H −1,866.0
*Ŏ 228.8 | *Ŏ 2,357.5 | *Ŏ −3,962.3 | *Ŏ −1,866.6
Table 14
Probabilities of each underlying/surface form mapping predicted by the four final ranking vectors reported in table 13 
Actual   GLA   minGLA   EDCD   CEDRA
(/kala/, [ká.lo.jen]) .906
(/kala/, [ká.loi.den]) .094
(/luettelo/, [lú.et.te.lòi.den]) .722 .821 .746
(/luettelo/, [lú.et.te.lo.jen]) .278 .179 .254
(/kamera/, [ká.me.ròi.den]) .884
(/kamera/, [ká.me.ro.jen]) .116
(/korjaamo/, [kór.jaa.mòi.den]) .18 .178 .255 .474 .417
(/korjaamo/, [kór.jaa.mo.jen]) .82 .822 .745 .526 .583
(/naapuri/, [náa.pu.rèi.den]) .29 .278 .354 .287 .345
(/naapuri/, [náa.pu.ri.en]) .71 .722 .646 .713 .655
(/maailma/, [máa.il.mòi.den]) .5 .503 .803 1 1
(/maailma/, [máa.il.mo.jen]) .5 .497 .197 0 0

The GLA succeeds for the by-now familiar reason. Because the GLA promotes and demotes by exactly the same amount, the three underlying forms /korjaamo/, /maailma/, and /naapuri/, which display variation, keep their active constraints oscillating but do not displace them. The three categorical underlying forms /kala/, /luettelo/, and /kamera/ thus have the time to space the constraints apart as required by the categorical backbone of the target ranking (6). The ranking dynamics in table 12a indeed shows the six constraints nicely organizing into the required three strata. And the final ranking vector matches the attested frequencies, as shown in table 14.

EDCD and the CEDRA fail on the Finnish genitive plurals test case for the by-now familiar reason. Constraint WTS is never demoted because it is never loser-preferring in table 11. The other constraints are instead all forced into free fall. This is because the three underlying forms /korjaamo/, /maailma/, and /naapuri/, which display variation, keep triggering updates forever, and the resulting demotions are not balanced by corresponding promotions, since EDCD performs no constraint promotion and the CEDRA only a small amount. The constraint *Ă drops more slowly than the other four constraints, as shown by the ranking dynamics for EDCD and the CEDRA in table 12c–d. The reason is that *Ă is demoted only by the triplet (/maailma/, [máa.il.mo.jen], [máa.il.mòi.den]), whose underlying form is extremely infrequent. Table 14 indeed shows that this underlying form is effectively treated as categorical by the final ranking vectors learned by EDCD and the CEDRA.
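
The free fall can be made concrete with a back-of-the-envelope calculation on the two /korjaamo/ ERCs. Suppose, as a simplification, that the learner demotes each loser-preferring constraint by 1 but promotes each winner-preferring constraint only by some amount p < 1 (with p = 0 for EDCD). Each update triggered by one of the two ERCs then moves *H.H and *Ŏ by +p and −1 in some order, so a pair of updates, one per ERC, moves the sum of their ranking values by 2(p − 1) < 0; since the variation guarantees that both ERCs keep triggering updates, the pair drifts downward without bound. A minimal numeric illustration, where the promotion amount .1 is an arbitrary stand-in for a small calibrated promotion:

```python
p = 0.1            # small promotion amount (p = 0 corresponds to EDCD)
hh, o = 0.0, 0.0   # ranking values of *H.H and *Ŏ, measured relative to their start

for _ in range(1000):        # a thousand pairs of updates, one per /korjaamo/ ERC
    hh, o = hh + p, o - 1.0  # update triggered by (/korjaamo/, [kór.jaa.mo.jen], [kór.jaa.mòi.den])
    hh, o = hh - 1.0, o + p  # update triggered by (/korjaamo/, [kór.jaa.mòi.den], [kór.jaa.mo.jen])

print(hh, o)  # both roughly -900: the two constraints are in free fall
```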

Unlike the GLA, the minGLA fails on the Finnish test case: table 14 shows that the final ranking vector learned by the minGLA fails to match the attested frequencies.18 The minGLA differs from the GLA only in that the latter demotes all loser-preferring constraints, while the former demotes only the one with the largest stochastic ranking value. This means that the GLA and the minGLA can behave differently only in response to ERCs with two or more L’s. In the specific case of the simplified Finnish test case, the GLA and the minGLA can thus behave differently only on the ERC corresponding to (/naapuri/, [náa.pu.rèi.den], [náa.pu.ri.en]), which is the only one in table 11 with two L’s.

Since it displays variation, the underlying form /naapuri/ will keep triggering updates forever, as the current stochastic ranking vector will never be able to satisfy both contradictory ERCs corresponding to the two triplets (/naapuri/, [náa.pu.rèi.den], [náa.pu.ri.en]) and (/naapuri/, [náa.pu.ri.en], [náa.pu.rèi.den]). The latter triplet will always promote both *H.H and *Í. In the case of the minGLA, the former triplet will instead always demote only one of the two, namely the one with the largest stochastic ranking value. As a result, the combined effect of the two ERCs corresponding to /naapuri/ is that the ranking value of *H.H keeps increasing indefinitely over time. The minGLA thus does not manage to slide *L.L above *H.H. The ranking dynamics plotted in table 12b indeed shows that all constraints rise and the learner fails to space them apart.19

This analysis entails that the failure of the minGLA does not depend on the choice of the initial ranking values. To see this, set the initial ranking values of the minGLA equal to the final ranking values learned by the GLA, as reported in table 13. These ranking values already account for all the attested frequencies. Yet the minGLA fails even with such a favorable choice of initial ranking values: the underlying form /naapuri/, which displays variation, keeps triggering updates forever and thus forces the constraints into “free rise”, because its two corresponding ERCs overall promote more than they demote.
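
The “free rise” can likewise be isolated by training the minGLA on nothing but the two /naapuri/ ERCs. In the sketch below (same simplified setup as the training-loop sketch above: a single stage, unit plasticity, an arbitrary common initial ranking value), the promoting ERC always adds 1 to both *H.H and *Í, while the demoting ERC removes 1 from only the stochastically higher of the two, so the ranking values of *H.H and *Í climb without bound, dragging *L.L up with them.

```python
import random

theta = {"*L.L": 100.0, "*H.H": 100.0, "*Í": 100.0}
noise, plasticity = 2.0, 1.0

for _ in range(100_000):
    stochastic = {c: v + random.gauss(0.0, noise) for c, v in theta.items()}
    if random.random() < 0.71:
        # ERC of (/naapuri/, [náa.pu.ri.en], [náa.pu.rèi.den]): W on *H.H and *Í, L on *L.L.
        if stochastic["*L.L"] > max(stochastic["*H.H"], stochastic["*Í"]):  # error
            theta["*H.H"] += plasticity
            theta["*Í"] += plasticity
            theta["*L.L"] -= plasticity  # only one loser-preferring constraint here
    else:
        # ERC of (/naapuri/, [náa.pu.rèi.den], [náa.pu.ri.en]): W on *L.L, L on *H.H and *Í.
        if max(stochastic["*H.H"], stochastic["*Í"]) > stochastic["*L.L"]:  # error
            theta["*L.L"] += plasticity
            top_loser = max(("*H.H", "*Í"), key=lambda c: stochastic[c])
            theta[top_loser] -= plasticity  # minGLA: demote only the highest loser-preferrer

print(theta)  # *H.H and *Í end up far above their initial value of 100
```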

5 Conclusions

The language acquisition and learnability literature (at least since Wexler and Culicover 1980) has endorsed error-driven learning because it imposes no memory requirements (training data are not stored but discarded after each update), because it straightforwardly models acquisition gradualness (child acquisition paths can be modeled through the predicted sequences of grammars), and because it is robust to noise (a single piece of corrupted data has only a tiny effect on the final grammar).20 A core issue of the theory of error-driven learning within OT concerns the specifics of the reranking rule used to revise the current ranking whenever it makes an error on the current piece of data. There are two crucial choice points in the theory. The first concerns which loser-preferring constraints are demoted: all of them or just a subset (say, just the undominated ones or just the top-ranked one)? The second concerns the promotion amount: should it be equal to the demotion amount or smaller (and thus possibly null)? Some combinations of these options pursued in the literature are summarized in table 1.

The issue of the proper definition of the reranking rule has been tackled from two perspectives. One perspective focuses on categorical phonology within deterministic OT and asks the following question: which reranking rule ensures that the deterministic EDRA converges to a constraint ranking that captures the categorical pattern? In this categorical context, convergence means that the EDRA eventually stops making errors. Another perspective focuses on phonological variation within stochastic OT and asks the following question: which reranking rule ensures that the stochastic EDRA converges to a ranking vector that captures the pattern of variation? In this stochastic context, convergence means that, although the EDRA keeps making updates forever (because of the competition between variants), the ranking values eventually stop shifting away and instead oscillate around a midpoint that remains fixed.

Unfortunately, these two perspectives lead to contradicting conclusions. From the categorical perspective, it is better for the promotion amount to be smaller rather than equal to the demotion amount; and it is better to demote only the loser-preferring constraints that are undominated rather than all the loser-preferring constraints (Pater 2008, Magri 2012). In other words, from this perspective, EDCD and the CEDRA outperform the GLA. But from the perspective of variation, it is better for the promotion amount to be exactly equal to the demotion amount; and it is better to demote all the loser-preferring constraints rather than just the undominated ones or the top-ranked one. In other words, from this perspective, the GLA outperforms EDCD and the CEDRA.

This article has explained in detail why the latter conclusion holds. Patterns of variation lead to pairs of contradicting ERCs. These contradicting ERCs keep triggering updates forever, no matter which reranking rule is chosen. These updates thus need to cancel each other out, so that they do not indefinitely displace the constraints. This canceling out requires demotions and promotions to balance each other exactly. Thus, the promotion amount cannot be smaller than the demotion amount (otherwise, these ERCs do not promote enough, forcing their active constraints into free fall). And no loser-preferring constraint can forgo demotion (otherwise, these ERCs do not demote enough, forcing their active constraints into “free rise”). Unfortunately, for the time being we must leave unresolved this tension between the opposing conclusions reached from the two perspectives of categorical phonology and of variation in stochastic OT.

Notes

1 This can be achieved for instance by clipping the original Gaussian outside of [−Δ, +Δ].

2 BH assume that the demotion and promotion amounts d and p are multiplied by a step-size or plasticity that decreases with time, making the updates smaller toward the end of the run. We consider plasticity only in the simulations (see footnotes 5 and 17), while in the analyses we ignore it (namely, assume it is equal to 1) for simplicity.

3 The reranking rule originally proposed by Tesar and Smolensky (1998) can be easily restated in terms of ranking vectors, yielding (apart from a glitch discussed in Boersma 2009) the nongradual counterpart of the algorithm that we refer to here as gradual EDCD; see Magri 2012:sec. 3 for details.

4 In the case of EDCD, the CEDRA, and the minGLA, the same n stochastic values are used twice: once at step (b), to compute the current prediction [z]; and again at step (d), to determine which loser-preferring constraints are demoted.

5 As in BH 2001, each simulation consists of three learning stages of 7,000 iterations each; plasticity (see footnote 2) is equal to 2.0, 0.2, and 0.02 in the three stages; the distribution Ɗ is a Gaussian with zero mean and variance equal to 10.0, 2.0, and 2.0 in the three stages.

6 The final ranking vector reported here for the GLA is almost identical to the one reported by BH. They note that multiple runs of the GLA yield almost identical final ranking vectors. The same holds for the other three reranking rules according to our simulations.

7 As in BH 2001, the probabilities in table 3 are obtained by sampling the stochastic OT grammar 100,000 times, with a Gaussian distribution with zero mean and variance 2.0.

8 Recall from footnote 2 that we assume plasticity to be equal to 1 for simplicity.

9 The amplitude of the oscillations decreases with time, because plasticity decreases after each stage of 7,000 iterations (see footnote 5).

10 The slope of the fall decreases with time, because plasticity decreases after each stage of 7,000 iterations (see footnote 5).

11 Both amounts are actually multiplied by the plasticity corresponding to the current iteration.

12 Goldwater and Johnson (2003) also report that they have been able to simplify the Finnish test case. Yet they do not provide specifics on what their simplified test case looks like, and therefore we cannot build on their observation.

13 To illustrate, Qaleksanteri is computed as Qaleksanteri = qaleksanteri / (qaleksanteri + qministeri + qnaapuri) = .0194.

14 The probabilities of the underlying forms in the second column of table 10 need to be normalized to add up to 1, compensating for the mass initially allocated to the six underlying forms in rows a–f of table 10, which have been dropped because they have no L’s.

15 The three constraints *Ó, *Á, and *Ĭ that we drop head columns in the original table 9 that are the opposite (namely, have an L instead of a W and vice versa) of the columns headed by *Ŏ, *Ă, and *Í, respectively. Furthermore, the fourth constraint *LAPSE that we drop heads a column in table 9 that is the opposite of the column headed by *H.H apart from eight ERCs: the six ERCs a–f, which have no L’s and can thus be ignored, and the two ERCs corresponding to the categorical underlying forms /kala/ and /lasi/, which trigger only a few updates because they have a W corresponding to the constraint WTS, which has no L’s and is therefore never demoted. Indeed, the GLA in BH’s simulations (see BH 2001:70, (29)) assigns to these four constraints the smallest ranking values, whereby they play no role in predicting the attested frequencies.

16 Indeed, this triplet triggers no updates in our simulations on the original test case in table 9.

17 As in BH 2001, each simulation consists of five learning stages; the first four stages consist of 22,000 iterations each, while the fifth stage consists of 300,000 iterations; plasticity is equal to 2.0, 2.0, 0.2, 0.02, and 0.002 in the five stages; the distribution Ɗ is a Gaussian with zero mean and variance equal to 10.0 in the first stage and to 2.0 in the following four stages.

18 Another variant of the GLA that fails on the Finnish test case is the stochastic EDRA that (like the GLA) promotes all winner-preferring constraints by 1 but (like EDCD and the CEDRA) demotes only the loser-preferring constraints that are undominated relative to the current stochastic ranking vector (rather than just the highest-ranked one, like the minGLA).

19 The ranking values effectively keep increasing forever, although the rate of increase in the last learning stage is very small because the plasticity is very small (0.002) in the last stage.

20 Non-error-driven learning requires special ad hoc provisions in order to achieve these advantages, such as an iterated implementation through multiple batches of data to model gradualness, a stochastic implementation to achieve robustness, and a developing protolexicon to get around the lack of a stored full-blown lexicon in the early acquisition stages; see Gibson and Wexler 1994:410 for discussion.

Acknowledgments

The research reported in this article has been supported by a grant from the Agence Nationale de la Recherche (project title: “The Mathematics of Segmental Phonotactics”) and by a grant from the MIT France Seed Fund Award (project title: “Phonological Typology and Learnability”). We would like to thank an anonymous LI reviewer for very useful comments.

References

Anttila, Arto. 1997a. Deriving variation from grammar: A study of Finnish genitives. In Variation, change and phonological theory, ed. by Frans Hinskens, Roeland van Hout, and Leo Wetzels, 35–68. Amsterdam: John Benjamins. Rutgers Optimality Archive ROA-63, https://roa.rutgers.edu.
Anttila, Arto. 1997b. Variation in Finnish phonology and morphology. Doctoral dissertation, Stanford University, Stanford, CA.
Boersma, Paul. 1997. How we learn variation, optionality and probability. In Proceedings of the Institute of Phonetic Sciences (IFA) 21, ed. by Rob van Son, 43–58. Amsterdam: University of Amsterdam, Institute of Phonetic Sciences.
Boersma, Paul. 1998. Functional phonology. Doctoral dissertation, University of Amsterdam. The Hague: Holland Academic Graphics.
Boersma, Paul. 2009. Some correct error-driven versions of the Constraint Demotion Algorithm. Linguistic Inquiry 40:667–686.
Boersma, Paul, and Bruce Hayes. 2001. Empirical tests of the Gradual Learning Algorithm. Linguistic Inquiry 32:45–86.
Gibson, Edward, and Kenneth Wexler. 1994. Triggers. Linguistic Inquiry 25:407–454.
Goldwater, Sharon, and Mark Johnson. 2003. Learning OT constraint rankings using a Maximum Entropy model. In Proceedings of the Stockholm Workshop on Variation within Optimality Theory, ed. by Jennifer Spenader, Anders Eriksson, and Östen Dahl, 111–120. Stockholm: Stockholm University.
Hayes, Bruce, and Mary Abad. 1989. Reduplication and syllabification in Ilokano. Lingua 77:331–374.
Magri, Giorgio. 2012. Convergence of error-driven ranking algorithms. Phonology 29:213–269.
Pater, Joe. 2008. Gradual learning and convergence. Linguistic Inquiry 39:334–345.
Prince, Alan. 1983. Relating to the grid. Linguistic Inquiry 14:19–100.
Prince, Alan. 2002. Entailed ranking arguments. Ms., Rutgers University, New Brunswick, NJ. Rutgers Optimality Archive ROA-500, https://roa.rutgers.edu.
Prince, Alan, and Paul Smolensky. 2004. Optimality Theory: Constraint interaction in generative grammar. Oxford: Blackwell. Original version, Technical Report CU-CS-696-93, Department of Computer Science, University of Colorado at Boulder, and Technical Report TR-2, Rutgers Center for Cognitive Science, Rutgers University, April 1993. Rutgers Optimality Archive ROA-537, https://roa.rutgers.edu.
Tesar, Bruce, and Paul Smolensky. 1998. Learnability in Optimality Theory. Linguistic Inquiry 29:229–268.
Wexler, Kenneth, and Peter W. Culicover. 1980. Formal principles of language acquisition. Cambridge, MA: MIT Press.