Abstract

Various authors have recently endorsed Harmonic Grammar (HG) as a replacement for Optimality Theory (OT). One argument for this move is that OT seems not to have close correspondents within machine learning while HG allows methods and results from machine learning to be imported into computational phonology. Here, I prove that this argument in favor of HG and against OT is wrong. In fact, I show that any algorithm for HG can be turned into an algorithm for OT. Hence, HG has no computational advantages over OT. This result allows tools from machine learning to be systematically adapted to OT. As an illustration of this new toolkit for computational OT, I prove convergence for a slight variant of Boersma’s (1998) (nonstochastic) Gradual Learning Algorithm.

1 Introduction

The peculiar property of Optimality Theory (OT; Prince and Smolensky 2004) is that it uses constraint ranking and thus enforces strict domination, according to which the highest-ranked relevant constraint ‘‘takes it all.’’ Because of this property, OT seems prima facie not to have any close correspondents within core machine learning.1 For this reason, the toolkit available nowadays in computational OT for modeling language acquisition, production, and perception consists mainly of combinatorial algorithms, specifically tailored to the framework of OT, developed with few connections to methods and results in machine learning. Tesar and Smolensky’s (1998) powerful ranking algorithms well exemplify this approach to computational OT.

In order to bridge this gap between computational phonology and machine learning, various scholars have started to entertain and explore variants of OT that replace constraint ranking with constraint weighting and strict domination with additive interaction, and thus fall within the general class of linear models very well studied in machine learning. An important and simple such model is Harmonic Grammar (HG; Legendre, Miyata, and Smolensky 1990a,b). For instance, Pater (2009) writes, ‘‘[I will] illustrate and extend existing arguments for the replacement of OT’s ranked constraints with [HG’s] weighted ones: that the resulting model of grammar . . . is compatible with well-understood algorithms for learning and other computations’’ (p. 1002) and in particular ‘‘for learning variable outcomes and for learning gradually’’ (p. 1021). He then adds that ‘‘the strengths of HG in this area are of considerable importance’’ (p. 1002): in fact, ‘‘as these algorithms are broadly applied with connectionist and statistical models of cognition, this forms an important connection between the HG version of generative linguistics and other research in cognitive science’’ (p. 1021). In other words, HG has been conjectured to be computationally superior to OT because it can make use of algorithms from machine learning (i.e., algorithms for linear classification), unlike OT. This conjecture of an alleged computational superiority of HG over OT has been endorsed by a number of authors in the recent literature (e.g., Coetzee and Pater 2008, Hayes and Wilson 2008, Jesney and Tessier 2009, Potts et al. 2010, Boersma and Pater, to appear).

In section 2, I briefly review the two frameworks of OT and HG. In section 3, I then prove that this conjecture of an alleged computational superiority of HG over OT is false. In fact, the main result of this article is a simple strategy that allows any algorithm for HG to be turned into an algorithm for OT (see theorem 1). Hence, HG has no computational advantages over OT, and the departure from OT to HG is not warranted on the basis of computational considerations. Of course, this result does not in any way provide an argument in favor of OT or against alternative frameworks such as HG. It only shows that the argument against OT and in favor of HG based on the conjectured computational superiority of the latter does not go through. Still, my result on the systematic portability of algorithms from HG into OT is significant because it leads to a substantial enrichment of the current toolkit of computational OT. As noted above, computational OT has relied so far mainly on combinatorial algorithms specifically tailored to the framework of OT, with few connections to machine learning. The result presented in section 3 allows this classical toolkit to be supplemented with a whole new set of algorithmic tools, obtained by systematically adapting to OT well-known algorithms from machine learning. A proper combination of the classical toolkit with the new one has the potential to spur new research in computational modeling of language production, perception, and acquisition within the framework of OT.

As an initial illustration of the fruitfulness of these new algorithmic tools, section 4 describes a specific application in some detail. An obvious property of language acquisition is that it is gradual, in the sense that the target adult language is approached through a path of conservative, intermediate stages. This gradualness suggests the following learning scheme within OT. The algorithm maintains a current ranking, which represents its current hypothesis of the target adult grammar. Data come in a stream. Every time the current piece of data is inconsistent with the current ranking, constraints are slightly reranked. As the ranking dynamics is driven by errors made on the stream of data, the algorithm is called error-driven. The sequence of slight rerankings describes a path within the space of possible OT grammars, thus modeling gradual child acquisition paths. It is mainly for this reason that error-driven learning has been endorsed in the OT acquisition literature (see, e.g., Bernhardt and Stemberger 1998, Boersma and Levelt 2000, Gnanadesikan 2004, as well as Tessier 2009 for discussion).

Two main implementations of the error-driven learning scheme have been developed in the OT computational literature, reviewed in section 4.1. One is Tesar and Smolensky’s (1998) Error-Driven Constraint Demotion (EDCD), or Boersma’s (1998:323–327) gradual reformulation thereof. The other is Boersma’s (1997) Gradual Learning Algorithm (GLA). The main difference between (gradual) EDCD and the GLA is that the former only performs constraint demotion while the latter performs both promotion and demotion. This difference is crucial, from both a computational and a modeling perspective. From a computational perspective, lack of constraint promotion allowed Tesar and Smolensky to prove that EDCD converges; that is, it can only make a finite, small number of errors. On the contrary, constraint promotion was shown by Pater (2008) to prevent the GLA from converging in the general case. Although a liability from a computational perspective, constraint promotion turns into an advantage from a modeling perspective, as argued in section 4.2, building on Bernhardt and Stemberger 1998, Boersma 1998, and Magri 2012a.

Against the background of this tension between the computational and modeling perspectives, the following question stands out as one of the main open issues in computational OT: is it possible to devise a variant of the GLA that performs constraint promotion and yet provably converges, so as to retain its modeling virtues without sacrificing computational soundness? In sections 4.3–4.8, I show that the new approach to computational OT developed in this article leads to a simple solution to this important question. I introduce a variant of the (nonstochastic) GLA and I prove that it converges in the general case, even though it performs both constraint demotion and promotion (see sections 4.6 and 4.8). The proof uses the result on the portability of algorithms from HG into OT established in section 3: convergence of the revised GLA is shown to follow from a classical machine learning result—namely, convergence of the Perceptron algorithm for HG.

In section 5, I summarize this new approach to computational OT, based on a systematic translation into OT of methods and results from machine learning.

In order to make the article accessible to the reader with no computational inclination, in the body of the article I explain the main results in a plain, nontechnical way, focusing on concrete examples. Detailed proofs are provided in the appendices available as supplementary online materials.2

2 Description of the Frameworks of OT and HG

This section reviews the frameworks of OT and HG with an eye to the formal details, presupposing general familiarity with these frameworks. Section 2.1 introduces the two corresponding core computational problems: weighting and ranking. Section 2.2 restates these two problems in terms of a more compact notation that will turn out to be useful in the rest of the article. Finally, section 2.3 reviews what is currently known concerning the relationship between the weighting and ranking problems.

2.1 Basic Description of HG and OT

The basic data unit in both HG and OT is a data triplet as in (1a). The first entry in the triplet provides the underlying form, here notated x. The second entry provides the intended winner candidate for that underlying form, here notated y. The final entry in the triplet provides a loser candidate, here notated z (as a mnemonic, I strike out losers).

(1)

  • a.

    (x, y, z)

  • b.

    (/rad/, [rat], [rad])

An example of an underlying/winner/loser form triplet is provided in (1b): the underlying form /rad/ is paired with the two candidate surface forms [rat] and [rad], together with the information that the former is the intended winner while the latter is a loser.

An HG grammar is parameterized by a weight vector (2a), which is a tuple θ with n numerical components θ1, . . . , θn, one for each of the n constraints C1, . . . , Cn. The kth component θk is called the weight of the corresponding constraint Ck.

(2)

  • a.

    θ = (θ1, . . . , θn)

  • b.

    θ = (θ1, θ2, θ3) = (8, 2, 4)

An example is provided in (2b). The constraint set contains three constraints. Constraints C1 and C2 are the faithfulness constraints IDENT[VOICE]/ONSET and IDENT[VOICE] (henceforth, Fpos and F, respectively), which enforce preservation of voicing in onset position and in an arbitrary position, respectively. Constraint C3 is the markedness constraint *[+VOICE, −SONORANT] (henceforth, M) that penalizes voiced obstruents. The weight vector θ assigns these three constraints C1, C2, and C3 the three weights 8, 2, and 4, respectively.

To complete the description of the HG framework, we need a notion of ‘‘compatibility’’ between a hypothesis (i.e., a weight vector) and a piece of data (i.e., a data triplet). A weight vector θ is called HG-compatible with an underlying/winner/loser form data triplet (x, y, z) provided condition (3) holds. This condition says that the intended loser z violates the constraints ‘‘more severely’’ than the intended winner y, in the sense that the sum of the constraint violations for the loser z multiplied by the corresponding weights is (strictly) larger than the sum of the constraint violations for the winner y multiplied by the corresponding weights.

(3)

θ1 C1(x, z) + . . . + θn Cn(x, z) > θ1 C1(x, y) + . . . + θn Cn(x, y)

As an example, note that the weight vector (2b) is HG-compatible with the data triplet (1b). Of course, a weight vector is called HG-compatible with a set of data triplets provided it is HG-compatible with every triplet in the set. And a set of data triplets is called HG-compatible provided it is compatible with at least one weight vector.
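To make condition (3) concrete, here is a minimal sketch in Python; the encoding of violation profiles as tuples over the constraints (C1, C2, C3) = (Fpos, F, M) and the function name hg_compatible are illustrative choices of mine, not the article’s.

```python
def hg_compatible(theta, winner_viols, loser_viols):
    # (3): the weighted violations of the loser must strictly exceed
    # the weighted violations of the winner
    loser_score = sum(t * v for t, v in zip(theta, loser_viols))
    winner_score = sum(t * v for t, v in zip(theta, winner_viols))
    return loser_score > winner_score

# The weight vector (8, 2, 4) of (2b) against the triplet (/rad/, [rat], [rad])
# of (1b): the winner [rat] violates only F, the loser [rad] violates only M.
print(hg_compatible((8, 2, 4), winner_viols=(0, 1, 0), loser_viols=(0, 0, 1)))
# -> True
```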

If the weights are allowed to be negative, undesired typological consequences follow. Here is an example. Consider again the constraint set in (2b). If we allow negative weights, then the triplet (/ta/, [da], [ta]) turns out to be HG-compatible (say, with the weights θ1 = θ2 = θ3 = −1, which make every constraint difference count in the loser’s favor). This means that [da] wins over [ta] as the surface form corresponding to the underlying form /ta/. This result is undesired, as the underlying form /ta/ is unmarked relative to the constraints considered and should therefore always surface faithfully. For this reason, from now on I will require the weights to be nonnegative, as stated in (4).

(4) θ1, . . . , θn ≥ 0

Let me stress that this nonnegativity restriction (4) is not part of the core computational definition of HG, and it can thus be relaxed if needed. Condition (4) is only needed because of the assumption that constraints assign ‘‘violations,’’ and never ‘‘rewards.’’

Suppose that we know the constraints and we are provided with a finite set of data, namely, a finite set D of underlying/winner/loser form triplets. We would like to come up with an assignment of (nonnegative) weights to the constraints that ‘‘works’’ for those data. How can we formalize this requirement? If the data are HG-compatible, then of course we want to find a weight vector that is indeed HG-compatible with all the data. If the data are not HG-compatible, then no such vector can be found. In this case, we might want to find a weight vector that is HG-compatible with most of the data. Unfortunately, the problem thus formulated cannot be solved efficiently, as it is intractable (Johnson and Preparata 1978). We thus need to content ourselves with a less demanding formulation of the problem. Let’s say that, in the case where the data are not HG-compatible, we just need to detect the incompatibility. These considerations lead to the classical computational problem (5a), which I will refer to as the weighting problem. This is the simplest computational problem in HG. On the one hand, this problem is simple because the input to the problem is as rich as possible: the underlying forms are provided, and both the winner and the loser forms are completely parsed. On the other hand, this problem is simple because the output of the problem is as unconstrained as possible: the weight vector returned (if any) is only required to be HG-compatible with the data, and there are no further requirements. The weighting problem is thus the kernel of any computational problem that arises in HG. I will denote by WP(D) the instance of the weighting problem (5a) corresponding to a set D of data triplets, or equivalently the set of all its solutions.

(5)

  • a.

    Given: a constraint set and a data set D consisting of a finite number of underlying/winner/loser form triplets.

    Return: ⊥, if the data are not HG-compatible; otherwise, a nonnegative weight vector θ that is HG-compatible with the data D, according to condition (3).

  • b.

    Given: the three constraints in (2b) and the two data triplets (/da/, [da], [ta]) and (/rad/, [rat], [rad]).

    Return: ⊥, if the two triplets are not HG-compatible; otherwise, nonnegative weights for the constraints that are HG-compatible with the two triplets.

An instance of the weighting problem is provided in (5b). The underlying forms /da/ and /rad/ are paired with the corresponding intended winner surface forms [da] and [rat] and the corresponding intended losers [ta] and [rad], respectively. We are asked to determine whether these data are HG-compatible. If they are, we also have to return HG-compatible (nonnegative) weights for the constraints in (2b). Thus, the weight vector in (2b) is a solution to this instance (5b) of the weighting problem.

Let us now turn to OT. An OT grammar is parameterized by a ranking, which is a linear order >> over the constraint set, as illustrated in (6a), or equivalently in (6a′). We say that constraint Ch is >>-ranked above another constraint Ck provided that Ch>>Ck.

(6)

  • a.

    Ck1 >> Ck2 >> . . . >> Ckn

  • a′.

    [the same ranking displayed vertically, from the top-ranked constraint Ck1 down to the bottom-ranked constraint Ckn]

  • b.

    Fpos >> M >> F    (i.e., C1 >> C3 >> C2)

To illustrate, a ranking over the constraint set in (2b) is provided in (6b): it sandwiches the markedness constraint C3 in between the two faithfulness constraints C1 and C2, with the positional faithfulness constraint ranked at the top.

Also in the case of OT, data units are underlying/winner/loser form triplets, as in (1). To complete the definition of the OT framework, we need a notion of ‘‘compatibility’’ between a hypothesis (i.e., a ranking) and a piece of data (i.e., a data triplet). A ranking >> is called OT-compatible with an underlying/winner/loser form data triplet (x, y, z) provided condition (7) holds. This condition says that the intended loser z violates the constraints ‘‘more severely’’ than the intended winner y, in the sense that, among those constraints that distinguish between winner and loser, the top-ranked one Ctop assigns more violations to the loser than to the winner.

(7) Ctop(x, z) > Ctop(x, y)

where Ctop = the top >>-ranked constraint among those that assign a different number of violations to the loser z and to the winner y.

As an example, the ranking (6b) is OT-compatible with the underlying/winner/loser form triplet (/rad/, [rat], [rad]) in (1b). Of course, a ranking >> is called OT-compatible with a set of data triplets provided it is OT-compatible with every triplet in the set. Furthermore, a set of data triplets is called OT-compatible provided it is compatible with at least one ranking.
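The OT notion of compatibility in (7) can be sketched analogously; the encoding of a ranking as a list of constraint indices from highest- to lowest-ranked is again an illustrative choice of mine.

```python
def ot_compatible(ranking, winner_viols, loser_viols):
    # (7): scan constraints from highest- to lowest-ranked; the first one
    # that distinguishes the two candidates must penalize the loser more
    for k in ranking:
        if loser_viols[k] != winner_viols[k]:
            return loser_viols[k] > winner_viols[k]
    return True  # no constraint distinguishes winner and loser

# The ranking Fpos >> M >> F of (6b), i.e. indices [0, 2, 1], against the
# triplet (/rad/, [rat], [rad]) of (1b).
print(ot_compatible([0, 2, 1], winner_viols=(0, 1, 0), loser_viols=(0, 0, 1)))
# -> True
```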

In complete analogy with the setting considered above for HG, suppose we know the constraint set and are provided with data consisting of a finite number of underlying/winner/loser form triplets. Again, we would like to come up with a constraint ranking that ‘‘works’’ for those data. If the data are OT-compatible, then of course this means that we want to find a ranking that is indeed OT-compatible with all the data. If the data are not OT-compatible, then no such ranking can be found. In this case, we might want to find a ranking that is OT-compatible with most of the data. Unfortunately, the problem thus formulated cannot be solved efficiently, as it is intractable (Magri 2013). We thus need to content ourselves with a less demanding formulation of the problem. Let’s say that, in the case where the data are not OT-compatible, we just need to detect the incompatibility. These considerations lead to the classical computational problem (8a), which I will refer to as the ranking problem. This problem is completely analogous to the HG weighting problem (5a). The ranking problem is the kernel of any computational problem that arises in OT. I will denote by RP(D) the instance of the ranking problem (8a) corresponding to a set D of data triplets, or equivalently the set of its solutions.

(8)

  • a.

    Given: a constraint set and a data set D consisting of a finite number of underlying/winner/loser form triplets.

    Return: ⊥, if the data are not OT-compatible; otherwise, a ranking >> that is OT-compatible with the data D, according to condition (7).

  • b.

    Given: the three constraints in (2b) and the two data triplets (/da/, [da], [ta]) and (/rad/, [rat], [rad]).

    Return: ⊥, if the two triplets are not OT-compatible; otherwise, a ranking of the three constraints that is OT-compatible with the two data triplets.

An instance of the ranking problem is provided in (8b). The underlying forms /da/ and /rad/ are paired with the corresponding intended winner surface forms [da] and [rat] and the corresponding intended losers [ta] and [rad], respectively. We are asked to determine whether these data are OT-compatible. If they are, we also have to return an OT-compatible ranking of the constraints in (2b). The unique solution to this instance of the problem is ranking (6b).

2.2 A More Compact Notation for the Data in HG and OT

Given an underlying/winner/loser form data triplet (x, y, z), the difference (9a) between the number Ck(x, z) of violations assigned by constraint Ck to the loser z and the number Ck(x, y) of violations assigned to the winner y is called the kth constraint difference.

(9)

  • a.

    Ck(x, z) − Ck(x, y)

  • b.

    Fpos(/rad/, [rad]) − Fpos(/rad/, [rat]) = 0

    F(/rad/, [rad]) − F(/rad/, [rat]) = −1

    M(/rad/, [rad]) − M(/rad/, [rat]) = 1

An example is provided in (9b) for the underlying/winner/loser form data triplet (/rad/, [rat], [rad]) in (1b) and the constraint set {Fpos, F, M} in (2b). The constraint difference corresponding to the positional faithfulness constraint Fpos = IDENT[VOICE]/ONSET is 0, because that constraint assigns zero violations to the mapping of /rad/ both to [rat] and to [rad]. The constraint difference corresponding to the general faithfulness constraint F = IDENT[VOICE] is −1, because the intended loser [rad] is fully faithful to the underlying form /rad/, unlike the intended winner [rat]. Finally, the constraint difference corresponding to the markedness constraint M = *[+VOICE, −SONORANT] is 1, as only the intended loser [rad] violates the markedness constraint, while the winner [rat] does not.

Condition (3) for HG-compatibility can of course be rewritten as in (10), by bringing everything on one side. This restatement highlights the fact that HG-compatibility is only sensitive to the constraint differences, not to the actual numbers of constraint violations.

(10)

θ1 [C1(x, z) − C1(x, y)] + . . . + θn [Cn(x, z) − Cn(x, y)] > 0

Thus, the information provided by a data triplet that is really needed for the sake of establishing HG-compatibility can be distilled as in (11). The data triplet is paired with a tuple with n entries (one for every constraint), with the convention that the kth entry is the kth constraint difference Ck(x, z) − Ck(x, y). One such n-tuple of numbers is called an elementary weighting condition (EWC).3 It contains all the information that is needed to compare the winner and the loser within HG. A generic EWC is denoted by ā and its components by ā1, . . . , ān; a finite collection of EWCs is denoted by Ā; the collection of EWCs corresponding to a set D of data triplets is denoted by Ā(D); I often omit zeros for readability.

(11)

  • a.

    (x, y, z) ⇒ ā = [ā1 . . . ān]

    where āk = Ck(x, z) − Ck(x, y)

  • b.

    Ā(D):
                               Fpos   F    M
    (/da/, [da], [ta])    ⇒   [ 1     1   −1 ]
    (/rad/, [rat], [rad]) ⇒   [ 0    −1    1 ]

An example is provided in (11b): we are given a data set D consisting of two underlying/winner/loser form triplets (/da/, [da], [ta]) and (/rad/, [rat], [rad]); we compute the corresponding constraint differences with respect to the constraint set in (2b), and we arrange them in the two EWCs as in Ā(D).
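The mapping (11a) from triplets to EWCs is easy to state in the same sketch style; the violation profiles below are the ones behind (9b) and its analogue for the first triplet, and all names are again mine.

```python
def ewc(winner_viols, loser_viols):
    # (11a): the kth entry is the constraint difference C_k(x,z) - C_k(x,y)
    return [l - w for w, l in zip(winner_viols, loser_viols)]

# Constraints ordered as (Fpos, F, M):
print(ewc(winner_viols=[0, 0, 1], loser_viols=[1, 1, 0]))  # (/da/, [da], [ta])    -> [1, 1, -1]
print(ewc(winner_viols=[0, 1, 0], loser_viols=[0, 0, 1]))  # (/rad/, [rat], [rad]) -> [0, -1, 1]
```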

With this notation in place, condition (10) for HG-compatibility between a weight vector θ = (θ1, . . . , θn) and an underlying/winner/loser form data triplet can be restated in terms of the corresponding EWC ā = [ā1, . . . , ān] as condition (12). Thus, let us say that a weight vector θ is HG-compatible with an arbitrary EWC ā = [ā1, . . . , ān] provided condition (12) holds.

(12)

θ1 ā1 + . . . + θn ān > 0

Of course, a weight vector θ is called HG-compatible with a set Ā of EWCs provided it is HG-compatible with every EWC in the set. And a set of EWCs is called HG-compatible provided it is compatible with at least one weight vector.

The weighting problem has been stated in (5a) in terms of data triplets. But actual data triplets carry superfluous information, and a sharper representation of data triplets is provided by EWCs. Thus, it is convenient to restate the weighting problem in terms of EWCs, as in (13a). I will denote by WP(Ā) the instance of the weighting problem (13a) corresponding to a finite set Ā of EWCs, or equivalently the set of its solutions. Of course, a weight vector is a solution to the instance of the original weighting problem (5a) for a given set D of data triplets if and only if it is a solution to the instance of the problem (13a) for the corresponding set Ā(D) of EWCs, namely, WP(D) = WP(Ā(D)).

(13)

  • a.

    Given: a finite set Ā of EWCs.

    Return: ⊥, if the data Ā are not HG-compatible; otherwise, a nonnegative weight vector θ that is HG-compatible with Ā, according to condition (12).

  • b.

    Given: the set Ā of EWCs in (11b).

    Return: ⊥, if the data Ā are not HG-compatible; otherwise, a nonnegative weight vector θ for the constraint set in (2b) that is HG-compatible with Ā.

As an example, I give in (13b) the reformulation in terms of EWCs of the instance of the weighting problem (5b).

An analogous simplification of the representation of the data and of the corresponding core computational problem is available within the framework of OT. Given an underlying/winner/loser form data triplet (x, y, z), the constraints can be sorted into winner-preferring, loser-preferring, and even as in (14a), depending on whether the corresponding constraint difference is positive (i.e., the constraint assigns more violations to the loser than to the winner), negative (i.e., the constraint assigns fewer violations to the loser than to the winner), or null (i.e., the constraint assigns the same number of violations to the loser and to the winner).

(14)

  • a.

    Ck is winner-preferring provided Ck(x, z) − Ck(x, y) > 0
    Ck is loser-preferring provided Ck(x, z) − Ck(x, y) < 0
    Ck is even provided Ck(x, z) − Ck(x, y) = 0

  • b.

    Fpos is even; F is loser-preferring; M is winner-preferring

An example is provided in (14b) for the constraint set {Fpos, F, M} in (2b) and the underlying/winner/loser form data triplet (/rad/, [rat], [rad]) in (1b), building on the constraint differences computed in (9b).

The notion of OT-compatibility in (7) only depends on whether the various constraints are winner-preferring, loser-preferring, or even. Following Prince (2002), the information provided by an underlying/winner/loser form triplet (x, y, z) that is really needed for the sake of OT-compatibility can thus be distilled as in (15a). The data triplet is paired with a tuple with n entries (one for every constraint), with the convention that the kth entry is equal to W, L, or e depending on whether the kth constraint Ck is winner-preferring, loser-preferring, or even. One such n-tuple of L’s, e’s, and W’s is called an elementary ranking condition (ERC). It contains all the information that is needed to compare the winner and the loser within OT. A generic ERC is denoted by a and its components by a1, . . . , an; a finite collection of ERCs is denoted by A; the collection of ERCs corresponding to a set D of triplets is denoted by A(D); I often omit e’s for readability.

(15)

  • a.

    (x, y, z) ⇒ a = [a1 . . . an]

    where ak = W if Ck(x, z) − Ck(x, y) > 0, ak = L if Ck(x, z) − Ck(x, y) < 0, and ak = e if Ck(x, z) − Ck(x, y) = 0

  • b.

    A(D):
                               Fpos  F  M
    (/da/, [da], [ta])    ⇒   [ W    W  L ]
    (/rad/, [rat], [rad]) ⇒   [ e    L  W ]

As an example, I provide in (15b) the collection of ERCs corresponding to the collection of EWCs in (11b) for the two data triplets (/da/, [da], [ta]) and (/rad/, [rat], [rad]) and the constraint set {Fpos, F, M} in (2b).

With this notation in place, condition (7) for OT-compatibility between a ranking >> and a set of underlying/winner/loser form data triplets can be restated as condition (16a) in terms of the corresponding set of ERCs. Thus, let us say that a ranking >> is OT-compatible with an arbitrary set of ERCs provided condition (16a) holds.

(16)

  • a.

    Once the n entries of each ERC are reordered from left to right in decreasing order according to >>, then the leftmost non-e entry is a W in each reordered ERC.

  • b.

     Fpos  M  F
    [ W    L  W ]
    [ e    W  L ]

As an illustration of the notion of OT-compatibility in (16a), note that the set of ERCs in (15b) is OT-compatible with the ranking Fpos >> M >> F in (6b): once the entries are ordered from left to right in >>-decreasing order (by switching the second column with the third), we obtain the ERCs in (16b), which indeed have the desired property that the leftmost non-e symbol of every ERC is a W. Of course, a collection of ERCs is called OT-compatible provided it is compatible with at least one ranking.
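Condition (16a) also admits a compact sketch: reorder each ERC’s entries according to the ranking and check that the leftmost non-e entry is a W. The list-of-strings encoding of ERCs is an illustrative choice of mine.

```python
def ot_compatible_ercs(ranking, ercs):
    # ranking: constraint indices from highest- to lowest-ranked
    for erc in ercs:
        for k in ranking:
            if erc[k] != 'e':
                if erc[k] == 'L':
                    return False  # leftmost non-e entry is an L
                break             # leftmost non-e entry is a W: ERC is fine
    return True

# The ERCs of (15b) against the ranking Fpos >> M >> F of (6b):
A = [['W', 'W', 'L'], ['e', 'L', 'W']]
print(ot_compatible_ercs([0, 2, 1], A))  # -> True
```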

The ranking problem has been stated in (8a) in terms of data triplets. But actual data triplets carry superfluous information, and a sharper representation of data triplets is provided by ERCs. Thus, it is convenient to restate the ranking problem in terms of ERCs, as in (17a). I will denote by RP(A) the instance of the ranking problem (17a) corresponding to a set A of ERCs,4 or equivalently the set of its solutions. Of course, a ranking is a solution to the instance of the original ranking problem (8a) for a given set D of data triplets if and only if it is a solution to the instance of the problem (17a) for the corresponding set of ERCs A(D), namely, RP(D) = RP(A(D)).

(17)

  • a.

    Given: a set A of ERCs.

    Return: ⊥, if the data A are not OT-compatible; otherwise, a ranking >> that is OT-compatible with A, according to condition (16a).

  • b.

    Given: the set A of ERCs in (15b).

    Return: ⊥, if the data A are not OT-compatible; otherwise, a ranking >> of the constraint set in (2b) that is OT-compatible with A.

As an example, I give in (17b) the formulation in terms of ERCs of the instance of the ranking problem (8b).

2.3 What Is Currently Known about the Relationship between the Weighting and Ranking Problems

Lemma 1 summarizes what is currently known concerning the relationship between the two core computational problems in HG and OT, the weighting and ranking problems.

LEMMA 1. If a finite set D of underlying/winner/loser form triplets is OT-compatible, then it is also HG-compatible. More precisely, let >> be a ranking OT-compatible with D. Without loss of generality, assume that it is (18a): Cn is ranked at the top, Cn−1 below it, and so on, until the bottom-ranked C1.

(18)

  • a.

    Cn >> Cn−1 >> . . . >> C2 >> C1

  • b.

    θk = (1 + Δ/δ)^k for every k = 1, . . . , n

Then, the weight vector θ = (θ1, . . . , θn) defined in (18b) is HG-compatible with D, where Δ is the largest constraint difference (ignoring sign) and δ is the smallest positive constraint difference.

For completeness, the proof is recalled in appendix A of the online supplementary materials, after Prince and Smolensky 2004 and Keller 2000, 2005. The idea of the proof is that the highest-takes-all behavior of the notion of OT-compatibility (7) can be mimicked by the weighted notion of HG-compatibility (3) through exponentially spaced weights.5 For a comparison between HG and OT from a typological perspective, see for instance Tesar 2007, Pater 2009, Bane and Riggle 2009, and Potts et al. 2010; and for a comparison from a learnability perspective, see Riggle 2009 and Bane, Riggle, and Sonderegger 2010.

Let me illustrate the lemma with an example. Given the data set D consisting of the two underlying/winner/loser form triplets (/da/, [da], [ta]) and (/rad/, [rat], [rad]), consider the corresponding set of EWCs (11b) and the corresponding set of ERCs (15b), both repeated in (19).

(19)

         Fpos   F    M              Fpos  F  M
    Ā = [ 1     1   −1 ]       A = [ W    W  L ]
        [ 0    −1    1 ]           [ e    L  W ]

The set of ERCs A is OT-compatible with the ranking Fpos >> M >> F in (6b). Since in this case Δ = δ = 1, the corresponding weights according to (18b) are θFpos = 8, θF = 2, and θM = 4, as considered in (2b). The latter weights are indeed HG-compatible with the set of EWCs Ā.
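A quick sketch of the weighting construction behind the lemma, under the reading of (18b) reconstructed above (exponentially spaced weights θk = (1 + Δ/δ)^k, with constraints relabeled bottom-to-top as in (18a)); this closed form is consistent with the example just given, but should be taken as one possible reading rather than the article’s exact formula.

```python
def exponential_weights(n, Delta, delta):
    # one reading of (18b): the constraint at position k of the ranking
    # (k = 1 bottom, k = n top) receives weight (1 + Delta/delta)**k
    return [(1 + Delta / delta) ** k for k in range(1, n + 1)]

# With Delta = delta = 1 and n = 3 constraints, this yields the weights
# 2, 4, 8 of F, M, and Fpos in (2b), bottom-to-top along Fpos >> M >> F.
print(exponential_weights(3, Delta=1, delta=1))  # -> [2.0, 4.0, 8.0]
```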

The reverse of lemma 1 does not hold; namely, there exist data sets that are HG-compatible but not OT-compatible. Here is a counterexample. Suppose that the set of EWCs is Ā in (20a). The corresponding set of ERCs is A in (20b). The former is HG-compatible (say, with the weights θ1 = 3 and θ2 = θ3 = 2), but the latter is not OT-compatible.

(20)

  • a.

         C1  C2  C3
    Ā = [ 1  −1   0 ]
        [ 1   0  −1 ]
        [−1   1   1 ]

  • b.

         C1  C2  C3
    A = [ W   L   e ]
        [ W   e   L ]
        [ L   W   W ]

Example (20) can be made more explicit as follows. In order for a weight vector θ = (θ1, θ2, θ3) to be HG-compatible with the first and second EWCs in (20a), the weight of constraint C1 must be larger than both the weights of constraints C2 and C3, as stated in (21a); in order for a ranking >> to be OT-compatible with the first and second ERCs in (20b), constraint C1 must be ranked above both constraints C2 and C3, as stated in (21b).

(21)

  • a.

    θ1 > θ2 and θ1 > θ3

  • b.

    C1 >> C2 and C1 >> C3

No ranking that satisfies the ranking conditions (21b) can ever be OT-compatible with the third ERC in (20b). A weight vector that satisfies the weighting conditions (21a) can instead be HG-compatible with the third EWC in (20a), provided that the two constraints C2 and C3, despite their small weights, are allowed to join forces and gang up against constraint C1, in the sense that the sum θ2 + θ3 of their weights is larger than the weight θ1 of C1. The crucial difference between HG and OT is that the former allows these gang-up effects, while the latter doesn’t.

The algorithmic implications of lemma 1 can be brought out as follows. Suppose we are given a finite set D of data triplets that happen to be not only HG-compatible but also OT-compatible; and suppose we are interested in solving the instance WP(Ā) of the weighting problem for the corresponding set Ā = Ā(D) of EWCs. Lemma 1 says that, instead of solving the weighting problem WP(Ā) directly, we can solve it indirectly, through the three steps (22a–c). In step (22a), we construct the corresponding set A of ERCs, according to (15). In step (22b), we solve the corresponding instance RP(A) of the ranking problem. Finally, in step (22c), we obtain a weight vector that solves the given weighting problem WP(Ā) through (18).

(22)

    a. Ā(D) ⇒ A(D): construct the set of ERCs corresponding to the data, as in (15)
    b. solve the instance RP(A) of the ranking problem, obtaining a ranking >>
    c. >> ⇒ θ: derive a weight vector from the ranking >>, as in (18)

In other words, lemma 1 says that the weighting problem reduces to the ranking problem, in the sense that we can solve the former by solving the latter instead (provided the data are OT-compatible). Yet, as recalled in section 1, we already know how to solve the weighting problem, as we can draw on the large literature on linear models (see, e.g., Potts et al. 2010). What we are really looking for instead are good methods to solve the ranking problem. The fact that we can reduce the weighting problem to the ranking problem is of no algorithmic interest. Lemma 1 therefore has no interesting algorithmic implications.

3 Any Algorithm for HG Can Be Ported into OT

In the preceding section, I have reviewed the two frameworks of OT and HG and the two corresponding core computational problems: the ranking problem RP(A) and the weighting problem WP(Ā), repeated in (23) and (24).

(23)

  • Given: a finite set A of ERCs.

  • Return: ⊥ if the data A are not OT-compatible; otherwise, a ranking >> that is OT-compatible with A.

(24)

  • Given: a finite set Ā of EWCs.

  • Return: ⊥ if the data Ā are not HG-compatible; otherwise, a nonnegative weight vector θ that is HG-compatible with Ā.

The question addressed in this section can roughly be stated as follows: given an arbitrary instance of the ranking problem (23), is it possible to pair it with an instance of the weighting problem (24) such that we can solve the former by solving the latter instead? This question can be stated more precisely as follows: given an instance RP(A) of the ranking problem, is it possible to find one (or, even better, all) of its solutions without solving the problem directly but rather by solving it indirectly, through the scheme in (25)? This scheme proceeds as follows. In step (25a), we pair the given set A of ERCs with a proper set Ā of EWCs. In step (25b), we find a solution θ to the corresponding weighting problem WP(Ā), or else determine that it admits no solution. Finally, in step (25c), we pair that solution θ with a ranking >>, or else return ⊥ if no solution to the weighting problem could be found. We hope that the ranking thus obtained solves the instance RP(A) of the ranking problem that we started with, whenever a solution exists; and that ⊥ is returned whenever a solution to the ranking problem does not exist.

(25)

    a. A ⇒ Ā: pair the given set A of ERCs with a proper set Ā of EWCs
    b. solve the instance WP(Ā) of the weighting problem, obtaining a weight vector θ or else ⊥
    c. θ ⇒ >>: pair the weight vector θ with a ranking >>, or else return ⊥

The scheme in (25) is the inverse of the scheme in (22), which summarizes lemma 1. Thus, the question considered in this section is whether the algorithmic perspective implicit in lemma 1 can be inverted, even though the reverse of lemma 1 does not hold.

3.1 The Intuitive Idea

In order to implement the scheme in (25), we need to define the two steps (25a) and (25c); that is, we need to find proper ways to pair ERCs with EWCs and weight vectors with rankings. Let me introduce the very simple, core idea with a few examples. Consider first the case of the ERC a in (26). Crucially, it contains a unique entry equal to W. Define the corresponding EWC ā as follows: the e of C1 is replaced with 0; the W of C2 is replaced with 1; and the L of C3 is replaced with −1.

(26)

         C1  C2  C3             C1  C2  C3
    a = [ e   W   L ]   ⇒  ā = [ 0   1  −1 ]

A weight vector θ = (θ1, θ2, θ3) is HG-compatible with this derived EWC ā provided θ2 − θ3 is strictly positive—equivalently, provided the weight θ2 corresponding to constraint C2 is strictly larger than the weight θ3 corresponding to constraint C3. Consider a ranking that ‘‘respects’’ the ordering implicit in the relative size of these weights. Any such ranking thus ranks C2 above C3. The latter ranking condition ensures OT-compatibility with the ERC a that we started from.

Consider next the case of the ERC a in (27). Crucially, it contains two entries equal to W. Define the corresponding EWC ā as follows: the L of C3 is again replaced by −1; but the two W’s of C1 and C2 are now replaced by 1/2 (rather than by 1), to capture the fact that this ERC has two winner-preferrers.

(27)

         C1  C2  C3             C1   C2   C3
    a = [ W   W   L ]   ⇒  ā = [ 1/2  1/2  −1 ]

A weight vector θ = (θ1, θ2, θ3) is HG-compatible with this derived EWC ā provided that the quantity (1/2)θ1 + (1/2)θ2 − θ3 is strictly positive—equivalently (by multiplying everything by 2), provided (θ1 − θ3) + (θ2 − θ3) is strictly positive. This implies in particular that either (θ1 − θ3) is strictly positive or (θ2 − θ3) is strictly positive (or both). Again, consider a ranking that ‘‘respects’’ the ordering implicit in the relative size of these weights. Any such ranking either ranks constraint C1 above C3 or ranks constraint C2 above C3 (or both). The latter ranking condition ensures OT-compatibility with the ERC a that we started from.

As a final example, consider again the set of ERCs (20b), repeated as A in (28). Consider the corresponding set Ā of EWCs in (28). Note that the two W’s in the first two ERCs of A are replaced with +1 in Ā, as each of those ERCs has a unique winner-preferrer; while the two W’s in the last ERC are each replaced with 1/2, as that ERC has two winner-preferrers.

(28)

         C1  C2  C3            C1   C2   C3
    A = [ W   L   e ]     Ā = [ 1   −1    0  ]
        [ W   e   L ]         [ 1    0   −1  ]
        [ L   W   W ]         [−1   1/2  1/2 ]

As noted above, the set of ERCs A is not OT-compatible. It is easy to check that the derived set Ā of EWCs is not HG-compatible either. In fact, a weight vector θ = (θ1, θ2, θ3) needs to satisfy the two inequalities θ1 > θ2 and θ1 > θ3 in order to be HG-compatible with the first two EWCs. Adding them together yields 2θ1 > θ2 + θ3. The latter inequality is equivalent (dividing everything by 2) to θ1 > (1/2)θ2 + (1/2)θ3, which says that θ is not HG-compatible with the third EWC in Ā.

3.2 Main Claim

The reasoning just illustrated with the three examples (26)–(28) holds in the general case, as follows. Given an ERC a with a total of w entries equal to W, consider the EWC ā derived from a as in (29): every entry equal to L in a corresponds to −1 in ā; every entry equal to e in a corresponds to 0 in ā; and every entry equal to W in a corresponds to 1/w in ā.

(29)

  • a = [a1, . . . , an] ⇒ ā = [ā1, . . . , ān]

  • āk = −1 if ak = L; āk = 0 if ak = e; āk = 1/w if ak = W

Let us say that a set of EWCs Ā is derived from a set of ERCs A provided that the EWCs are derived from the ERCs according to (29). The examples (26)–(28) illustrate this construction.
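The derivation (29) is mechanical enough to sketch directly; exact fractions are used so that the 1/w entries stay exact, and all names are mine.

```python
from fractions import Fraction

def derived_ewc(erc):
    # (29): L -> -1, e -> 0, W -> 1/w, where w is the number of W's
    w = erc.count('W')
    table = {'L': Fraction(-1), 'e': Fraction(0),
             'W': Fraction(1, w) if w else Fraction(0)}
    return [table[entry] for entry in erc]

print(derived_ewc(['e', 'W', 'L']))  # (26): [0, 1, -1]
print(derived_ewc(['W', 'W', 'L']))  # (27): [1/2, 1/2, -1]
```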

Let us say that a ranking >> is derived from a weight vector θ = (θ1, . . . , θn) provided it ‘‘respects’’ the order implicitly defined by the relative size of the weights, in the sense that condition (30) holds for every pair of constraints Ch and Ck. The idea of this correspondence between weight vectors and rankings is due to Boersma (1998, 2009).

(30) θh > θk ⇒ Ch >> Ck

Let me unpack condition (30), by considering two cases in turn. If all the components of the weight vector θ are pairwise distinct, the vector θ admits a unique derived ranking—namely, the unique ranking that ranks a constraint Ck above a constraint Ch if and only if the weight θk of Ck is larger than the weight θh of Ch. This case is illustrated in (31a): the unique derived ranking assigns C1 to the top, C2 to the bottom, and C3 in between, respecting the relative sizes of their weights.

(31)

  • a.

    θ1 > θ3 > θ2  ⇒  C1 >> C3 >> C2

  • b.

    θ1 > θ2 = θ3  ⇒  C1 >> C2 >> C3 or C1 >> C3 >> C2

If instead the components of the weight vector θ are not all pairwise distinct, then θ admits multiple derived rankings, because a tie between two weights can be broken differently by different derived rankings. This case is illustrated in (31b): the weights of C2 and C3 tie, and thus this weight vector admits two derived rankings, which break the tie in two different ways.
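Enumerating the rankings derived from a weight vector per (30) amounts to listing the orderings along which the weights are nonincreasing; the brute-force enumeration below is only meant to illustrate the definition, and the weight values are illustrative choices of mine.

```python
from itertools import permutations

def derived_rankings(theta):
    # (30): a strictly heavier constraint must precede a strictly lighter
    # one; equivalently, weights are nonincreasing along the ranking
    return [list(p) for p in permutations(range(len(theta)))
            if all(theta[p[i]] >= theta[p[i + 1]] for i in range(len(p) - 1))]

print(derived_rankings([3, 1, 2]))  # no ties: unique ranking [[0, 2, 1]]
print(derived_rankings([3, 1, 1]))  # C2/C3 tie: [[0, 1, 2], [0, 2, 1]]
```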

The main result of this section is lemma 2. Its proof is a straightforward generalization of the reasoning illustrated above with the three examples (26)–(28), as shown in appendix B of the online supplementary materials. This proof crucially relies on the restriction (4) that HG weights be nonnegative. Appendix C in the online supplementary materials presents a variant of lemma 2 where this restriction is waived, but the given ERCs are required to contain a unique loser-preferrer.

LEMMA 2. Given a set A of ERCs, consider the corresponding set Ā of EWCs derived from A as in (29). If Ā is HG-compatible, then A is OT-compatible. More precisely, if a (nonnegative) weight vector θ is HG-compatible with Ā, then any ranking derived from θ according to (30) is OT-compatible with A. ▓

Given a set of ERCs A and the corresponding set of EWCs Ā derived through (29), consider the corresponding instance RP(A) of the ranking problem (23) and the corresponding instance WP(Ā) of the weighting problem (24). Lemma 2 says that HG-compatibility of Ā entails OT-compatibility of A. Does the reverse hold too? That is indeed the case, as lemma 1 ensures that OT-compatibility of A entails HG-compatibility of Ā. Thus, the ranking problem RP(A) admits a solution if and only if the corresponding weighting problem WP(Ā) admits one. Assume that they indeed admit a solution. Lemma 2 says that, if θ is a solution to the weighting problem WP(Ā), then any of its derived rankings is a solution to the ranking problem RP(A). In other words, we can obtain some of the solutions to the ranking problem RP(A) by looking for the derived rankings of solutions to the weighting problem WP(Ā). Does the reverse hold too? Namely, is it the case that, if a ranking is a solution to the ranking problem RP(A), then it is derived from some weight vector that solves the weighting problem WP(Ā)? In other words, is it the case that, by looking for derived rankings of solutions to the weighting problem WP(Ā), we obtain not just some but actually all of the solutions to the ranking problem RP(A)? That is indeed the case, thanks again to lemma 1. In fact, consider a ranking >> that solves the ranking problem RP(A). Without loss of generality, assume that it is Cn >> Cn−1 >> . . . >> C1 (otherwise, just relabel the constraints). Consider the weight vector θ defined from >> as in (18b), where Δ is the largest entry (ignoring sign) of the derived set of EWCs Ā. Note that >> is derived from θ; that is, it satisfies condition (30). Furthermore, lemma 1 guarantees that θ solves the weighting problem WP(Ā). In conclusion, the two lemmas together entail the following theorem:

THEOREM 1. Given a set of ERCs A, consider the corresponding set of EWCs Ā derived from A as in (29). Then, Ā is HG-compatible if and only if A is OT-compatible. Furthermore, a ranking solves the instance RP(A) of the ranking problem (17) if and only if it is derived via (30) from a (nonnegative) weight vector that solves the corresponding instance WP(Ā) of the weighting problem (24).

This theorem says that the scheme (25) holds provided the mapping (25a) from sets of ERCs to sets of EWCs is defined as in (29) and the mapping (25c) from weight vectors to rankings is defined as in (30). In other words, the theorem says that it is possible to solve a given instance of the ranking problem (23) without solving it directly, but rather by solving a corresponding instance of the weighting problem (24). Thus, any algorithm for the weighting problem in HG can be turned into an algorithm for the ranking problem in OT. The conjecture of an alleged computational superiority of HG over OT, recalled from the literature in section 1, is thus wrong.
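Theorem 1 thus licenses an end-to-end recipe: derive the EWCs, hand them to any HG solver, and read a ranking off the resulting weights. The sketch below treats the weighting problem as a linear feasibility program; the use of scipy.optimize.linprog and the unit margin (which stands in for strict positivity, since weights can be rescaled) are my choices, not the article’s, and any HG solver would do.

```python
import numpy as np
from scipy.optimize import linprog

def entry_value(x, w):
    if x == 'L':
        return -1.0
    if x == 'e':
        return 0.0
    return 1.0 / w  # x == 'W'; w counts the W's in this ERC

def solve_ranking_problem(ercs):
    # step (25a): derive the EWCs from the ERCs, as in (29)
    rows = []
    for erc in ercs:
        w = erc.count('W')
        rows.append([entry_value(x, w) for x in erc])
    A_bar = np.array(rows)
    n = A_bar.shape[1]
    # step (25b): feasibility LP, find theta >= 0 with A_bar @ theta >= 1
    res = linprog(c=np.zeros(n), A_ub=-A_bar, b_ub=-np.ones(len(ercs)),
                  bounds=[(0, None)] * n)
    if not res.success:
        return None  # the ERCs are not OT-compatible
    # step (25c): read a derived ranking off the weights, per (30)
    return sorted(range(n), key=lambda k: -res.x[k])  # highest-ranked first

# The ERCs of (15b): the unique solution is C1 >> C3 >> C2, i.e. [0, 2, 1].
print(solve_ranking_problem([['W', 'W', 'L'], ['e', 'L', 'W']]))
```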

4 An Application to the GLA’s Convergence Problem

Computational OT has so far relied mainly on combinatorial algorithms specifically tailored to the framework of OT, developed with few connections to machine learning. As recalled in section 1, one of the reasons why various authors have started to entertain the alternative framework of HG is that it can straightforwardly make use of standard machine learning algorithms, thus bridging the gap between computational phonology and machine learning. Yet section 3 has shown that this alleged advantage of HG over OT is only apparent, as any algorithm for HG can be ported into OT, through theorem 1. This result thus allows the standard combinatorial toolkit of computational OT to be supplemented with a whole new set of algorithmic tools, obtained through a systematic translation into OT of well-known machine learning algorithms. In this section, I illustrate the fruitfulness of these new tools, by discussing in detail a specific application. Further applications left for future research are sketched in section 5.

In the strongest formulation of the OT framework, the constraint set is universal; it is shared by both children and adults and thus need not be learned. The acquisition of phonology thus consists of the problem of learning a constraint ranking that captures the target adult phonology. How could such a ranking be systematically inferred? Suppose that markedness constraints are initially ranked at the top and faithfulness constraints at the bottom, allowing only completely unmarked forms. Over time, the learner receives a stream of data from the target adult language.

Each time a piece of data is received, the learner checks whether its current constraint ranking accounts for the current piece of data. Whenever the learner makes an error, the relevant faithfulness constraints are slightly promoted and the relevant markedness constraints are slightly demoted. Reranking continues until the faithfulness and markedness constraints intersperse in a ranking that is consistent with the target adult language, so that the learner makes no more mistakes. This is the error-driven ranking model of the acquisition of phonology in OT. The intermediate rankings entertained by the model on its way to the final grammar correspond to intermediate acquisition stages, thus modeling the observed acquisition gradualness. Furthermore, the model is memoryless; that is, it does not keep track of previously seen forms and thus does not impose unrealistic memory requirements. Because of its cognitive plausibility, error-driven learning has been endorsed in the OT acquisition literature (see, e.g., Bernhardt and Stemberger 1998, Boersma and Levelt 2000, and Gnanadesikan 2004, as well as Tessier 2009 for critical discussion). How can this intuitively plausible learning scheme be formalized into a computationally sound learning algorithm?

Two main error-driven ranking algorithms have been developed in the OT computational literature, reviewed in section 4.1: one is Tesar and Smolensky’s (1998) Error-Driven Constraint Demotion (EDCD), or Boersma’s (1998:323–327) gradual reformulation thereof; the other is Boersma’s (1997) Gradual Learning Algorithm (GLA). The main difference between (gradual) EDCD and the GLA is that the former only performs constraint demotion while the latter performs both constraint promotion and demotion. Various studies have shown the good modeling capabilities of the GLA (see, e.g., Boersma and Levelt 2000, Boersma and Hayes 2001, Curtin and Zuraw 2002). Furthermore, various authors have argued that constraint promotion is needed from a modeling perspective, as reviewed in section 4.2. Yet, although a virtue from a modeling perspective, constraint promotion turns into a liability from a computational perspective. In fact, lack of constraint promotion allowed Tesar and Smolensky to prove that EDCD converges after a small number of updates. On the contrary, constraint promotion has been shown by Pater (2008) to prevent the GLA from converging in the general case. Against the background of this tension between the modeling and computational perspectives, one of the main open questions in computational OT is thus the following: is it possible to devise a variant of the GLA that performs promotion and yet provably converges, so as to retain its modeling virtues without sacrificing computational soundness?

This section offers the first positive solution to this question. In section 4.3, I introduce a variant of the GLA that differs from the original GLA because of a more careful calibration of constraint promotion. The analysis of this revised GLA developed in the rest of this section relies on the algorithmic portability from HG into OT established by theorem 1. In fact, in sections 4.4 and 4.5, I show that, once ERCs are mapped to derived EWCs as in (29) and weight vectors to derived rankings as in (30), this revised GLA can be reinterpreted as an instance of the Perceptron algorithm, a classical HG learning algorithm. In sections 4.6 and 4.8, I then use this reinterpretation in order to extend to the revised GLA the convergence properties of the Perceptron, known from the machine learning literature. I thus obtain the first convergence proof for an OT error-driven ranking algorithm that performs both constraint demotion and promotion. In section 4.7, I compare in more detail my revised implementation of the GLA with Boersma’s original formulation.

4.1 (Gradual) EDCD and the GLA

Here is a natural formalization of the informal error-driven learning scheme sketched above. The algorithm maintains a current ranking, which represents its current hypothesis about the target grammar. It initializes its current ranking to some predefined initial ranking. And it keeps updating its current ranking through the three steps in (32). At step (32a), the algorithm receives an ERC; at step (32b), the algorithm checks whether its current ranking is OT-compatible with this current ERC; if it isn’t, then the algorithm takes action at step (32c), by updating its current ranking to a ‘‘slightly’’ modified ranking.

(32)

    a. receive an ERC a;
    b. check whether the current ranking >> is OT-compatible with the ERC a;
    c. if it is not, update the current ranking >> to a slightly modified ranking.

I assume that the ERCs fed to the algorithm are sampled from a given OT-compatible set of ERCs A, called the input set. The algorithm converges provided it can only perform a finite number of updates for any input OT-compatible set of ERCs. If the algorithm converges, then its final ranking solves the instance RP(A) of the ranking problem (23) corresponding to the input set of ERCs A.

As noted in section 3, rankings can be represented through numerical weight vectors: we say that a ranking >> is derived from a given weight vector θ = (θ1, . . . , θn) provided that any pair of constraints Ch, Ck satisfies condition (30), repeated in (33). This condition says that the ranking >> respects the ordering of the constraints that is implicit in the relative size of their weights θ1, . . . , θn, in the sense that constraint Ch is ranked above constraint Ck whenever the weight θh of the former is strictly larger than the weight θk of the latter.

(33) θh > θk ⇒ Ch >> Ck

Boersma (1997, 1998, 2009) suggests using this correspondence between rankings and weight vectors in order to restate the OT error-driven algorithm (32) in terms of weight vectors.6 This has proven to be a remarkably important idea in the development of OT error-driven algorithms.

Suppose that, at every time, the algorithm entertains a current weight vector, rather than a current ranking. At a certain iteration, the current weight vector might happen to have two or more identical components, thus admitting multiple derived rankings. How should we proceed in this case? Following Boersma (2009), I assume that the algorithm updates its current weight vector whenever even just one of the rankings derived from the current weight vector is not OT-compatible with the current ERC, as stated in (34).

(34)

    a. receive an ERC a;
    b. check whether every ranking derived from the current weight vector θ is OT-compatible with the ERC a;
    c. if even just one derived ranking is not, update the current weight vector θ to a slightly modified weight vector.

In fact, suppose that we decided instead to be more conservative, and update the current weight vector just in case none of its derived rankings is OT-compatible with the current ERC. At convergence, the algorithm will thus return a weight vector with the property that some of its derived rankings are OT-compatible with the input set of ERCs. But that is not very useful: how do we decide for a given derived ranking whether it is what we want or not?
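Putting the pieces together, here is a minimal sketch of the error-driven loop (34) in weight-vector form. The update rule is passed in as a function, so the same loop serves both the GLA and gradual EDCD sketched below; cycling over the input ERCs in passes is an illustrative simplification of the data stream, and all names are mine.

```python
def error_driven(ercs, update, n, passes=100):
    theta = [0] * n  # initial weight vector (here: null)
    for _ in range(passes):
        error = False
        for erc in ercs:                    # (34a) receive an ERC
            if not consistent(theta, erc):  # (34b) test every derived ranking
                update(theta, erc)          # (34c) update on an error
                error = True
        if not error:
            break  # a full pass with no errors: stop
    return theta

def consistent(theta, erc):
    # every ranking derived from theta is OT-compatible with the ERC iff
    # some winner-preferrer strictly outweighs every loser-preferrer
    losers = [theta[k] for k, x in enumerate(erc) if x == 'L']
    winners = [theta[k] for k, x in enumerate(erc) if x == 'W']
    if not losers:
        return True
    return bool(winners) and max(winners) > max(losers)
```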

To complete the description of the algorithm, we need an update rule to use in step (34c). Boersma (1997, 1998) puts forward the following intuition. If the algorithm fails on the current ERC, then the winner-preferrers are plausibly currently ranked too low and the loser-preferrers too high. The algorithm should thus react to its failure by promoting winner-preferrers and by demoting loser-preferrers by a small amount—say, 1 as in (35). The OT error-driven ranking algorithm (34) with this update rule (35) is called the (nonstochastic) Gradual Learning Algorithm (GLA).

(35)

  • a.

    Demote each current loser-preferring constraint by 1.

  • b.

    Promote each current winner-preferring constraint by 1.
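A minimal sketch of the update rule (35), pluggable into the error_driven loop sketched above:

```python
def gla_update(theta, erc):
    for k, entry in enumerate(erc):
        if entry == 'L':
            theta[k] -= 1  # (35a): demote every loser-preferrer by 1
        elif entry == 'W':
            theta[k] += 1  # (35b): promote every winner-preferrer by 1
```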

Another update rule considered in the literature is (36). It performs constraint demotion but no constraint promotion, as it pushes down the loser-preferrers but does not push up the winner-preferrers. Another difference between the two update rules (35) and (36) is that the former demotes all loser-preferrers while the latter demotes only those loser-preferrers that really need to be demoted—namely, those that are currently undominated. A loser-preferring constraint Cl is called currently undominated provided there is no winner-preferring constraint Ck that is currently ranked above Cl (in the sense that the current weight of Ck is strictly larger than that of Cl).

(36)

  • a.

    Demote by 1 each currently undominated loser-preferring constraint.

  • b.

    Do nothing to the winner-preferring constraints.

Boersma (1998:323–327) notes that the OT error-driven algorithm (34) with the demotion-only update rule (36) is a gradual version of Tesar and Smolensky’s (1998) EDCD.
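A matching sketch of the demotion-only update rule (36); the test for undominated loser-preferrers follows the definition just given, and the names are mine.

```python
def edcd_update(theta, erc):
    winners = [theta[k] for k, x in enumerate(erc) if x == 'W']
    top_winner = max(winners) if winners else float('-inf')
    for k, entry in enumerate(erc):
        # (36a): a loser-preferrer is undominated if no winner-preferrer
        # has a strictly larger current weight
        if entry == 'L' and theta[k] >= top_winner:
            theta[k] -= 1
    # (36b): winner-preferrers are left untouched
```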

Let me illustrate the GLA (35) with a concrete example; EDCD (36) works analogously. Suppose that the input set of ERCs fed to the GLA is (37a). The beginning of a possible run of the algorithm is provided in (37b).

(37)

  • a.

              C1  C2  C3
    ERC 1 = [ W   W   L ]
    ERC 2 = [ e   L   W ]

  • b.

    θ1 = (0, 0, 0) —ERC 1→ θ2 = (1, 1, −1) —ERC 2→ θ3 = (1, 0, 0) → . . .

Suppose that the algorithm starts from the null initial vector θ1. The algorithm is then fed an input ERC from (37a)—say, ERC 1. Neither of the two winner-preferrers C1 and C2 is ranked above the loser-preferrer C3 according to the current weight vector θ1, as all three constraints currently have the same weight. The algorithm thus takes action: the loser-preferrer C3 is demoted by 1 and the two winner-preferrers C1 and C2 are promoted by 1 each, giving θ2. The algorithm is then fed another input ERC from (37a)—say, ERC 2. The winner-preferrer C3 is not ranked above the loser-preferrer C2 according to the current weight vector θ2. The algorithm thus takes action: the loser-preferrer C2 is demoted by 1 and the winner-preferrer C3 is promoted by 1, giving θ3. And so on.
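The run (37b) can be replayed with the GLA update sketched above (repeated here so that the snippet is self-contained):

```python
def gla_update(theta, erc):
    for k, entry in enumerate(erc):
        theta[k] += {'W': 1, 'L': -1, 'e': 0}[entry]

theta = [0, 0, 0]                   # theta1, the null initial vector
gla_update(theta, ['W', 'W', 'L'])  # error on ERC 1
print(theta)                        # theta2 = [1, 1, -1]
gla_update(theta, ['e', 'L', 'W'])  # error on ERC 2
print(theta)                        # theta3 = [1, 0, 0]
```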

4.2 The Problem of the GLA’s Convergence

Gradual EDCD (36) performs no constraint promotion. It thus predicts a monotonic ranking dynamics, whereby constraints can only drop over time, but cannot rise or oscillate. This simple ranking dynamics allows for a straightforward analysis of the algorithm. Indeed, (gradual) EDCD has been shown to always converge after a number of updates that grows slowly (quadratically) with the number of constraints (Tesar and Smolensky 1998; see also Boersma 1998:323–327, 2009).7 The GLA (35) instead performs both constraint demotion and promotion, thus predicting a more complicated ranking dynamics, whereby constraints can drop, or rise, or oscillate. And convergence of the GLA remained an outstanding open issue for almost a decade. Indeed, shortly after the algorithm appeared in the literature, Keller and Asudeh (2002:237) lamented that ‘‘the convergence properties of the GLA are unknown,’’ as ‘‘this leaves open the possibility that there are data sets on which the GLA will not converge or not produce a meaningful set of constraint rankings. Convergence is a crucial property of a learning algorithm that should be investigated formally.’’ Finally, Pater (2008) settled the issue, with a counterexample that shows that the GLA does not converge in the general case (see Magri 2012a for a detailed explanation of Pater’s counterexample).8

Overall, these results seem to show that lack of constraint promotion is a virtue from a computational perspective, as it predicts a simple monotonic ranking dynamics that allows for straightforward analyses and powerful guarantees on the behavior of the algorithm. Unfortunately, lack of constraint promotion turns into a liability from a modeling perspective, as the monotonic ranking dynamics predicted by demotion-only seems to be too simple to match the attested acquisition complexity. Indeed, various authors in the OT acquisition literature have suggested that demotion-only is not sufficient and that we do want update rules for the OT error-driven algorithm that perform promotion too (see, e.g., Bernhardt and Stemberger 1998, Stemberger and Bernhardt 1999, Stemberger, Bernhardt, and Johnson 1999, Gnanadesikan 2004). An explicit computational argument for constraint promotion is due to Boersma (1997): he argues that constraint promotion is needed in order for (a stochastic variant of) the GLA to learn certain cases of language variation. In Magri 2012a, I develop another computational argument for constraint promotion (in a nonstochastic setting), based on the challenge raised by modeling the acquisition of phonotactics. Let me briefly sketch the latter argument.

In carefully controlled experimental conditions, 9-month-old infants react differently to licit and illicit sound combinations, thus already displaying knowledge of the target adult phonotactics (Jusczyk et al. 1993). As the acquisition of morphology is plausibly still lagging behind at this early developmental stage, the child is blind to phonological alternations (Hayes 2004). In conclusion, there is a stage of pure phonotactic learning, when the child manages to acquire substantial knowledge of the target phonotactics without being exposed to alternations. Many authors have argued that lack of alternations implies that the child can safely posit only fully faithful underlying forms (see, e.g., Gnanadesikan 2004, Hayes 2004, Prince and Tesar 2004). Suppose now that we try to model this early stage of the acquisition of phonotactics by means of the demotion-only EDCD (36). Because of the assumption of fully faithful underlying forms, the faithfulness constraints are never loser-preferrers. As the update rule (36) only demotes loser-preferring constraints, the faithfulness constraints are never reranked throughout this entire learning stage. In other words, their intermediate and final rankings will be identical to their initial ranking. This cannot be right, for at least two reasons. First, the algorithm is not able to learn the phonotactics of languages that require a specific relative ranking of two faithfulness constraints, such as those discussed by Hayes (2004) and Prince and Tesar (2004). Second, the algorithm is not able to model acquisition paths where the child’s repair strategy changes over time, such as the acquisition sequences documented in much of the literature on the acquisition of consonant clusters (McLeod, van Doorn, and Reed 2001).

Let me make these considerations more concrete, by illustrating the first issue with an example. Consider the OT typology in (38), which is a simplified version of an example considered in Prince and Tesar 2004, based on a phonological analysis developed in Lombardi 1999. The two features [STOP-VOICING] and [FRICATIVE-VOICING] each come with a dedicated faithfulness constraint (F1 and F2, respectively) and a dedicated markedness constraint (M1 and M2, respectively). Finally, the markedness constraint M3 = AGREE, which requires adjacent obstruents to agree in voicing, lets the two features interact.

(38)

graphic

Among the languages in the OT typology (38) are the two in (39). A ranking generates these two languages, (39a) and (39b), provided it is a refinement of the partial orders in (40a) and (40b), respectively.9

(39)

graphic

(40)

graphic

Because of the assumption of underlying forms fully faithful to the intended winners, the two faithfulness constraints F1 and F2 are never loser-preferrers throughout learning. As EDCD’s update rule (36) only demotes loser-preferrers, it will never rerank F1 and F2, which will thus stay put at their initial ranking values. In other words, the final relative ranking of F1 and F2 will be the same as the initial one, no matter which of the two languages (39a) or (39b) the algorithm has been trained on. The algorithm will thus fail to acquire the correct phonotactics in at least one of the two cases in (39), as they require the opposite relative ranking of F1 and F2, by (40). Some constraint promotion is needed, in order to move the faithfulness constraints around too, even though they are never loser-preferrers. In Magri 2012a, I show that the promotion component of the GLA (or, equivalently, of the revised GLA developed in section 4.3) indeed allows the algorithm to learn the correct relative ranking (40) of the two faithfulness constraints. And in Magri 2011b, I generalize this initial result, looking at all possible constraints of the type of M3 in (38), which is responsible for the interaction between the two features that define the typology. I show that constraint promotion allows the algorithm to always converge to the desired final ranking no matter how the input ERCs are sampled, but for phonologically implausible modes of feature interaction.

Let me take stock. (Gradual) EDCD only performs constraint demotion, while the GLA performs constraint promotion too. Demotion-only allows EDCD to converge quickly, while promotion prevents the GLA from converging. Although a liability from a computational perspective, constraint promotion turns into a virtue from a modeling perspective. Various studies have shown the GLA’s good modeling capabilities (Boersma and Levelt 2000, Boersma and Hayes 2001, Curtin and Zuraw 2002). Boersma (1997) argues that constraint demotion alone is not able to model certain cases of learning in the presence of variation. And in Magri 2012a, I argue that promotion is needed in order to model the early stage of the acquisition of phonotactics, as just sketched. Against this background, the following question thus stands out as one of the main open problems in computational OT: is it possible to devise a variant of the GLA that performs promotion too and yet provably converges? In other words, is it possible to retain the GLA’s good modeling capabilities without sacrificing computational soundness?

4.3 A Revised GLA

To start, suppose that the current ERC fed to the GLA has a unique L corresponding to some constraint Cl and a unique W corresponding to some constraint Ck, as in (41). This is a simple case: the unique winner-preferrer Ck needs to be ranked above the loser-preferrer Cl in the end, and can therefore be confidently promoted. In this case, it thus makes good sense, say, to promote Ck by 1 and demote Cl by 1, as prescribed by Boersma’s update rule (35).

(41)

graphic

Next, consider the case where the current ERC still has a unique L but now has multiple W’s. For concreteness, suppose it has two W’s corresponding to two constraints Ch and Ck, as in (42a). This ERC by itself does not say which one of the two winner-preferrers Ch or Ck should in the end be ranked above the loser-preferrer Cl. For instance, the set of input ERCs could contain another ERC like (42b), so that only Ch should be ranked above Cl, while Ck must be ranked below it.

(42)

graphic

Given only the current ERC (42a), there is no way to choose in a principled manner which one, Ch or Ck, should be promoted. Nor could we simply promote both, as shown by Pater’s counterexample against the GLA’s convergence.

Boersma’s promotion/demotion update rule (35) does not distinguish between the two cases (41) and (42): all winner-preferrers are promoted by the same amount (say, 1), no matter whether they appear in a simple ERC with a unique winner-preferrer as in (41) or in a challenging ERC with multiple winner-preferrers as in (42). This does not look like a good idea, though: a proper update rule should be sensitive to the crucial difference between the two cases (41) and (42), so as to match the intrinsic logic of OT. I suggest the following modification: in the simple case of ERC (41) with a unique winner-preferrer, we can promote that unique winner-preferrer by 1; but in the challenging case of ERC (42) with two winner-preferrers, we should split our confidence between the two winner-preferrers, by promoting each by just 1/2. In the general case, if the current ERC contains many W’s, then the total promotion amount of 1 should be split among the many winner-preferrers. I thus suggest the revised promotion/demotion update rule (43) for ERCs like (41) and (42) that contain a unique L: the loser-preferrer is demoted by 1 and each winner-preferrer is promoted by 1 divided by the total number w of winner-preferrers.

(43)

  • a.

    Demote the loser-preferrer by 1.

  • b.

    Promote each of the w winner-preferrers by 1/w.

So far, I have only considered current ERCs with a unique L. If such an ERC is not OT-compatible with the current ranking vector, then its unique loser-preferrer must be currently undominated; that is, it cannot already be ranked underneath a winner-preferrer (in the sense that its weight is larger than or equal to the weight of any winner-preferrer). If the current ERC has multiple L’s, then some of them might be currently undominated and others might not be. If only one loser-preferrer is currently undominated, then we can of course again use the very same update rule (43). What if there are two or more currently undominated loser-preferrers? For concreteness, suppose that there are two currently undominated loser-preferrers C′ and C″, as in the ERC in (44a). Split up that ERC into two ERCs, each of which retains only one of the two L’s of the original ERC, while the other gets replaced by an e, as in (44b).

(44)

graphic

As Prince (2002) notes, the original ERC (44a) and the two ERCs (44b) are OT-equivalent, in the sense that they are compatible with exactly the same rankings.

Because of this OT-equivalence, it makes sense to construe the update triggered by the original ERC (44a) as the sequence of the two updates triggered by the two ERCs (44b). Furthermore, since the latter two ERCs contain a single L, we can update in response to each of them using the update rule (43). The two ERCs (44b) have the same number of W’s as the original ERC (44a); call it w. Upon update by the first of the two ERCs (44b), the loser-preferrer C′ is demoted by 1 and each winner-preferrer is promoted by 1/w. And upon update by the second of the two ERCs (44b), the loser-preferrer C″ is demoted by 1 too and each winner-preferrer is promoted once more by 1/w. In the end, each undominated loser-preferrer of the original ERC (44a) is demoted by 1, and each winner-preferrer is promoted by 1/w as many times as there are undominated loser-preferrers. Equivalently, each winner-preferrer is promoted by l/w, as in (45), where w is the number of winner-preferrers and l is the number of undominated loser-preferrers.

(45)

  • a.

    Demote each of the l undominated loser-preferrers by 1.

  • b.

    Promote each of the w winner-preferrers by l/w.

I will refer to the OT error-driven ranking algorithm (34) with this new, better-calibrated promotion/demotion reranking rule (45) as the revised GLA.
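
The update rule (45) is simple enough to state in a few lines of code. The following Python sketch is one possible implementation under the same encoding assumptions as before (ERCs as lists over {'W', 'L', 'e'}, rankings as numeric vectors); exact fractions avoid rounding the l/w promotions, and the function assumes that it is only called on an ERC that currently triggers an update.

```python
from fractions import Fraction

def revised_gla_update(theta, erc):
    """Revised GLA rule (45): demote each of the l currently undominated
    loser-preferrers by 1 and promote each of the w winner-preferrers
    by l/w."""
    winners = [i for i, a in enumerate(erc) if a == 'W']
    top_winner = max(theta[i] for i in winners)
    # A loser-preferrer is currently undominated if it is not already
    # ranked strictly below some winner-preferrer.
    undominated = [i for i, a in enumerate(erc)
                   if a == 'L' and theta[i] >= top_winner]
    l, w = len(undominated), len(winners)
    new = list(theta)
    for i in undominated:
        new[i] -= 1
    for i in winners:
        new[i] += Fraction(l, w)
    return new

# Replaying the run (46b) on the two ERCs of (46a):
theta = [Fraction(0)] * 3
theta = revised_gla_update(theta, ['W', 'W', 'L'])  # (1/2, 1/2, -1)
theta = revised_gla_update(theta, ['e', 'L', 'W'])  # (1/2, -1/2, 0)
```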

To illustrate, consider the beginning of a possible run of this revised GLA in (46b), again on the set of input ERCs (37a), repeated in (46a).

(46)

graphic

Suppose that the algorithm starts from the null initial vector θ1. The algorithm is then fed an input ERC from (46a)—say, ERC 1. Neither of the two winner-preferrers C1 and C2 is ranked above the loser-preferrer C3 according to the current weight vector θ1, as all three constraints currently have the same weight. The algorithm thus takes action: the loser-preferrer C3 is demoted by 1 and that same amount is split over the two winner-preferrers C1 and C2, which are thus each promoted by 1/2, giving the new current weight vector θ2. The algorithm is then fed another input ERC from (46a)—say, ERC 2. The winner-preferrer C3 is not ranked above the loser-preferrer C2 according to the current weight vector θ2. The algorithm thus takes action: the loser-preferrer C2 is demoted by 1 and the winner-preferrer C3 is promoted by 1, giving the new current weight vector θ3. This weight vector ranks C1 above C3 and C3 above C2. As the latter ranking is OT-compatible with the data (46a), any ERC that will be fed to the algorithm from this moment on will not trigger any further update.

In the rest of this section, I offer an analysis of the revised GLA (45) along the following lines. If ERCs are paired with derived EWCs as in (29) and weight vectors with derived rankings as in (30), then HG error-driven algorithms can be translated into OT error-driven ranking algorithms. From this perspective, the revised GLA just developed can be reinterpreted as a well-known HG error-driven algorithm, the Perceptron. Machine learning results on convergence and robustness of the Perceptron thus translate to the revised GLA. In particular, I easily obtain a convergence guarantee for the revised GLA, which represents the first result on convergence for a ranking algorithm that performs constraint promotion too, besides demotion.

4.4 The Perceptron Algorithm

HG error-driven algorithms are analogous to OT error-driven algorithms, the only differences being that they are fed EWCs rather than ERCs and that they check for HG-compatibility rather than for OT-compatibility. Thus, an HG error-driven algorithm can be described as in (47), completely analogous to (34). At step (47a), the algorithm receives an EWC. At step (47b), the algorithm checks HG-compatibility between the current EWC and the current weight vector. At step (47c), the algorithm takes action, in case HG-compatibility does not hold. It is convenient for what follows to denote the current weight vector entertained by an HG error-driven algorithm as θ̄ = (θ̄1, . . . , θ̄n), rather than as θ = (θ1, . . . , θn).

(47)

graphic

I assume that the EWCs fed to the algorithm at step (47a) are sampled from a given, finite, HG-compatible set Ā of EWCs, called the input set. The algorithm converges provided it can only perform a finite number of updates for any HG-compatible input set. If the algorithm converges, then its final weight vector solves the instance WP(Ā) of the weighting problem (24) corresponding to the input set Ā.

Different HG error-driven algorithms differ because of the update rule used in step (47c). These update rules are very well studied in the field of online linear classification (Cesa-Bianchi and Lugosi 2006:chap. 12 offers a modern introduction). As an example, consider the classical update rule (48): the updated weight vector θ̄new = (θ̄1new, . . . , θ̄nnew) is obtained by adding, component by component, the current EWC ā = [ā1, . . . , ān] to the current weight vector θ̄old = (θ̄1old, . . . , θ̄nold).

(48)

graphic

The HG error-driven algorithm (47) with the update rule (48) is called the Perceptron algorithm in the machine learning literature (Rosenblatt 1958, 1962, Block 1962, Novikoff 1962, as well as Cristianini and Shawe-Taylor 2000:chap. 2).10
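
For concreteness, here is a minimal Python sketch of the compatibility check performed at step (47b) and of the Perceptron update (48), together with a replay of the run (49b) discussed next; the strict inequality in the check matches the computations reported there.

```python
from fractions import Fraction

def hg_compatible(theta_bar, ewc):
    """Step (47b): an EWC is HG-compatible with a weight vector iff
    their dot product is strictly positive."""
    return sum(t * a for t, a in zip(theta_bar, ewc)) > 0

def perceptron_update(theta_bar, ewc):
    """Perceptron rule (48): add the current EWC to the current weight
    vector, component by component."""
    return [t + a for t, a in zip(theta_bar, ewc)]

# Replaying the run (49b) on the two EWCs of (49a):
ewc1 = [Fraction(1, 2), Fraction(1, 2), Fraction(-1)]
ewc2 = [Fraction(0), Fraction(-1), Fraction(1)]
theta_bar = [Fraction(0)] * 3
assert not hg_compatible(theta_bar, ewc1)        # 0 is not > 0
theta_bar = perceptron_update(theta_bar, ewc1)   # (1/2, 1/2, -1)
assert not hg_compatible(theta_bar, ewc2)        # -3/2 is not > 0
theta_bar = perceptron_update(theta_bar, ewc2)   # (1/2, -1/2, 0)
```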

To illustrate: Suppose that the set of input EWCs fed to the Perceptron is (49a). The beginning of a possible run of the algorithm is provided in (49b).

(49)

graphic

Suppose that the algorithm starts from the null initial weight vector θ̄1. The algorithm is then fed an input EWC from (49a)—say, EWC 1. The current weight vector θ̄1 is not HG-compatible with this EWC (as 0 · 1/2 + 0 · 1/2 + 0 · (−1) = 0 ≯ 0). Thus, it is updated according to (48). The first and second components of the updated weight vector θ̄2 become 1/2, which is the corresponding component of the old weight vector θ̄1 (namely, 0) plus the corresponding component of the EWC 1 (namely, 1/2); the third component of the updated weight vector θ̄2 becomes −1, which is the third component of the old weight vector θ̄1 (namely, 0) plus the third component of the EWC 1 (namely, −1). The algorithm is then fed another EWC from (49a)—say, EWC 2. The current weight vector θ̄2 is not HG-compatible with this EWC (as 1/2 · 0 + 1/2 · (−1) + (−1) · 1 = −3/2 ≯ 0). Thus, it is updated to θ̄3 by adding EWC 2 to the current weight vector θ̄2 component by component. And so on.

4.5 A Connection between the Revised GLA and the Perceptron

To analyze the revised GLA means to investigate the properties of the sequence of weight vectors θ1, θ2, . . . , θt, . . . entertained in a run of the algorithm. To start, let’s look at a concrete example—say, the run of the revised GLA in (46b)—and let’s compare it with the run (49b) of the Perceptron. The two runs start from the same null initial weight vector: θ1 = θ̄1. Crucially, the weight vector entertained at any time in the run of the revised GLA coincides with the weight vector entertained at that same time in the run of the Perceptron: θ2 = θ̄2, θ3 = θ̄3, and so on. Here, I will explain in three steps why this fact actually holds in full generality. In section 4.6, I will then use this fact to deduce convergence of the revised GLA from convergence of the Perceptron.

Step 1. Every time the revised GLA is fed ERC 1 or ERC 2 from (46a) in the run (46b), the Perceptron is fed the corresponding EWC 1 or EWC 2 from (49a) in the corresponding run (49b). Crucially, every time the current ERC prompts the revised GLA to perform an update in the run (46b), the corresponding current EWC prompts the Perceptron to perform an update as well in the run (49b). This is not a coincidence. The set of EWCs in (49a) is derived from the set of ERCs in (46a) according to the correspondence (29) considered in section 3.2: each L is replaced with −1 and each W is replaced with 1/w, where w is the total number of W’s in that ERC. Thus, the lemma from section 3 applies in this case.11 Recall that this lemma says that, if a derived EWC is HG-compatible with some weight vector, then the original ERC is OT-compatible with each one of the derived rankings of that weight vector. The contrapositive of this lemma can thus be stated as follows. Suppose that a weight vector admits derived rankings that are not OT-compatible with an ERC from (46a), prompting the revised GLA to perform an update. Then, that weight vector is also not HG-compatible with the corresponding derived EWC in (49a), prompting the Perceptron to perform an update too.
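
The correspondence (29) on which step 1 leans is easy to spell out in code. The Python sketch below (same ERC encoding as before) implements the mapping and checks that the ERCs of (46a) yield exactly the EWCs of (49a).

```python
from fractions import Fraction

def erc_to_ewc(erc):
    """The correspondence (29): each L becomes -1, each W becomes 1/w
    (with w the total number of W's in the ERC), and each e becomes 0."""
    w = erc.count('W')
    return [Fraction(1, w) if a == 'W' else
            Fraction(-1) if a == 'L' else Fraction(0)
            for a in erc]

assert erc_to_ewc(['W', 'W', 'L']) == [Fraction(1, 2), Fraction(1, 2), -1]
assert erc_to_ewc(['e', 'L', 'W']) == [0, -1, 1]
```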

Step 2. Suppose that an input ERC from (46a) triggers an update according to the revised GLA update rule (45). For concreteness, suppose it is ERC 1 that triggers the update. This means that the two winner-preferrers C1 and C2 are promoted by 1/2 and the loser-preferrer C3 is demoted by 1. Equivalently, the current weight vector is updated by adding 1/2 to its first and second components and by adding −1 to its third component. In other words, the current weight vector is updated by adding, component by component, the corresponding EWC 1 in (49a), as prescribed by the Perceptron update rule (48). The same holds for ERC 2. In conclusion, an update triggered by an ERC in (46a) according to the revised GLA update rule (45) is equivalent to the update triggered by the corresponding EWC in (49a) according to the Perceptron update rule (48).

Step 3. Suppose that the run (46b) of the revised GLA is continued by feeding the algorithm with ERC 1, as in (50a). As the winner-preferrer C1 is ranked above the loser-preferrer C3 according to the current weight vector θ3, the revised GLA does nothing and the current weight vector θ4 is identical to θ3. At this point, the Perceptron fails at mimicking the revised GLA. In fact, suppose that the run (49b) of the Perceptron is likewise continued by feeding the algorithm with EWC 1, as in (50b). As the current weight vector θ̄3 is not HG-compatible with this EWC (because 1/2 · 1/2 + (−1/2) · 1/2 + 0 · (−1) = 0 ≯ 0), the Perceptron will perform an update, unlike the revised GLA. As a result, the weight vector θ4 entertained by the revised GLA at this iteration is different from the weight vector θ̄4 entertained by the Perceptron.

(50)

  • a.

    Revised GLA’s run

    graphic

  • b.

    Perceptron’s run

    graphic

The difficulty just highlighted is nonetheless insubstantial. If the current weight vector entertained by the revised GLA admits some derived ranking that is not OT-compatible with the current input ERC, an update is performed. Otherwise, nothing happens and the algorithm just waits for more data. In the latter case, nothing would have been different if the current ERC had not been fed to the algorithm to start with. In other words, data that do not trigger an update are irrelevant and can therefore be ignored. We can thus restrict ourselves without loss of generality to runs of the revised GLA where an update is triggered at every iteration.

We are now ready to put these pieces together. The run (46b) of the revised GLA and the run (49b) of the Perceptron start from the same initial weight vector—namely, the null vector. By step 3, I can assume without loss of generality that the current weight vector is always updated by the revised GLA. Furthermore, every time the current weight vector is updated in the run (46b) of the revised GLA, it is also updated in the run (49b) of the Perceptron, by step 1. Moreover, they are updated in exactly the same way in the two runs, by step 2. It thus follows that the two runs entertain exactly the same sequence of weight vectors. As shown in appendix D of the online supplementary materials, this reasoning can be extended from the concrete example (46)/(49) considered here to the general case, obtaining lemma 3 below. This lemma says that a run of the revised GLA can be mimicked with a run of the Perceptron. The next section uses this conclusion in order to reduce the analysis of the revised GLA to the analysis of the Perceptron.

LEMMA 3. Consider a run (51a) of the revised GLA on an input set A of ERCs. Assume that an update is triggered at every time (i.e., θt ≠ θt+1 at every time t). Then, there exists a run (51b) of the Perceptron on an input set Ā of EWCs such that the sequences of weight vectors in the two runs (51a) and (51b) coincide (i.e., θt = θ̄t at every time t), provided that the two runs start from the same initial weight vector (i.e., θ1 = θ̄1).

(51)

graphic

This set Ā of input EWCs for the Perceptron run (51b) is finite; furthermore, it is HG-consistent whenever the set A of ERCs fed to the revised GLA in the run (51a) is OT-consistent. ∎
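
Lemma 3 can be checked on the concrete example of this section with a short self-contained simulation. The Python sketch below (mine, and restricted to ERCs with a unique L, the case covered by the formulation of the lemma mentioned in footnote 11) runs the revised GLA and the Perceptron in lockstep on the input set (46a), asserting at every iteration that both algorithms are prompted to update and that the two weight vectors coincide.

```python
from fractions import Fraction

def erc_to_ewc(erc):                       # the correspondence (29)
    w = erc.count('W')
    return [Fraction(1, w) if a == 'W' else -1 if a == 'L' else 0
            for a in erc]

def gla_triggers(theta, erc):
    """For an ERC with a unique L: some derived ranking of theta fails
    the ERC iff no winner-preferrer is strictly above the
    loser-preferrer."""
    loser = erc.index('L')
    return theta[loser] >= max(theta[i] for i, a in enumerate(erc)
                               if a == 'W')

ercs = [['W', 'W', 'L'], ['e', 'L', 'W']]  # the input set (46a)
theta = [Fraction(0)] * 3                  # revised GLA weight vector
theta_bar = [Fraction(0)] * 3              # Perceptron weight vector
while True:
    pending = [e for e in ercs if gla_triggers(theta, e)]
    if not pending:
        break                              # no ERC triggers an update
    erc, ewc = pending[0], erc_to_ewc(pending[0])
    # Step 1: the corresponding EWC prompts the Perceptron to update too.
    assert sum(t * a for t, a in zip(theta_bar, ewc)) <= 0
    # Step 2: for a unique-L ERC, the revised GLA update (45) amounts to
    # adding the derived EWC component by component, i.e., to rule (48).
    theta = [t + a for t, a in zip(theta, ewc)]
    theta_bar = [t + a for t, a in zip(theta_bar, ewc)]
    assert theta == theta_bar              # the two runs coincide
print(theta)                               # (1/2, -1/2, 0), as in (46b)
```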

4.6 Convergence of the Revised GLA

The OT acquisition literature has endorsed error-driven learning as a plausible model of the child acquisition of phonology. Two implementations of error-driven learning have been devised so far within the OT computational literature: Boersma’s (1998) GLA (35) and Tesar and Smolensky’s (1998) (gradual) EDCD (36). The crucial difference between the two algorithms is that the GLA performs both constraint demotion and promotion while EDCD performs only constraint demotion. Lack of constraint promotion allowed Tesar and Smolensky to show that EDCD converges in the general case. Convergence of the GLA remained an open problem for almost a decade, as pointed out by Keller and Asudeh (2002), until Pater (2008) showed that constraint promotion prevents the GLA from converging in the general case. Yet constraint promotion seems to be needed from a modeling perspective, as argued in section 4.2. The following theorem solves this impasse: it says that convergence is not incompatible with constraint promotion, as long as the promotion component of the update rule is properly calibrated as in the revised GLA (45).

THEOREM 2. The revised GLA with the properly calibrated promotion/demotion update rule (45) converges for any OT-compatible set of input ERCs.

The convergence follows straightforwardly from the reinterpretation of the revised GLA in terms of the Perceptron, stated by lemma 3. In fact, suppose by contradiction that the revised GLA did not converge. This means that there exists an OT-compatible set A of input ERCs such that at every iteration we can pick from A an ERC that forces the revised GLA to perform an update of the current weight vector. Lemma 3 ensures that we can get the Perceptron to mimic that nonconvergent run of the revised GLA. In other words, the lemma ensures that there exists a finite and HG-compatible set Ā of EWCs such that at every iteration we can pick from Ā an EWC that forces the Perceptron to perform an update of the current weight vector. The latter conclusion contradicts the well-known Perceptron convergence theorem, recalled for completeness in appendix E of the online supplementary materials. In conclusion, the convergence for the revised GLA follows as a translation from HG into OT of the convergence theorem for the Perceptron, providing a first specific application of the algorithmic portability from HG into OT established in section 3. For a different approach to the GLA convergence problem, see Magri 2012a.

4.7 What Goes Wrong with the Original GLA

Now that we have seen why the revised GLA (45) converges, it is instructive to go back to the original GLA (35) and understand what goes wrong. As an illustration, consider the run (37b) of the original GLA on the set of input ERCs (37a), repeated in (52a). Consider the corresponding set of EWCs in (52b), obtained by replacing each W with 1, each L with −1, and each e with 0. Every time an ERC from (52a) triggers an update according to the original GLA update rule (35), winner-preferrers are promoted by 1 and loser-preferrers are demoted by 1. Equivalently, the current weight vector is updated by adding component by component the corresponding EWC in (52b), as prescribed by the Perceptron update rule (48).

(52)

graphic

Thus, step 2 of the reasoning outlined in section 4.5 for the revised GLA also holds for the original GLA: each update performed by the original GLA ( just like each update performed by the revised GLA) can be reinterpreted as a Perceptron update, provided that the corresponding EWCs are properly defined as in (52b).

What crucially fails in the case of the original GLA is step 1 of the reasoning outlined in section 4.5, which followed from the result established in section 3. Namely, it is not true that, whenever a weight vector admits a derived ranking not OT-compatible with an ERC in (52a), it is also not HG-compatible with the corresponding EWC in (52b). Here is a counterexample. The weight vector (53) admits derived rankings (such as C3 >> C1 >> C2) that are not OT-compatible with ERC 1 in (52a). Yet this weight vector is HG-compatible with the corresponding EWC 1 in (52b): even though the weights of C1 and C2 are each smaller than the weight of C3, they can gang up to overcome the latter.

(53)

graphic

Suppose now that at a certain point in the run of the original GLA, the current weight vector is indeed (53). If the original GLA is fed ERC 1 from (52a), an update is triggered. But this update cannot be mimicked in the corresponding run of the Perceptron: as the current weight vector (53) is HG-compatible with the corresponding EWC 1 in (52b), no update will be triggered in the Perceptron run. In general, the run of the original GLA can contain many more updates than the corresponding run of the Perceptron. Thus, the Perceptron convergence theorem does not entail convergence of the original GLA, even though updates by the original GLA can be described as Perceptron updates.
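
The gang-up effect is easy to verify numerically. The values of the weight vector (53) sit in the graphic, so the snippet below uses hypothetical weights (2, 2, 3) that have the property described in the text: C1 and C2 each weigh less than C3, but jointly outweigh it.

```python
theta = [2, 2, 3]       # hypothetical stand-in for the weights of (53)
ewc1 = [1, 1, -1]       # EWC 1 of (52b): each W -> 1, each L -> -1
print(sum(t * a for t, a in zip(theta, ewc1)) > 0)   # True: HG-compatible,
# even though the derived ranking C3 >> C1 >> C2 makes the highest-ranked
# relevant constraint a loser-preferrer, failing ERC 1 of (52a)
```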

4.8 Number of Updates

In section 4.6, I focused on the issue of convergence. An important related issue is that of the number of updates required for convergence. Here is a way to address the latter issue. Lemma 3 guarantees that a run of the revised GLA can be mimicked with a run of the Perceptron. Thus, the number of updates performed by the revised GLA on a set of input ERCs is at most the number of updates performed by the Perceptron on the corresponding set of derived EWCs, and bounds on the number of updates performed by the Perceptron yield bounds on the number of updates performed by the revised GLA.

Let’s look at a couple of concrete examples. Following Riggle (2009), let the diagonal set of ERCs corresponding to n constraints be the set A consisting of n − 1 ERCs such that the kth ERC has all entries equal to e except for the kth entry, which is a W, and the following entry, which is an L. To illustrate, I give in (54a) the diagonal sets of ERCs for n = 4, 5 constraints.

(54)

graphic

Assume that we keep feeding to the revised GLA input ERCs sampled from A. What is the largest number of updates that we can force the algorithm to perform, before it converges to the final ranking C1 >> C2 >> . . . >> Cn consistent with A?

Lemma 3 ensures that the number of updates performed by the revised GLA is at most the number of updates performed by the Perceptron on the corresponding derived set Ā of input EWCs constructed according to the correspondence (29) considered in section 3.2: each L is replaced with −1 and each W is replaced with +1 (as each ERC contains a unique W).12 To illustrate, I provide in (54b) the sets of EWCs corresponding to the two diagonal sets of ERCs in (54a). The Perceptron convergence theorem recalled in appendix E of the online supplementary materials offers a bound on the number of updates performed by the Perceptron on the derived set Ā of input EWCs in terms of parameters that quantify certain geometric properties of Ā (namely, its radius and margin). And these parameters can be computed explicitly as a function of the number n of constraints. I thus obtain the bound on the worst-case number of updates performed by the revised GLA on the diagonal set A of input ERCs stated in the following corollary. The proof is presented in appendix F of the online supplementary materials.

COROLLARY 1. The revised GLA (45) performs no more than n(n² − 1)/6 updates on the diagonal set A of input ERCs corresponding to n constraints (starting from null initial weights).
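
Corollary 1 can be sanity-checked by simulation. The sketch below (my own; the fixed pick-the-first-triggering-ERC policy need not realize the adversarial worst case, so the count may undershoot the bound) feeds diagonal ERCs to the revised GLA until none triggers and checks that the number of updates stays within n(n² − 1)/6.

```python
def bound(n):
    """The mistake bound of corollary 1."""
    return n * (n * n - 1) // 6

def diagonal_run(n):
    """Each diagonal ERC has a unique W (constraint k) and a unique L
    (constraint k + 1), so rule (45) demotes the loser-preferrer by 1
    and promotes the winner-preferrer by l/w = 1."""
    theta = [0] * n
    updates = 0
    while True:
        pending = [k for k in range(n - 1) if theta[k + 1] >= theta[k]]
        if not pending:
            return updates
        k = pending[0]
        theta[k] += 1
        theta[k + 1] -= 1
        updates += 1

for n in (4, 5, 10):
    assert diagonal_run(n) <= bound(n)
```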

Here is another example. Let Pater’s set of ERCs corresponding to n constraints be the set A of n − 1 ERCs obtained from the diagonal set by ‘‘adding’’ a W at the right of every L (except in the last ERC). To illustrate, I give in (55a) Pater’s sets of ERCs corresponding to n = 4, 5 constraints.

(55)

graphic

Assume that we keep feeding to the revised GLA input ERCs sampled from A. What is the largest number of updates that we can force the algorithm to perform, before it converges to the final ranking C1 >> C2 >> . . . >> Cn consistent with A? Again, we consider the corresponding set Ā of EWCs obtained according to the correspondence (29) in section 3.2: every L is replaced with −1, every e with 0, and every W with 1 or 1/2, depending on whether that W belongs to an ERC with 1 or 2 winner-preferrers. To illustrate, I provide in (55b) the sets of EWCs corresponding to the two sets of ERCs in (55a). Again, the number of updates performed by the revised GLA on Pater’s set A of input ERCs is bounded by the number of updates performed by the Perceptron on the corresponding set Ā of EWCs, leading to the following corollary. The proof is presented in appendix F of the online supplementary materials.

COROLLARY 2. The worst-case number of updates performed by the revised GLA (45) on Pater’s set A of input ERCs grows with the number n of constraints at most as n⁵ (starting from null initial weights).

The mistake bounds for the revised GLA obtained through this strategy are admittedly worse than the quadratic mistake bound obtained by Tesar and Smolensky (1998) for EDCD. Furthermore, my mistake bounds are not general, as they are tailored to specific sets of input ERCs, while Tesar and Smolensky’s bound holds for any set of ERCs corresponding to n constraints (see also Heinz and Riggle 2011:71 for relevant discussion). Yet the analysis presented here has two advantages over Tesar and Smolensky’s classical analysis of EDCD. First, it is compatible with constraint promotion, which was argued to be necessary from a modeling perspective in section 4.2. Second, even though both analyses rest on the idealization of OT-compatible data, my analysis based on importing machine learning results into OT might allow us to take advantage of the extensive body of literature on the Perceptron’s robustness to noise, as I elaborate in the next section.

5 Conclusions

The peculiar notion of OT-compatibility (7) enforces strict domination, according to which the highest-ranked relevant constraint ‘‘takes it all.’’ Because of this property, OT does not seem to have any close correspondent within core machine learning. The current toolkit for computational OT thus consists mainly of combinatorial algorithms tailored to OT, developed with few connections to methods and results from machine learning. This classical approach to computational OT corresponds to the top horizontal arrow in the scheme (56).

(56)

graphic

In order to bridge this gap between computational OT and machine learning, various scholars have started to explore the alternative framework of HG, since HG can make use of well-established algorithms from machine learning, namely, algorithms for linear classification. Theorem 1 from section 3 now enters the scene. It says that machine learning algorithms for HG can actually be systematically translated into algorithms for OT according to the scheme (56). In step (56a), the set of ERCs A given with an instance RP(A) of the ranking problem is translated into a properly defined set Ā of derived EWCs, according to (29). In step (56b), machine learning algorithms for HG are used to solve the corresponding weighting problem WP(Ā)—namely, to find a weight vector θ HG-compatible with the data Ā or else to determine that the data are not HG-compatible. Finally, in step (56c), the solution to the derived weighting problem WP(Ā) is translated into a provably correct solution to the original ranking problem RP(A): if the former admits no solutions, then the latter is declared to admit no solutions either; otherwise, the weight vector θ that solves the former is translated through (30) into derived rankings that are guaranteed to solve the latter. This is the new algorithmic strategy anticipated in section 1. Thus, theorem 1 provides computational OT with a new toolkit.

To illustrate the fruitfulness of this new toolkit for computational OT, I presented a detailed application in section 4. Various modeling studies lend support to Boersma’s GLA implementation of the classical error-driven learning scheme (e.g., Boersma and Levelt 2000, Boersma and Hayes 2001, Curtin and Zuraw 2002). Yet Pater (2008) has shown that the GLA does not converge in the general case. Thus, one of the main open questions in computational OT is this: how could the GLA be modified in order to guarantee convergence, so as to retain its modeling virtues without sacrificing computational soundness? I have introduced a revised GLA, which differs from the original GLA because of a more careful calibration of the promotion component of the update rule. And I have offered a proof of the convergence of this revised GLA. The core idea of the proof is represented in (57).

(57)

graphic

Once ERCs are mapped to derived EWCs as in (29) and weight vectors to derived rankings as in (30), the revised GLA can be reinterpreted as an instance of the Perceptron algorithm. Convergence results for the Perceptron thus translate into convergence results for the revised GLA. The scheme (57) is thus a concrete illustration of the general approach (56). The potential of this new approach is further illustrated in the rest of this section, with a number of possible further applications left for future research.

As depicted in (57), the Perceptron can be translated into an error-driven OT ranking algorithm. But there is nothing special about the Perceptron. The reasoning developed in section 4 is completely general: any error-driven algorithm for HG can be translated into a corresponding error-driven ranking algorithm for OT. This new computational perspective thus greatly enriches the algorithmic tools at our disposal for implementing the error-driven learning scheme within OT. Let me make this point concrete with an example. All OT update rules considered so far in the literature, such as the original GLA’s update rule (35), the (gradual) EDCD’s update rule (36), and my revised GLA’s update rule (45), are additive, in the sense that winner-preferrers are promoted by adding to their current weight a small positive amount, and (undominated) loser-preferrers are demoted by adding to their current weight a small negative amount. These various implementations of the GLA thus correspond to the additive update rule (48) of the Perceptron: the weight vector is updated by adding component by component the EWC that is triggering the update. Another important class of HG error-driven algorithms has a multiplicative update rule (Kivinen, Warmuth, and Auer 1997). Here is an example. Suppose that the current weight vector θ̄old = (θ̄1old, . . . , θ̄nold) is not HG-compatible with the current EWC ā = [ā1, . . . , ān]. Then let the updated weight vector θ̄new = (θ̄1new, . . . , θ̄nnew) be defined component by component as in (58), where Z is the normalization coefficient and η is a properly chosen positive constant (called plasticity or stepsize). Modulo normalization, the updated weight θ̄knew is obtained by multiplying the current weight θ̄kold by the (exponential of the η-rescaled) corresponding entry āk in the current EWC.

(58)

graphic

The HG error-driven algorithm (47) with this multiplicative update rule (58) is called the Winnow algorithm. It is known to converge for a properly chosen stepsize η (Littlestone 1988).13
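
As a concrete sketch of the multiplicative scheme, the Python fragment below implements an update of the form (58). Since the displayed equation sits in the graphic, the normalization adopted here (weights renormalized to sum to 1) is an assumption on my part, one standard choice for multiplicative algorithms; starting from nonnegative weights, the weights then stay nonnegative at every iteration (cf. footnote 13).

```python
import math

def winnow_update(theta_bar, ewc, eta=0.5):
    """Multiplicative rule in the style of (58): each weight is
    multiplied by exp(eta * a_k) and the vector is renormalized by Z
    (assumed here to make the weights sum to 1)."""
    unnormalized = [t * math.exp(eta * a) for t, a in zip(theta_bar, ewc)]
    Z = sum(unnormalized)
    return [u / Z for u in unnormalized]

theta_bar = [1 / 3, 1 / 3, 1 / 3]            # nonnegative initialization
theta_bar = winnow_update(theta_bar, [0.5, 0.5, -1])
```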

The Winnow algorithm can now be ported from HG into OT in exactly the same way I adapted the Perceptron into the revised GLA in section 4. Suppose that the current weight vector θold = (θ1old, . . . , θnold) entertained by the OT error-driven ranking algorithm (34) is not OT-compatible with the current ERC a = [a1, . . . , an]. We map this ERC into a derived EWC ā = [ā1, . . . , ān] using the mapping (29) devised in section 3: each L is replaced with −1 and each W is replaced with 1/w, where w is the total number of winner-preferrers. We then update the current weight vector in response to the ERC a by applying the Winnow update rule (58) to the corresponding derived EWC ā—namely, as in (59). Modulo normalization, the weight corresponding to winner-preferrers is multiplied by the (exponential of the η-rescaled) promotion coefficient 1/w; and the weight of loser-preferrers is multiplied by the (exponential of the η-rescaled) demotion coefficient −1. Let me dub the OT error-driven algorithm (34) with this update rule (59) the multiplicative GLA.

(59)

graphic

By reasoning as in section 4, I conclude that a run of the multiplicative GLA on a set of input ERCs can be mimicked by a run of the Winnow algorithm on the set of corresponding EWCs. Convergence of the Winnow algorithm thus translates into convergence of the multiplicative GLA. The reasoning presented so far can be summarized with the diagram in (60), which is one more instance of the new algorithmic strategy (56).

(60)

graphic

The Perceptron additive update rule (48) and the Winnow multiplicative update rule (58) have been compared extensively in the machine learning literature, with the two update rules outperforming each other on different types of data sets (Kivinen, Warmuth, and Auer 1997). Even though the additive update rule of the GLA has become very popular in the computational OT literature, substantial computational and modeling work will be needed in order to determine whether ranking algorithms based on the additive Perceptron fare better than ranking algorithms based on the multiplicative Winnow or vice versa. These new computational developments might lead to new, improved tools for modeling the acquisition of sound patterns in OT.

Most of the OT computational literature has focused so far on an idealized learning scenario, whereby the data are assumed to be consistent (no noise) and learning is assumed to be categorical (no variation). Tesar and Smolensky’s (1998) analysis of EDCD crucially relies on the assumption that the data are compatible with an OT grammar—that is, that the data have not been corrupted by noise and do not display variation. And the convergence proof for the revised GLA presented in section 4 crucially relies on the idealized assumption of consistent data too. Boersma’s (1997, 1998) GLA is purported in the literature to be robust to noise and to be able to model variation (in its stochastic version), but no analytical results are currently available. In conclusion, very little is currently known concerning the robustness of the error-driven learning model in more realistic learning scenarios. The algorithmic tools developed in this article have the potential to lead to substantial progress in this direction, as sketched in (61).

(61)

graphic

As seen in section 4, a revised version of the GLA can be interpreted as the classical Perceptron algorithm for HG. A very large body of literature has studied the computational properties of the Perceptron algorithm (see, e.g., Cesa-Bianchi and Lugosi 2006:chap. 12). In particular, robustness to noise on the part of the Perceptron as well as a number of variants thereof has been thoroughly investigated (see, e.g., Freund and Schapire 1999 (building on Klasner and Simon 1995) and Shalev-Shwartz and Singer 2005, as well as Khardon and Wachman 2007 for a review and experimental results). The reinterpretation of the GLA in terms of the Perceptron might thus allow current machine learning results on the Perceptron’s robustness to be translated into the first analytical results on the GLA’s robustness.

Throughout the second part of this article, I have focused on error-driven ranking algorithms. These are algorithms for the ranking problem RP(A) that ‘‘work by row’’—in other words, look at A one ERC at a time. On the contrary, batch-ranking algorithms ‘‘work by column’’ and thus need to look at the entire set of ERCs A at once. Tesar (1995) and Tesar and Smolensky (1998) develop an efficient batch-ranking algorithm, called Constraint Demotion (CD). In Magri 2012b, I note that CD for the ranking problem in OT ‘‘corresponds’’ to the classical Fourier-Motzkin Elimination Algorithm (FMEA) for the weighting problem in HG (see, e.g., Bertsimas and Tsitsiklis 1997:70–74), in the sense that the scheme in (62) holds: if we map ERCs into derived EWCs as in (29) and weight vectors into derived rankings as in (30), then CD turns into a special application of the FMEA.

(62)

graphic

This reinterpretation of CD in terms of the FMEA might turn out to be useful from the following perspective. CD works roughly as follows: it starts building the ranking from the top, and it assigns to the highest available slot in the ranking a constraint that is never loser-preferring among those ERCs that are not already accounted for by constraints already assigned to a higher position in the unfolding ranking. At the first iteration, CD thus assigns to the top slot a constraint that is never loser-preferring; at the next iteration, it assigns to the next highest slot a constraint that is loser-preferring only in ERCs where the first constraint is winner-preferring; and so on. At a certain iteration, the algorithm might have to choose among different constraints, each of which could be assigned to the highest currently available slot. Hayes (2004) and Prince and Tesar (2004) have suggested that in such cases, the algorithm should not choose at random, but instead should choose according to specific principles heuristically designed in order to bias CD toward a final ranking that ranks the faithfulness constraints as low as possible. Reinterpreting CD as a special case of the FMEA allows us to obtain a more general variant of CD, whereby the algorithm is no longer required to loop through the constraints in the order dictated by how they become available for ranking. In other words, it allows us to develop variants of CD, say, where the first constraint ranked by the algorithm is not necessarily one that is never loser-preferring, so that we don’t necessarily have to start to construct the ranking from the top. These developments might lead to a more efficient implementation of Hayes’s (2004) and Prince and Tesar’s (2004) heuristics for restrictiveness.
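
The top-down construction just described can be sketched compactly. The Python fragment below is my own illustration, not Tesar and Smolensky's implementation; it places all simultaneously placeable constraints into a single stratum (the text instead describes choosing one constraint per slot, which is where the Hayes/Prince-Tesar heuristics apply), with constraints identified by 0-based indices.

```python
def constraint_demotion(ercs, n):
    """At each round, place into the next stratum every constraint that
    prefers no loser in the ERCs not yet accounted for; then discard
    the ERCs in which some newly placed constraint prefers the winner."""
    remaining = list(ercs)
    unranked = set(range(n))
    strata = []
    while unranked:
        placeable = [k for k in sorted(unranked)
                     if all(erc[k] != 'L' for erc in remaining)]
        if not placeable:
            return None        # the input ERCs are not OT-consistent
        strata.append(placeable)
        unranked.difference_update(placeable)
        remaining = [erc for erc in remaining
                     if not any(erc[k] == 'W' for k in placeable)]
    return strata

# On the diagonal ERCs of (54a) with n = 4, this returns
# [[0], [1], [2], [3]], i.e., the ranking C1 >> C2 >> C3 >> C4.
diagonal = [['W', 'L', 'e', 'e'], ['e', 'W', 'L', 'e'], ['e', 'e', 'W', 'L']]
print(constraint_demotion(diagonal, 4))
```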

Notes

I wish to thank Adam Albright for lots of help. The article has greatly benefited from comments by an anonymous LI reviewer. I also wish to thank Paul Boersma, Bruce Tesar, and Paul Smolensky for useful conversations on the material presented here. Earlier versions of this article have been presented at NECPhon 2 (Yale University, 15 November 2008), at DGfS 31 (Osnabrück, 6 March 2009), and at WCCFL 29 (University of Arizona, 23 April 2011); I wish to thank the audiences at those venues for useful discussion. Some of the material presented here has appeared as Magri 2011a. This work was supported in part by a ‘‘Euryi’’ grant from the European Science Foundation (‘‘Presupposition: A Formal Pragmatic Approach’’ to Philippe Schlenker).

1 A framework close in spirit to OT was popular in the operations research literature in the 1970s (Fishburn 1974). Indeed, Tesar and Smolensky’s (1998) Constraint Demotion algorithm was recently rediscovered within this literature (Dombi, Imreh, and Vincze 2007). The OT framework might also have connections with the machine learning field of preference learning (Fürnkranz and Hüllermeier 2010), although these possible connections have yet to be explored.

2 The supplementary online materials for this article are available at http://www.mitpressjournals.org/doi/suppl/10.1162/ling_a_00139.

3 I have chosen this name because it is analogous to elementary ranking condition, which has become the standard name in the literature for the analogous notion within OT; see (15) below. Bane, Riggle, and Sonderegger (2010) call EWCs difference vectors.

4 In the formulation (13a) of the weighting problem, I require the set Ā of EWCs to be finite. There is no need to also explicitly require the set A of ERCs in the ranking problem (17a) to be finite, as the total number of ERCs is finite, provided the constraint set is finite.

5 As discussed in detail in Tesar 2007, the result only holds as long as we consider a finite set of data triplets. Indeed, if the set is infinite, there might not exist any bound on the number of constraint violations, so that the constant Δ might not be defined.

6 The idea of a numerical representation of rankings is actually implicitly already present in Tesar and Smolensky’s (1998) notion of the offset of a constraint with respect to a ranking, defined as the number of strata above that constraint in that ranking.

7 The variant of the demotion-only update rule (36) that demotes all loser-preferrers (both the currently undominated and the dominated ones) converges too, but can require a very large number of updates (exponential in the number of constraints); see Magri 2009 for details. This suggests that it is always a good idea to restrict demotion to only the currently undominated loser-preferrers, rather than demoting all loser-preferrers.

8 Pater’s (2008) counterexample consists of the second set of ERCs in (55a) below.

9 Consider for instance the case of language (39a). In order for /ba/ not to be neutralized to [pa], the ranking (40ai) is needed. In order for /za/ to be neutralized to [sa], the ranking (40aii) is needed. Given ranking (40aii), in order for /abza/ not to be neutralized to [apsa], the ranking (40aiii) is needed. Furthermore, given the ranking (40aii), in order for /abza/ not to be neutralized to [absa], the ranking (40aiv) is needed too.

10 A remark on the issue of the nonnegativity of the weights is in order here. In the description of HG in section 2, I introduced the restriction (4) that the weights be nonnegative, in order to prevent undesired typological predictions. This restriction requires algorithms for HG to enforce nonnegativity of the weights. This is not trivial in the case of the Perceptron, as nothing in the design of the update rule (48) enforces nonnegativity of the weights. The recent HG computational literature usually tries to get around this problem by starting out with large positive weights (see, e.g., Jesney and Tessier 2009). But this trick does not guarantee that the weights will stay nonnegative at every iteration until convergence, as the number of updates depends on the size of the initial weights. Furthermore, cutting off the weights at zero jeopardizes the Perceptron convergence theorem, recalled in appendix E of the online supplementary materials. Strictly speaking, the Perceptron is thus not an algorithm for standard HG; rather, it is an algorithm for a variant of HG without the nonnegativity restriction (4). See footnote 13 for a better error-driven algorithm for standard HG that maintains the nonnegativity of the weights in a principled way.

11 Strictly speaking, this is not completely correct. In fact, in its current formulation the lemma requires the weights to be nonnegative, which is not necessarily the case for the current weights entertained by the (original or revised) GLA or by the Perceptron; see footnote 10. Yet, as noted in appendix C of the online supplementary materials, the lemma holds also without the nonnegativity restriction on the weights, provided that the input ERCs all have a unique L. And that is indeed the case for the set of input ERCs in (46a). The case of input ERCs with an arbitrary number of L’s is addressed in appendix D.2 of the online supplementary materials, where the reasoning sketched in this section is formalized.

12 More precisely, I am using here the formulation of lemma 3 provided in appendix D.1 of the online supplementary materials, which applies here because each input diagonal ERC has a unique L.

13 As noted in footnote 10, the Perceptron is strictly speaking not an algorithm for HG, as it does not ensure that the current weights stay nonnegative, even if the initial weights are large and positive. The Winnow algorithm instead maintains the nonnegativity of the weights at any iteration, provided that it is initialized with nonnegative weights. Winnow is thus a better algorithm for HG than the Perceptron, even though the Perceptron is more widely used in the HG computational literature.

References

Bane, Max, and Jason Riggle. 2009. Evaluating strict domination: The typological consequences of weighted constraints. Paper presented at the 45th annual meeting of the Chicago Linguistic Society. To appear in the proceedings.
Bane, Max, Jason Riggle, and Morgan Sonderegger. 2010. The VC dimension of constraint-based grammars. Lingua 120:1194–1208.
Bernhardt, Barbara Handford, and Joseph Paul Stemberger. 1998. Handbook of phonological development from the perspective of constraint-based nonlinear phonology. San Diego, CA: Academic Press.
Bertsimas, Dimitris, and John N. Tsitsiklis. 1997. Introduction to linear optimization. Belmont, MA: Athena Scientific.
Block, H. D. 1962. The Perceptron: A model of brain functioning. Review of Modern Physics 34:123–135. Reprinted in James A. Anderson and Edward Rosenfeld, eds. 1988. Neurocomputing. Vol. 1, Foundations of research. Cambridge, MA: MIT Press.
Boersma, Paul. 1997. How we learn variation, optionality and probability. In Proceedings of the Institute of Phonetic Sciences (IFA) 21, ed. by Rob van Son, 43–58. Amsterdam: University of Amsterdam, Institute of Phonetic Sciences.
Boersma, Paul. 1998. Functional phonology. Doctoral dissertation, University of Amsterdam. The Hague: Holland Academic Graphics.
Boersma, Paul. 2009. Some correct error-driven versions of the Constraint Demotion Algorithm. Linguistic Inquiry 40:667–686.
Boersma, Paul, and Bruce Hayes. 2001. Empirical tests of the Gradual Learning Algorithm. Linguistic Inquiry 32:45–86.
Boersma, Paul, and Clara Levelt. 2000. Gradual constraint-ranking learning algorithm predicts acquisition order. In Proceedings of the 30th Child Language Research Forum, ed. by Eve V. Clark, 229–237. Stanford, CA: CSLI Publications. Corrected version available as Rutgers Optimality Archive, ROA 361. Available at http://roa.rutgers.edu.
Boersma, Paul, and Joe Pater. To appear. Convergence properties of a gradual learning algorithm for Harmonic Grammar. In Harmonic Grammar and Harmonic Serialism, ed. by John McCarthy and Joe Pater. London: Equinox.
Cesa-Bianchi, Nicolò, and Gábor Lugosi. 2006. Prediction, learning, and games. Cambridge: Cambridge University Press.
Coetzee, Andries W., and Joe Pater. 2008. Weighted constraints and gradient restrictions on place cooccurrence in Muna and Arabic. Natural Language and Linguistic Theory 26:289–337.
Cristianini, Nello, and John Shawe-Taylor. 2000. An introduction to support vector machines and other kernel-based learning methods. Cambridge: Cambridge University Press.
Curtin, Suzanne, and Kie Zuraw. 2002. Explaining constraint demotion in a developing system. In Boston University Conference on Language Development (BUCLD) 26, ed. by Barbora Skarabela, Sarah Fish, and Anna H.-J. Do, 1:118–129. Somerville, MA: Cascadilla.
Dombi, József, Csanád Imreh, and Nándor Vincze. 2007. Learning lexicographic orders. European Journal of Operational Research 183:748–756.
Fishburn, P. C. 1974. Lexicographic orders, utilities and decision rules: A survey. Management Science 20:1442–1471.
Freund, Yoav, and Robert E. Schapire. 1999. Large margin classification using the Perceptron algorithm. Machine Learning 37:277–296.
Fürnkranz, Johannes, and Eyke Hüllermeier. 2010. Preference learning. Heidelberg: Springer.
Gnanadesikan, Amalia E. 2004. Markedness and faithfulness constraints in child phonology. In Constraints in phonological acquisition, ed. by René Kager, Joe Pater, and Wim Zonneveld, 73–108. Cambridge: Cambridge University Press. Circulated since 1995.
Hayes, Bruce. 2004. Phonological acquisition in Optimality Theory: The early stages. In Constraints in phonological acquisition, ed. by René Kager, Joe Pater, and Wim Zonneveld, 158–203. Cambridge: Cambridge University Press.
Hayes, Bruce, and Colin Wilson. 2008. A maximum entropy model of phonotactics and phonotactic learning. Linguistic Inquiry 39:379–440.
Heinz, Jeffrey, and Jason Riggle. 2011. Learnability. In The Blackwell companion to phonology, ed. by Marc van Oostendorp, Colin J. Ewen, Elizabeth V. Hume, and Keren Rice, 1:54–78. Malden, MA: Wiley-Blackwell.
Jesney, Karen, and Anne-Michelle Tessier. 2009. Biases in Harmonic Grammar: The road to restrictive learning. Natural Language and Linguistic Theory 29:251–290.
Johnson, David S., and Franco P. Preparata. 1978. The densest hemisphere problem. Theoretical Computer Science 6:93–107.
Jusczyk, Peter W., Angela D. Friederici, Jeanine M. I. Wessels, Vigdis Y. Svenkerud, and Ann Marie Jusczyk. 1993. Infants’ sensitivity to the sound patterns of native language words. Journal of Memory and Language 32:402–420.
Keller, Frank. 2000. Gradience in grammar: Experimental and computational aspects of degrees of grammaticality. Doctoral dissertation, University of Edinburgh.
Keller, Frank. 2005. Linear Optimality Theory as a model of gradience in grammar. In Gradience in grammar: Generative perspectives, ed. by Gisbert Fanselow, Caroline Féry, Ralph Vogel, and Matthias Schlesewsky, 270–287. Oxford: Oxford University Press.
Keller, Frank, and Ash Asudeh. 2002. Probabilistic learning algorithms and Optimality Theory. Linguistic Inquiry 33:225–244.
Khardon, Roni, and Gabriel Wachman. 2007. Noise tolerant variants of the Perceptron algorithm. Journal of Machine Learning Research 8:227–248.
Kivinen, Jyrki, Manfred K. Warmuth, and Peter Auer. 1997. The Perceptron algorithm versus Winnow: Linear versus logarithmic mistake bounds when few input variables are relevant. Artificial Intelligence 97:325–343.
Klasner, Norbert, and Hans-Ulrich Simon. 1995. From noise-free to noise-tolerant and from on-line to batch learning. In Proceedings of the 8th Annual Conference on Computational Learning Theory, ed. by Wolfgang Maass, 250–257. Association for Computing Machinery.
Legendre, Géraldine, Yoshiro Miyata, and Paul Smolensky. 1998a. Harmonic Grammar: A formal multilevel connectionist theory of linguistic well-formedness: An application. In Proceedings of the 12th annual conference of the Cognitive Science Society, ed. by Morton Ann Gernsbacher and Sharon J. Derry, 884–891. Mahwah, NJ: Lawrence Erlbaum.
Legendre, Géraldine, Yoshiro Miyata, and Paul Smolensky. 1998b. Harmonic Grammar: A formal multilevel connectionist theory of linguistic well-formedness: Theoretical foundations. In Proceedings of the 12th annual conference of the Cognitive Science Society, ed. by Morton Ann Gernsbacher and Sharon J. Derry, 388–395. Mahwah, NJ: Lawrence Erlbaum.
Littlestone, Nick. 1988. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning 2:285–318.
Lombardi, Linda. 1999. Positional faithfulness and voicing assimilation in Optimality Theory. Natural Language and Linguistic Theory 17:267–302.
Magri, Giorgio. 2009. A theory of individual level predicates based on blind mandatory implicatures: Constraint promotion for Optimality Theory. Doctoral dissertation, MIT, Cambridge, MA.
Magri, Giorgio. 2011a. HG has no computational advantages over OT. In WCCFL 29, ed. by Jaehoon Choi, E. Alan Hogue, Jeffrey Punske, Deniz Tat, Jessamyn Schertz, and Alex Trueman, 380–388. Somerville, MA: Cascadilla Proceedings Project.
Magri, Giorgio. 2011b. An online model of the acquisition of phonotactics within Optimality Theory. In Expanding the space of cognitive science: Proceedings of the 33rd annual conference of the Cognitive Science Society, ed. by Laura Carlson, Christoph Hölscher, and Thomas F. Shipley, 2012–2017. Austin, TX: Cognitive Science Society.
Magri, Giorgio. 2012a. Convergence of error-driven ranking algorithms. Phonology 29:213–269.
Magri, Giorgio. 2012b. A note on the equivalence between Recursive Constraint Demotion and the Fourier-Motzkin Elimination Algorithm. Ms., CNRS, Université Paris 8.
Magri, Giorgio. 2013. The complexity of learning in Optimality Theory and its implications for the acquisition of phonotactics. Linguistic Inquiry 44:433–468.
McLeod, Sharynne, Jon van Doorn, and Vicki A. Reed. 2001. Normal acquisition of consonant clusters. American Journal of Speech-Language Pathology 10:99–110.
Novikoff, Albert B. J. 1962. On convergence proofs on Perceptrons. In Proceedings of the Symposium on the Mathematical Theory of Automata, 12:615–622.
Pater, Joe. 2008. Gradual learning and convergence. Linguistic Inquiry 39:334–345.
Pater, Joe. 2009. Weighted constraints in generative linguistics. Cognitive Science 33:999–1035.
Potts, Christopher, Joe Pater, Karen Jesney, Rajesh Bhatt, and Michael Becker. 2010. Harmonic Grammar with linear programming: From linear systems to linguistic typology. Phonology 27:1–41.
Prince, Alan. 2002. Entailed ranking arguments. Ms., Rutgers University, New Brunswick, NJ. Rutgers Optimality Archive, ROA 500. Available at http://roa.rutgers.edu.
Prince, Alan, and Paul Smolensky. 2004. Optimality Theory: Constraint interaction in generative grammar. Oxford: Blackwell. Initially published in 1993 as Technical Report CU-CS-696-93, Department of Computer Science, University of Colorado at Boulder, and Technical Report TR-2, Rutgers Center for Cognitive Science, Rutgers University, New Brunswick, NJ. Rutgers Optimality Archive, ROA 537. Available at http://roa.rutgers.edu.
Prince, Alan, and Bruce Tesar. 2004. Learning phonotactic distributions. In Constraints in phonological acquisition, ed. by René Kager, Joe Pater, and Wim Zonneveld, 245–291. Cambridge: Cambridge University Press.
Riggle, Jason. 2009. The complexity of ranking hypotheses in Optimality Theory. Computational Linguistics 35:47–59.
Rosenblatt, Frank. 1958. The Perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review 65:386–408.
Rosenblatt, Frank. 1962. Principles of neurodynamics. New York: Spartan.
Shalev-Shwartz, Shai, and Yoram Singer. 2005. A new perspective on an old Perceptron algorithm. In Learning theory: 18th Annual Conference on Learning Theory, ed. by Peter Auer and Ron Meir, 264–278. Springer.
Stemberger, Joseph Paul, and Barbara Handford Bernhardt. 1999. The emergence of faithfulness. In The emergence of language, ed. by Brian MacWhinney, 417–446. Mahwah, NJ: Lawrence Erlbaum.
Stemberger, Joseph Paul, Barbara Handford Bernhardt, and Carolyn E. Johnson. 1999. U-shaped learning in the acquisition of prosodic structure. Poster presented at the sixth International Child Language Congress.
Tesar, Bruce. 1995. Computational Optimality Theory. Doctoral dissertation, University of Colorado, Boulder. Rutgers Optimality Archive, ROA 90. Available at http://roa.rutgers.edu.
Tesar, Bruce. 2007. A comparison of lexicographic and linear numeric optimization using violation difference ratios. Ms., Rutgers University, New Brunswick, NJ. Rutgers Optimality Archive, ROA 939. Available at http://roa.rutgers.edu.
Tesar, Bruce, and Paul Smolensky. 1998. Learnability in Optimality Theory. Linguistic Inquiry 29:229–268.
Tessier, Anne-Michelle. 2009. Frequency of violation and constraint-based phonological learning. Lingua 119:6–38.
