## Abstract

Various authors have recently endorsed Harmonic Grammar (HG) as a replacement for Optimality Theory (OT). One argument for this move is that OT seems not to have close correspondents within machine learning while HG allows methods and results from machine learning to be imported into computational phonology. Here, I prove that this argument in favor of HG and against OT is wrong. In fact, I show that any algorithm for HG can be turned into an algorithm for OT. Hence, HG has no computational advantages over OT. This result allows tools from machine learning to be systematically adapted to OT. As an illustration of this new toolkit for computational OT, I prove convergence for a slight variant of Boersma’s (1998) (nonstochastic) Gradual Learning Algorithm.

## 1 Introduction

The peculiar property of Optimality Theory (OT; Prince and Smolensky 2004) is that it uses *constraint ranking* and thus enforces *strict domination*, according to which the highest-ranked relevant constraint ‘‘takes it all.’’ Because of this property, OT seems prima facie not to have any close correspondents within core machine learning.^{1} For this reason, the toolkit available nowadays in computational OT for modeling language acquisition, production, and perception consists mainly of combinatorial algorithms, specifically tailored to the framework of OT, developed with few connections to methods and results in machine learning. Tesar and Smolensky’s (1998) powerful ranking algorithms well exemplify this approach to computational OT.

In order to bridge this gap between computational phonology and machine learning, various scholars have started to entertain and explore variants of OT that replace constraint ranking with *constraint weighting* and strict domination with *additive interaction*, and thus fall within the general class of *linear models* very well studied in machine learning. An important and simple such model is *Harmonic Grammar* (HG; Legendre, Miyata, and Smolensky 1990a,b). For instance, Pater (2009) writes, ‘‘[I will] illustrate and extend existing arguments for the replacement of OT’s ranked constraints with [HG’s] weighted ones: that the resulting model of grammar . . . is compatible with well-understood algorithms for learning and other computations’’ (p. 1002) and in particular ‘‘for learning variable outcomes and for learning gradually’’ (p. 1021). He then adds that ‘‘the strengths of HG in this area are of considerable importance’’ (p. 1002): in fact, ‘‘as these algorithms are broadly applied with connectionist and statistical models of cognition, this forms an important connection between the HG version of generative linguistics and other research in cognitive science’’ (p. 1021). In other words, HG has been conjectured to be computationally superior to OT because it can make use of algorithms from machine learning (i.e., algorithms for linear classification), unlike OT. This conjecture of an alleged computational superiority of HG over OT has been endorsed by a number of authors in the recent literature (e.g., Coetzee and Pater 2008, Hayes and Wilson 2008, Jesney and Tessier 2009, Potts et al. 2010, Boersma and Pater, to appear).

In section 2, I briefly review the two frameworks of OT and HG. In section 3, I then *prove* that this conjecture of an alleged computational superiority of HG over OT is false. In fact, the main result of this article is a simple strategy that allows any algorithm for HG to be turned into an algorithm for OT. Hence, HG has no computational advantages over OT, and the departure from OT to HG is not warranted on the basis of computational considerations. Of course, this result does not in any way provide an argument in favor of OT or against alternative frameworks such as HG. It only shows that the argument against OT and in favor of HG based on the conjectured computational superiority of the latter does not go through. Still, my result on the systematic portability of algorithms from HG into OT is significant because it leads to a substantial enrichment of the current toolkit of computational OT. As noted above, computational OT has relied so far mainly on combinatorial algorithms specifically tailored to the framework of OT, with few connections to machine learning. The result presented in section 3 allows this classical toolkit to be supplemented with a whole new set of algorithmic tools, obtained by systematically adapting to OT well-known algorithms from machine learning. A proper combination of the classical toolkit with the new one has the potential to spur new research in computational modeling of language production, perception, and acquisition within the framework of OT.

As an initial illustration of the fruitfulness of these new algorithmic tools, section 4 describes a specific application in some detail. An obvious property of language acquisition is that it is gradual, in the sense that the target adult language is approached through a path of conservative, intermediate stages. This gradualness suggests the following learning scheme within OT. The algorithm maintains a current ranking, which represents its current hypothesis of the target adult grammar. Data come in a stream. Every time the current piece of data is inconsistent with the current ranking, constraints are slightly reranked. As the ranking dynamics is driven by errors made on the stream of data, the algorithm is called *error-driven*. The sequence of slight rerankings describes a path within the space of possible OT grammars, thus modeling gradual child acquisition paths. It is mainly for this reason that error-driven learning has been endorsed in the OT acquisition literature (see, e.g., Bernhardt and Stemberger 1998, Boersma and Levelt 2000, Gnanadesikan 2004, as well as Tessier 2009 for discussion).

Two main implementations of the error-driven learning scheme have been developed in the OT computational literature, reviewed in section 4.1. One is Tesar and Smolensky’s (1998) *Error-Driven Constraint Demotion* (EDCD), or Boersma’s (1998:323–327) gradual reformulation thereof. The other is Boersma’s (1997) *Gradual Learning Algorithm* (GLA). The main difference between (gradual) EDCD and the GLA is that the former only performs constraint demotion while the latter performs both promotion and demotion. This difference is crucial, from both a computational and a modeling perspective. From a *computational* perspective, lack of constraint promotion allowed Tesar and Smolensky to prove that EDCD converges; that is, it can only make a finite, small number of errors. On the contrary, constraint promotion was shown by Pater (2008) to prevent the GLA from converging in the general case. Although a liability from a computational perspective, constraint promotion turns into an advantage from a *modeling* perspective, as argued in section 4.2, building on Bernhardt and Stemberger 1998, Boersma 1998, and Magri 2012a.

Against the background of this tension between the computational and modeling perspectives, the following question stands out as one of the main open issues in computational OT: is it possible to devise a variant of the GLA that performs constraint promotion and yet provably converges, so as to retain its modeling virtues without sacrificing computational soundness? In section 4, I show that the new approach to computational OT developed in this article leads to a simple solution to this important question. I introduce a variant of the (nonstochastic) GLA and I prove that it converges in the general case, even though it performs both constraint demotion and promotion. The proof uses the result on the portability of algorithms from HG into OT established in section 3: convergence of the revised GLA is shown to follow from a classical machine learning result—namely, convergence of the *Perceptron* algorithm for HG.

In section 5, I summarize this new approach to computational OT, based on a systematic translation into OT of methods and results from machine learning.

In order to make the article accessible to the reader with no computational inclination, in the body of the article I explain the main results in a plain, nontechnical way, focusing on concrete examples. Detailed proofs are provided in the appendices available as supplementary online materials.^{2}

## 2 Description of the Frameworks of OT and HG

This section reviews the frameworks of OT and HG with an eye to the formal details, presupposing general familiarity with these frameworks. Section 2.1 introduces the two corresponding core computational problems: *weighting* and *ranking*. Section 2.2 restates these two problems in terms of a more compact notation that will turn out to be useful in the rest of the article. Finally, section 2.3 reviews what is currently known concerning the relationship between the weighting and ranking problems.

### 2.1 Basic Description of HG and OT

The basic data unit in both HG and OT is a *data triplet* as in (1a). The first entry in the triplet provides the *underlying form*, here notated *x*. The second entry provides the intended *winner candidate* for that underlying form, here notated *y*. The final entry in the triplet provides a *loser candidate*, here notated ~~*z*~~ (as a mnemonic, I strike out losers).

(1)

- a. (*x*, *y*, ~~*z*~~)
- b. (/rad/, [rat], [~~rad~~])

An example of an underlying/winner/loser form triplet is provided in (1b): the underlying form /rad/ is paired with the two candidate surface forms [rat] and [rad], together with the information that the former is the intended winner while the latter is a loser.

An HG grammar is parameterized by a *weight vector* (2a), which is a tuple **θ** = (*θ*_{1}, . . . , *θ*_{n}) with *n* numerical components *θ*_{1}, . . . , *θ*_{n}, one for each of the *n* constraints *C*_{1}, . . . , *C*_{n}. The *k*th component *θ*_{k} is called the *weight* of the corresponding constraint *C*_{k}.

An example is provided in (2b). The constraint set contains three constraints. Constraints *C*_{1} and *C*_{2} are the faithfulness constraints IDENT[VOICE]/ONSET and IDENT[VOICE] (henceforth, *F*_{pos} and *F*, respectively), which enforce preservation of voicing in onset position and in an arbitrary position, respectively. Constraint *C*_{3} is the markedness constraint *[+VOICE, −SONORANT] (henceforth, *M*) that penalizes voiced obstruents. The weight vector **θ** assigns these three constraints *C*_{1}, *C*_{2}, and *C*_{3} the three weights 8, 2, and 4, respectively.

To complete the description of the HG framework, we need a notion of ‘‘compatibility’’ between a hypothesis (i.e., a weight vector) and a piece of data (i.e., a data triplet). A weight vector **θ** is called *HG-compatible* with an underlying/winner/loser form data triplet (*x*, *y*, ~~*z*~~) provided condition (3) holds. This condition says that the intended loser ~~*z*~~ violates the constraints ‘‘more severely’’ than the intended winner *y*, in the sense that the sum of the constraint violations for the loser ~~*z*~~ multiplied by the corresponding weights is larger than the sum of the constraint violations for the winner *y* multiplied by the corresponding weights.

(3)

*θ*_{1}*C*_{1}(*x*, ~~*z*~~) + . . . + *θ*_{n}*C*_{n}(*x*, ~~*z*~~) > *θ*_{1}*C*_{1}(*x*, *y*) + . . . + *θ*_{n}*C*_{n}(*x*, *y*)

As an example, note that the weight vector (2b) is HG-compatible with the data triplet (1b). Of course, a weight vector is called *HG-compatible* with a set of data triplets provided it is HG-compatible with every triplet in the set. And a set of data triplets is called *HG-compatible* provided it is compatible with at least one weight vector.
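The weighted comparison in condition (3) is easy to make concrete. The following Python sketch (the encoding of violation profiles as tuples and the function name are my own illustrative conventions, not notation from the article) checks HG-compatibility of the weight vector (2b) with the triplet (1b):

```python
# Sketch of HG-compatibility (condition (3)): the weighted sum of the
# loser's violations must strictly exceed the weighted sum of the winner's.

def hg_compatible(weights, winner_violations, loser_violations):
    """True iff sum_k w_k * C_k(loser) > sum_k w_k * C_k(winner)."""
    loser_sum = sum(w * v for w, v in zip(weights, loser_violations))
    winner_sum = sum(w * v for w, v in zip(weights, winner_violations))
    return loser_sum > winner_sum

# Constraints ordered as (F_pos, F, M) with the weights (8, 2, 4) of (2b).
weights = (8, 2, 4)
winner = (0, 1, 0)   # /rad/ -> [rat]: violates only IDENT[VOICE]
loser = (0, 0, 1)    # /rad/ -> [rad]: violates only *[+VOICE, -SONORANT]
print(hg_compatible(weights, winner, loser))  # True: 4 > 2
```

The violation counts follow the discussion of (1b) and (2b): the winner [rat] devoices a coda obstruent, while the loser [~~rad~~] surfaces with a voiced obstruent.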

If the weights are allowed to be negative, undesired typological consequences follow. Here is an example. Consider again the constraint set in (2b). If we allow negative weights, then the triplet (/ta/, [da], [~~ta~~]) turns out to be HG-compatible (say, with the weights *θ*_{1} = *θ*_{2} = *θ*_{3} = −1). This means that [da] wins over [~~ta~~] as the surface form corresponding to the underlying form /ta/. This result is undesired, as the underlying form /ta/ is unmarked relative to the constraints considered and should therefore always surface faithfully. For this reason, from now on I will require the weights to be nonnegative, as stated in (4).

(4)

*θ*_{1}, . . . , *θ*_{n} ≥ 0

Let me stress that this nonnegativity restriction (4) is not part of the core computational definition of HG, and it can thus be relaxed if needed. Condition (4) is only needed because of the assumption that constraints assign ‘‘violations,’’ and never ‘‘rewards.’’

Suppose that we know the constraints and we are provided with a finite set of data, namely, a finite set of underlying/winner/loser form triplets. We would like to come up with an assignment of (nonnegative) weights to the constraints that ‘‘works’’ for those data. How can we formalize this requirement? If the data are HG-compatible, then of course we want to find a weight vector that is indeed HG-compatible with *all* the data. If the data are not HG-compatible, then no such vector can be found. In this case, we might want to find a weight vector that is HG-compatible with just *most* of the data. Unfortunately, the problem thus formulated cannot be solved efficiently, as it is intractable (Johnson and Preparata 1978). We thus need to content ourselves with a less demanding formulation of the problem. Let’s say that, in the case where the data are not HG-compatible, we just need to detect the incompatibility. These considerations lead to the classical computational problem (5a), which I will refer to as the *weighting problem.* This is the simplest computational problem in HG. On the one hand, this problem is simple because the input to the problem is as rich as possible, as the underlying forms are provided as well as the losers, and the winner forms are completely parsed. On the other hand, this problem is simple because the output of the problem is as unconstrained as possible: the weight vector returned (if any) is only required to be HG-compatible with the data, and there are no further requirements. The weighting problem is thus the kernel of any computational problem that arises in HG. I will denote by WP(𝒟) the instance of the weighting problem (5a) corresponding to a set 𝒟 of data triplets, or equivalently the set of all its solutions.

(5)

- a.
  Given: a constraint set and a data set 𝒟 consisting of a finite number of underlying/winner/loser form triplets.
  Return: ⊥, if the data are not HG-compatible; otherwise, a nonnegative weight vector **θ** that is HG-compatible with the data 𝒟, according to condition (3).
- b.
  Given: the three constraints in (2b) and the two data triplets (/da/, [da], [~~ta~~]) and (/rad/, [rat], [~~rad~~]).
  Return: ⊥, if the two triplets are not HG-compatible; otherwise, nonnegative weights for the constraints that are HG-compatible with the two triplets.

An instance of the weighting problem is provided in (5b). The underlying forms /da/ and /rad/ are paired with the corresponding intended winner surface forms [da] and [rat] and the corresponding intended losers [~~ta~~] and [~~rad~~], respectively. We are asked to determine whether these data are HG-compatible. If they are, we also have to return HG-compatible (nonnegative) weights for the constraints in (2b). Thus, the weight vector in (2b) is a solution to this instance (5b) of the weighting problem.

Let us now turn to OT. An OT grammar is parameterized by a *ranking*, which is a linear order >> over the constraint set, as illustrated in (6a), or equivalently in (6a′). We say that constraint *C*_{h} is >>-*ranked above* another constraint *C*_{k} provided that *C*_{h} >> *C*_{k}.

To illustrate, a ranking over the constraint set in (2b) is provided in (6b): it sandwiches the markedness constraint *C*_{3} in between the two faithfulness constraints *C*_{1} and *C*_{2}, with the positional faithfulness constraint ranked at the top.

Also in the case of OT, data units are underlying/winner/loser form triplets, as in (1). To complete the definition of the OT framework, we need a notion of ‘‘compatibility’’ between a hypothesis (i.e., a ranking) and a piece of data (i.e., a data triplet). A ranking >> is called *OT-compatible* with an underlying/winner/loser form data triplet (*x*, *y*, ~~*z*~~) provided condition (7) holds. This condition says that the intended loser ~~*z*~~ violates the constraints ‘‘more severely’’ than the intended winner *y*, in the sense that, among those constraints that distinguish between winner and loser, the top-ranked one *C*_{top} assigns more violations to the loser than to the winner.

(7)

*C*_{top}(*x*, ~~*z*~~) > *C*_{top}(*x*, *y*)

where *C*_{top} = the top >>-ranked constraint among those that assign a different number of violations to the loser ~~*z*~~ and to the winner *y*.

As an example, the ranking (6b) is OT-compatible with the underlying/winner/loser form triplet (/rad/, [rat], [~~rad~~]) in (1b). Of course, a ranking >> is called *OT-compatible* with a set of data triplets provided it is OT-compatible with every triplet in the set. Furthermore, a set of data triplets is called *OT-compatible* provided it is compatible with at least one ranking.
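Condition (7) admits an equally short Python sketch. Here a ranking is encoded, by my own convention, as a list of constraint indices from highest- to lowest-ranked; everything else follows the definition directly:

```python
# Sketch of OT-compatibility (condition (7)): among the constraints that
# distinguish winner and loser, the top-ranked one must prefer the winner.

def ot_compatible(ranking, winner_violations, loser_violations):
    """ranking: constraint indices, highest-ranked first."""
    for k in ranking:  # scan from highest- to lowest-ranked constraint
        if loser_violations[k] != winner_violations[k]:
            # C_top found: compatible iff it assigns more violations to the loser
            return loser_violations[k] > winner_violations[k]
    return False  # no constraint distinguishes the two candidates

# Constraints (F_pos, F, M) = indices (0, 1, 2); ranking (6b): F_pos >> M >> F.
ranking = [0, 2, 1]
winner = (0, 1, 0)   # /rad/ -> [rat]
loser = (0, 0, 1)    # /rad/ -> [rad]
print(ot_compatible(ranking, winner, loser))  # True: C_top = M prefers [rat]
```

Note the ‘‘takes it all’’ behavior: the scan stops at the first constraint that distinguishes the two candidates, so lower-ranked constraints never matter.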

In complete analogy with the setting considered above for HG, suppose we know the constraint set and are provided with data consisting of a finite number of underlying/winner/loser form triplets. Again, we would like to come up with a constraint ranking that ‘‘works’’ for those data. If the data are OT-compatible, then of course this means that we want to find a ranking that is indeed OT-compatible with *all* the data. If the data are not OT-compatible, then no such ranking can be found. In this case, we might want to find a ranking that is OT-compatible with just *most* of the data. Unfortunately, the problem thus formulated cannot be solved efficiently, as it is intractable (Magri 2013). We thus need to content ourselves with a less demanding formulation of the problem. Let’s say that, in the case where the data are not OT-compatible, we just need to detect the incompatibility. These considerations lead to the classical computational problem (8a), which I will refer to as the *ranking problem.* This problem is completely analogous to the HG weighting problem (5a). The ranking problem is the kernel of any computational problem that arises in OT. I will denote by RP(𝒟) the instance of the ranking problem (8a) corresponding to a set 𝒟 of data triplets, or equivalently the set of its solutions.

(8)

- a.
  Given: a constraint set and a data set 𝒟 consisting of a finite number of underlying/winner/loser form triplets.
  Return: ⊥, if the data are not OT-compatible; otherwise, a ranking >> that is OT-compatible with the data 𝒟, according to condition (7).
- b.
  Given: the three constraints in (2b) and the two data triplets (/da/, [da], [~~ta~~]) and (/rad/, [rat], [~~rad~~]).
  Return: ⊥, if the two triplets are not OT-compatible; otherwise, a ranking of the three constraints that is OT-compatible with the two data triplets.

An instance of the ranking problem is provided in (8b). The underlying forms /da/ and /rad/ are paired with the corresponding intended winner surface forms [da] and [rat] and the corresponding intended losers [~~ta~~] and [~~rad~~], respectively. We are asked to determine whether these data are OT-compatible. If they are, we also have to return an OT-compatible ranking of the constraints in (2b). The unique solution to this instance of the problem is ranking (6b).

### 2.2 A More Compact Notation for the Data in HG and OT

Given an underlying/winner/loser form data triplet (*x*, *y*, ~~*z*~~), the difference (9a) between the number *C*_{k}(*x*, ~~*z*~~) of violations assigned by constraint *C*_{k} to the loser ~~*z*~~ and the number *C*_{k}(*x*, *y*) of violations assigned to the winner *y* is called the *k*th *constraint difference*.

(9)

- a. *C*_{k}(*x*, ~~*z*~~) − *C*_{k}(*x*, *y*)
- b. *F*_{pos}(/rad/, [~~rad~~]) − *F*_{pos}(/rad/, [rat]) = 0
  *F*(/rad/, [~~rad~~]) − *F*(/rad/, [rat]) = −1
  *M*(/rad/, [~~rad~~]) − *M*(/rad/, [rat]) = 1

An example is provided in (9b) for the underlying/winner/loser form data triplet (/rad/, [rat], [~~rad~~]) in (1b) and the constraint set {*F*_{pos}, *F*, *M*} in (2b). The constraint difference corresponding to the positional faithfulness constraint *F*_{pos} = IDENT[VOICE]/ONSET is 0, because that constraint assigns zero violations to the mapping of /rad/ both to [rat] and to [~~rad~~]. The constraint difference corresponding to the general faithfulness constraint *F* = IDENT[VOICE] is −1, because the intended loser [~~rad~~] is fully faithful to the underlying form /rad/, unlike the intended winner [rat]. Finally, the constraint difference corresponding to the markedness constraint *M* = *[+VOICE, −SONORANT] is 1, as only the intended loser [~~rad~~] violates the markedness constraint, while the winner [rat] does not.
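The computation in (9b) can be replayed mechanically; the tuple encoding below is the same illustrative convention used earlier, not notation from the article:

```python
# Constraint differences (9a): for each constraint, the violations of the
# loser minus the violations of the winner.

def constraint_differences(winner_violations, loser_violations):
    return tuple(l - w for w, l in zip(winner_violations, loser_violations))

# (/rad/, [rat], [rad-as-loser]) with constraints ordered (F_pos, F, M):
winner = (0, 1, 0)
loser = (0, 0, 1)
print(constraint_differences(winner, loser))  # (0, -1, 1), as in (9b)
```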

Condition (3) for HG-compatibility can of course be rewritten as in (10), by bringing everything over to one side. This restatement highlights the fact that HG-compatibility is only sensitive to the constraint differences, not to the actual numbers of constraint violations.

(10)

*θ*_{1}[*C*_{1}(*x*, ~~*z*~~) − *C*_{1}(*x*, *y*)] + . . . + *θ*_{n}[*C*_{n}(*x*, ~~*z*~~) − *C*_{n}(*x*, *y*)] > 0

Thus, the information provided by a data triplet that is really needed for the sake of establishing HG-compatibility can be distilled as in (11). The data triplet is paired with a tuple with *n* entries (one for every constraint), with the convention that the *k*th entry is the *k*th constraint difference *C*_{k}(*x*, ~~*z*~~) − *C*_{k}(*x*, *y*). One such *n*-tuple of numbers is called an *elementary weighting condition* (EWC).^{3} It contains all the information that is needed to compare the winner and the loser within HG. A generic EWC is denoted by **ā** and its components by *ā*_{1}, . . . , *ā*_{n}; a finite collection of EWCs is denoted by **Ā**; the collection of EWCs corresponding to a set 𝒟 of data triplets is denoted by **Ā**(𝒟); I often omit zeros for readability.

An example is provided in (11b): we are given a data set 𝒟 consisting of two underlying/winner/loser form triplets (/da/, [da], [~~ta~~]) and (/rad/, [rat], [~~rad~~]); we compute the corresponding constraint differences with respect to the constraint set in (2b), and we arrange them in the two EWCs of **Ā**(𝒟).

With this notation in place, condition (10) for HG-compatibility between a weight vector **θ** = (*θ*_{1}, . . . , *θ*_{n}) and an underlying/winner/loser form data triplet can be restated in terms of the corresponding EWC **ā** = [*ā*_{1}, . . . , *ā*_{n}] as condition (12). Thus, let us say that a weight vector **θ** is *HG-compatible* with an arbitrary EWC **ā** = [*ā*_{1}, . . . , *ā*_{n}] provided condition (12) holds.

(12)

*θ*_{1}*ā*_{1} + . . . + *θ*_{n}*ā*_{n} > 0

Of course, a weight vector **θ** is called *HG-compatible* with a set **Ā** of EWCs provided it is HG-compatible with every EWC in the set. And a set of EWCs is called *HG-compatible* provided it is compatible with at least one weight vector.

The weighting problem has been stated in (5a) in terms of data triplets. But actual data triplets carry superfluous information, and a sharper representation of data triplets is provided by EWCs. Thus, it is convenient to restate the weighting problem in terms of EWCs, as in (13a). I will denote by WP(**Ā**) the instance of the weighting problem (13a) corresponding to a finite set **Ā** of EWCs, or equivalently the set of its solutions. Of course, a weight vector is a solution to the instance of the original weighting problem (5a) for a given set 𝒟 of data triplets if and only if it is a solution to the instance of the problem (13a) for the corresponding set **Ā**(𝒟) of EWCs, namely, WP(𝒟) = WP(**Ā**(𝒟)).

(13)

- a.
  Given: a finite set **Ā** of EWCs.
  Return: ⊥, if the data **Ā** are not HG-compatible; otherwise, a nonnegative weight vector **θ** that is HG-compatible with **Ā**, according to condition (12).
- b.
  Given: the set **Ā** of EWCs in (11b).
  Return: ⊥, if the data **Ā** are not HG-compatible; otherwise, a nonnegative weight vector **θ** for the constraint set in (2b) that is HG-compatible with **Ā**.

As an example, I give in (13b) the reformulation in terms of EWCs of the instance of the weighting problem (5b).
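As a concrete illustration of how standard linear-model machinery applies to (13a), here is a minimal error-driven solver in Python. This is my own sketch, not an algorithm from the article: it performs Perceptron-style updates in the spirit of the algorithms discussed later in section 4, clips weights at zero to enforce the nonnegativity condition (4), and uses a fixed iteration cap as a crude stand-in for a genuine incompatibility test (the actual problem returns ⊥ exactly when the EWCs are not HG-compatible).

```python
# Perceptron-style sketch for the weighting problem (13a); the zero-clipping
# and the iteration cap are simplifications for illustration only.

def solve_weighting(ewcs, n_constraints, max_passes=1000):
    theta = [0.0] * n_constraints
    for _ in range(max_passes):
        error = False
        for a in ewcs:
            if sum(t * ak for t, ak in zip(theta, a)) <= 0:  # condition (12) fails
                theta = [max(0.0, t + ak) for t, ak in zip(theta, a)]
                error = True
        if not error:
            return theta  # HG-compatible with every EWC
    return None  # cap reached: treat as "bottom"

# EWCs for the two triplets of (5b), constraints ordered (F_pos, F, M):
ewcs = [(1, 1, -1), (0, -1, 1)]
theta = solve_weighting(ewcs, 3)
print(theta)  # [2.0, 0.0, 1.0] for these EWCs
print(all(sum(t * ak for t, ak in zip(theta, a)) > 0 for a in ewcs))  # True
```

The returned weights differ from those in (2b), which simply reflects the fact that the weighting problem asks for *some* HG-compatible weight vector, not a particular one.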

An analogous simplification of the representation of the data and of the corresponding core computational problem is available within the framework of OT. Given an underlying/winner/loser form data triplet (*x*, *y*, ~~*z*~~), the constraints can be sorted into *winner-preferring*, *loser-preferring*, and *even* as in (14a), depending on whether the corresponding constraint difference is positive (i.e., the constraint assigns more violations to the loser than to the winner), negative (i.e., the constraint assigns fewer violations to the loser than to the winner), or null (i.e., the constraint assigns the same number of violations to the loser and to the winner).

An example is provided in (14b) for the constraint set {*F*_{pos}, *F*, *M*} in (2b) and the underlying/winner/loser form data triplet (/rad/, [rat], [~~rad~~]) in (1b), building on the constraint differences computed in (9b).

The notion of OT-compatibility in (7) only depends on whether the various constraints are winner-preferring, loser-preferring, or even. Following Prince (2002), the information provided by an underlying/winner/loser form triplet (*x*, *y*, ~~*z*~~) that is really needed for the sake of OT-compatibility can thus be distilled as in (15a). The data triplet is paired with a tuple with *n* entries (one for every constraint), with the convention that the *k*th entry is equal to W, L, or *e* depending on whether the *k*th constraint *C*_{k} is winner-preferring, loser-preferring, or even. One such *n*-tuple of L’s, *e*’s, and W’s is called an *elementary ranking condition* (ERC). It contains all the information that is needed to compare the winner and the loser within OT. A generic ERC is denoted by **a** and its components by *a*_{1}, . . . , *a*_{n}; a finite collection of ERCs is denoted by **A**; the collection of ERCs corresponding to a set 𝒟 of triplets is denoted by **A**(𝒟); I often omit *e*’s for readability.

As an example, I provide in (15b) the collection of ERCs corresponding to the collection of EWCs in (11b) for the two data triplets (/da/, [da], [~~ta~~]) and (/rad/, [rat], [~~rad~~]) and the constraint set {*F*_{pos}, *F*, *M*} in (2b).

With this notation in place, condition (7) for OT-compatibility between a ranking >> and a set of underlying/winner/loser form data triplets can be restated as condition (16a) in terms of the corresponding set of ERCs. Thus, let us say that a ranking >> is *OT-compatible* with an arbitrary set of ERCs provided condition (16a) holds.

As an illustration of the notion of OT-compatibility in (16a), note that the set of ERCs in (15b) is OT-compatible with the ranking *F*_{pos} >> *M* >> *F* in (6b): once the entries are ordered from left to right in >>-decreasing order (by switching the second column with the third), we obtain the ERCs in (16b), which indeed have the desired property that the leftmost non-*e* symbol of every ERC is a W. Of course, a collection of ERCs is called *OT-compatible* provided it is compatible with at least one ranking.
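The reordering test just described can be sketched in Python as follows. The encoding of ERCs as tuples of 'W'/'L'/'e' and of rankings as index lists is my own convention; the ERCs below are recomputed from (9b) and the parallel computation for (/da/, [da], [~~ta~~]), since the tableau of (15b) does not reproduce well in plain text:

```python
# OT-compatibility of a set of ERCs with a ranking, as in (16): scanning
# each ERC from the highest-ranked constraint down, the first non-'e'
# symbol encountered must be a 'W'.

def erc_ot_compatible(ranking, ercs):
    """ranking: constraint indices, highest-ranked first."""
    for erc in ercs:
        for k in ranking:
            if erc[k] != 'e':
                if erc[k] == 'L':
                    return False  # leftmost non-e symbol is an L
                break  # leftmost non-e symbol is a W: this ERC is satisfied
        else:
            return False  # all-e ERC: no constraint prefers the winner
    return True

# ERCs for the two triplets of (5b), constraints ordered (F_pos, F, M):
ercs = [('W', 'W', 'L'), ('e', 'L', 'W')]
print(erc_ot_compatible([0, 2, 1], ercs))  # True: F_pos >> M >> F
print(erc_ot_compatible([0, 1, 2], ercs))  # False: F_pos >> F >> M
```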

The ranking problem has been stated in (8a) in terms of data triplets. But actual data triplets carry superfluous information, and a sharper representation of data triplets is provided by ERCs. Thus, it is convenient to restate the ranking problem in terms of ERCs, as in (17a). I will denote by RP(**A**) the instance of the ranking problem (17a) corresponding to a set **A** of ERCs,^{4} or equivalently the set of its solutions. Of course, a ranking is a solution to the instance of the original ranking problem (8a) for a given set 𝒟 of data triplets if and only if it is a solution to the instance of the problem (17a) for the corresponding set of ERCs **A**(𝒟), namely, RP(𝒟) = RP(**A**(𝒟)).

(17)

- a.
  Given: a set **A** of ERCs.
  Return: ⊥, if the data **A** are not OT-compatible; otherwise, a ranking >> that is OT-compatible with **A**, according to condition (16a).
- b.
  Given: the set **A** of ERCs in (15b).
  Return: ⊥, if the data **A** are not OT-compatible; otherwise, a ranking >> of the constraint set in (2b) that is OT-compatible with **A**.
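One classical direct solution to the ranking problem (17a) is Tesar and Smolensky's Recursive Constraint Demotion, an instance of the ranking algorithms mentioned in section 1. The sketch below is my own reconstruction of the standard recipe, not code from the article: at each round, constraints that prefer no loser in any as-yet-unexplained ERC form the next stratum, and the ERCs they explain are removed.

```python
def rcd(ercs, n_constraints):
    """Recursive Constraint Demotion sketch: returns a stratified ranking
    (list of strata, highest first) or None if the ERCs are not OT-compatible."""
    remaining_ercs = list(ercs)
    remaining_constraints = set(range(n_constraints))
    strata = []
    while remaining_constraints:
        # A constraint may be ranked next iff it prefers no loser
        # in any remaining ERC.
        stratum = [k for k in remaining_constraints
                   if all(erc[k] != 'L' for erc in remaining_ercs)]
        if not stratum:
            return None  # no constraint can be placed: not OT-compatible
        strata.append(sorted(stratum))
        remaining_constraints -= set(stratum)
        # ERCs with a W in the new stratum are now explained.
        remaining_ercs = [erc for erc in remaining_ercs
                          if not any(erc[k] == 'W' for k in stratum)]
    return strata

ercs = [('W', 'W', 'L'), ('e', 'L', 'W')]
print(rcd(ercs, 3))  # [[0], [2], [1]], i.e. F_pos >> M >> F
```

On the ERCs of the running example this recovers the ranking (6b), consistent with the observation above that it is the unique solution of instance (8b).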

### 2.3 What Is Currently Known about the Relationship between the Weighting and Ranking Problems

Lemma 1 summarizes what is currently known concerning the relationship between the two core computational problems in HG and OT, the weighting and ranking problems.

LEMMA 1.

If a finite set 𝒟 of underlying/winner/loser form triplets is OT-compatible, then it is also HG-compatible. More precisely, let >> be a ranking OT-compatible with 𝒟. Without loss of generality, assume that it is (18a): C_{n} is ranked at the top, C_{n−1} below it, and so on, until the bottom-ranked C_{1}.

Then, the weight vector **θ** = (θ_{1}, . . . , θ_{n}) defined in (18b) is HG-compatible with 𝒟, where Δ is the largest constraint difference (ignoring sign) and δ is the smallest positive constraint difference. ▓

For completeness, the proof is recalled in appendix A of the online supplementary materials, after Prince and Smolensky 2004 and Keller 2000, 2005. The idea of the proof is that the highest-takes-all behavior of the notion of OT-compatibility (7) can be mimicked by the weighted notion of HG-compatibility (3) through exponentially spaced weights.^{5} For a comparison between HG and OT from a typological perspective, see for instance Tesar 2007, Pater 2009, Bane and Riggle 2009, and Potts et al. 2010; and for a comparison from a learnability perspective, see Riggle 2009 and Bane, Riggle, and Sonderegger 2010.

Let me illustrate the lemma with an example. Given the data set consisting of the two underlying/winner/loser form triplets (/da/, [da], [~~ta~~]) and (/rad/, [rat], [~~rad~~]), consider the corresponding set of EWCs (11b) and the corresponding set of ERCs (15b), both repeated in (19).

The set of ERCs **A** is OT-compatible with the ranking *F*_{pos} >> *M* >> *F* in (6b). Since in this case Δ = δ = 1, the corresponding weights according to (18b) are *θ*_{F_pos} = 8, *θ*_{F} = 2, and *θ*_{M} = 4, which are exactly the weights considered in (2b). The latter weights are indeed HG-compatible with the set of EWCs **Ā**.
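The exponential weight construction behind lemma 1 can be sketched as follows. I assume base-2 spacing, which matches the worked example (where Δ = δ = 1); the general construction in (18b) adjusts the spacing by Δ and δ. The function name and the index-list encoding of rankings are my own conventions:

```python
# Lemma 1 illustrated: exponentially spaced (doubling) weights mimic the
# strict-domination behavior of an OT ranking within HG.

def exponential_weights(ranking, base=2):
    """ranking: constraint indices, highest-ranked first.
    The top constraint gets base**n, the next base**(n-1), etc."""
    n = len(ranking)
    weights = [0] * n
    for position, k in enumerate(ranking):
        weights[k] = base ** (n - position)
    return weights

theta = exponential_weights([0, 2, 1])  # F_pos >> M >> F, as in (6b)
print(theta)                            # [8, 2, 4], the weight vector of (2b)

# The resulting weights satisfy condition (12) on the running example's EWCs:
ewcs = [(1, 1, -1), (0, -1, 1)]
print(all(sum(t * a for t, a in zip(theta, e)) > 0 for e in ewcs))  # True
```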

The reverse of lemma 1 does not hold; namely, there exist data sets that are HG-compatible but not OT-compatible. Here is a counterexample. Suppose that the set of EWCs is **Ā** in (20a). The corresponding set of ERCs is **A** in (20b). The former is HG-compatible (say, with the weights *θ*_{1} = 3 and *θ*_{2} = *θ*_{3} = 2), but the latter is not OT-compatible.

Example (20) can be made more explicit as follows. In order for a weight vector **θ** = (*θ*_{1}, *θ*_{2}, *θ*_{3}) to be HG-compatible with the first and second EWCs in (20a), the weight of constraint *C*_{1} must be larger than both the weights of constraints *C*_{2} and *C*_{3}, as stated in (21a); in order for a ranking >> to be OT-compatible with the first and second ERCs in (20b), constraint *C*_{1} must be ranked above both constraints *C*_{2} and *C*_{3}, as stated in (21b).

No ranking that satisfies the ranking conditions (21b) can ever be OT-compatible with the third ERC in (20b). A weight vector that satisfies the weighting conditions (21a) can instead be HG-compatible with the third EWC in (20a), provided that the two constraints *C*_{2} and *C*_{3}, despite their small weights, are allowed to join forces and gang up against constraint *C*_{1}, in the sense that the sum *θ*_{2} + *θ*_{3} of their weights is larger than the weight *θ*_{1} of *C*_{1}. The crucial difference between HG and OT is that the former allows these *gang-up effects*, while the latter doesn’t.
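The gang-up effect can be verified exhaustively on three constraints. Since the entries of (20a) and (20b) do not reproduce well in plain text, the EWCs below are my own reconstruction, chosen to be consistent with the conditions in (21) and with the weights *θ*_{1} = 3, *θ*_{2} = *θ*_{3} = 2 mentioned above:

```python
# A plausible instantiation of example (20): the first two EWCs force
# theta_1 to exceed theta_2 and theta_3 individually, while the third
# lets C_2 and C_3 gang up against C_1.
from itertools import permutations

ewcs = [(1, -1, 0), (1, 0, -1), (-1, 1, 1)]
ercs = [tuple('W' if a > 0 else 'L' if a < 0 else 'e' for a in ewc)
        for ewc in ewcs]

# HG: the weights (3, 2, 2) satisfy every EWC, thanks to the gang-up effect.
theta = (3, 2, 2)
print(all(sum(t * a for t, a in zip(theta, e)) > 0 for e in ewcs))  # True

# OT: no ranking makes the leftmost non-e symbol of every ERC a W.
def ot_ok(ranking, erc):
    for k in ranking:
        if erc[k] != 'e':
            return erc[k] == 'W'
    return False

print(any(all(ot_ok(r, erc) for erc in ercs)
          for r in permutations(range(3))))  # False
```

Whichever constraint a ranking puts on top, one of the three ERCs has an L there, so the exhaustive search over all six rankings fails, exactly as the argument from (21) predicts.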

The algorithmic implications of lemma 1 can be brought out as follows. Suppose we are given a finite set 𝒟 of data triplets that happen to be not only HG-compatible but also OT-compatible; and suppose we are interested in solving the instance WP(**Ā**) of the weighting problem for the corresponding set **Ā** = **Ā**(𝒟) of EWCs. Lemma 1 says that, instead of solving the weighting problem WP(**Ā**) *directly*, we can solve it *indirectly*, through the three steps (22a–c). In step (22a), we construct the corresponding set **A** of ERCs, according to (15). In step (22b), we solve the corresponding instance RP(**A**) of the ranking problem. Finally, in step (22c), we obtain a weight vector that solves the given weighting problem WP(**Ā**) through (18).

In other words, the weighting problem reduces to the ranking problem, in the sense that we can solve the former by solving the latter instead (provided the data are OT-compatible). Yet, as recalled in section 1, we already know how to solve the weighting problem, as we can draw on the large literature on linear models (see, e.g., Potts et al. 2010). What we are really looking for instead are good methods to solve the ranking problem. The fact that we can reduce the weighting problem to the ranking problem is thus of no algorithmic interest, and this reduction therefore has no interesting algorithmic implications.

## 3 Any Algorithm for HG Can Be Ported into OT

In the preceding section, I have reviewed the two frameworks of OT and HG and the two corresponding core computational problems: the ranking problem RP(**A**) and the weighting problem WP(**Ā**), repeated in (23) and (24).

(23)

Given: a finite set **A** of ERCs.

Return: ⊥ if the data **A** are not OT-compatible; otherwise, a ranking >> that is OT-compatible with **A**.

(24)

Given: a finite set **Ā** of EWCs.

Return: ⊥ if the data **Ā** are not HG-compatible; otherwise, a nonnegative weight vector **θ** that is HG-compatible with **Ā**.

The question addressed in this section can roughly be stated as follows: given an arbitrary instance of the ranking problem (23), is it possible to pair it with an instance of the weighting problem (24) such that we can solve the former by solving the latter instead? This question can be stated more precisely as follows: given an instance RP(**A**) of the ranking problem, is it possible to find one (or, even better, all) of its solutions without solving the problem *directly* but rather by solving it *indirectly*, through the scheme in (25)? This scheme proceeds as follows. In step (25a), we pair the given set **A** of ERCs with a proper set **Ā** of EWCs. In step (25b), we find a solution **θ** to the corresponding weighting problem WP(**Ā**), or else determine that it admits no solution. Finally, in step (25c), we pair that solution **θ** with a ranking >>, or else return ⊥ if no solution to the weighting problem could be found. We hope that the ranking thus obtained solves the instance RP(**A**) of the ranking problem that we started with, whenever a solution exists; and that ⊥ is returned whenever a solution to the ranking problem does not exist.

### 3.1 The Intuitive Idea

In order to implement the scheme in (25), we need to define the two steps (25a) and (25c); that is, we need to find proper ways to pair ERCs with EWCs and weight vectors with rankings. Let me introduce the very simple, core idea with a few examples. Consider first the case of the ERC **a** in (26). Crucially, it contains a unique entry equal to W. Define the corresponding EWC **ā** as follows: the *e* of *C*_{1} is replaced with 0; the W of *C*_{2} is replaced with 1; and the L of *C*_{3} is replaced with −1.

A weight vector **θ** = (*θ*_{1}, *θ*_{2}, *θ*_{3}) is HG-compatible with this derived EWC **ā** provided *θ*_{2} − *θ*_{3} is strictly positive—equivalently, provided the weight *θ*_{2} corresponding to constraint *C*_{2} is strictly larger than the weight *θ*_{3} corresponding to constraint *C*_{3}. Consider a ranking that ‘‘respects’’ the ordering implicit in the relative size of these weights. Any such ranking thus ranks *C*_{2} above *C*_{3}. The latter ranking condition ensures OT-compatibility with the ERC **a** that we started from.

Consider next the case of the ERC **a** in (27). Crucially, it contains two entries equal to W. Define the corresponding EWC **ā** as follows: the L of *C*_{3} is again replaced by −1; but the two W’s of *C*_{1} and *C*_{2} are now replaced by ½ (rather than by 1), to capture the fact that this ERC has two winner-preferrers.

A weight vector **θ** = (*θ*_{1}, *θ*_{2}, *θ*_{3}) is HG-compatible with this derived EWC **ā** provided that the quantity ½*θ*_{1} + ½*θ*_{2} − *θ*_{3} is strictly positive—equivalently (by multiplying everything by 2), provided (*θ*_{1} − *θ*_{3}) + (*θ*_{2} − *θ*_{3}) is strictly positive. This implies in particular that either (*θ*_{1} − *θ*_{3}) is strictly positive or (*θ*_{2} − *θ*_{3}) is strictly positive (or both). Again, consider a ranking that ‘‘respects’’ the ordering implicit in the relative size of these weights. Any such ranking either ranks constraint *C*_{1} above *C*_{3} or ranks constraint *C*_{2} above *C*_{3} (or both). The latter ranking condition ensures OT-compatibility with the ERC **a** that we started from.

As a final example, consider again the set of ERCs (20b), repeated as **A** in (28). Consider the corresponding set **Ā** of EWCs in (28). Note that the two W’s in the first two ERCs of **A** are replaced with +1 in **Ā**, as they have a unique winner-preferrer; while the two W’s in the last ERC are each replaced with ½, as that ERC has two winner-preferrers.

As noted above, the set of ERCs **A** is not OT-compatible. It is easy to check that the derived set **Ā** of EWCs is not HG-compatible either. In fact, a weight vector **θ** = (*θ*_{1}, *θ*_{2}, *θ*_{3}) needs to satisfy the two inequalities *θ*_{1} > *θ*_{2} and *θ*_{1} > *θ*_{3} in order to be HG-compatible with the first two EWCs. Adding them together yields 2*θ*_{1} > *θ*_{2} + *θ*_{3}. The latter inequality is equivalent (dividing everything by 2) to *θ*_{1} > ½*θ*_{2} + ½*θ*_{3}, which says that **θ** is not HG-compatible with the third EWC in **Ā**.

### 3.2 Main Claim

The reasoning just illustrated with the three examples (26)–(28) holds in the general case, as follows. Given an ERC **a** with a total of *w* entries equal to W, consider the EWC **ā** derived from **a** as in (29): every entry equal to L in **a** corresponds to −1 in **ā**; every entry equal to *e* in **a** corresponds to 0 in **ā**; and every entry equal to W in **a** corresponds to 1/*w* in **ā**.

Let us say that a set of EWCs **Ā** is *derived* from a set of ERCs **A** provided that the EWCs are derived from the ERCs according to (29). The examples (26)–(28) illustrate this construction.
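The derivation (29) can be written down directly as a short function. This is a sketch under the representation assumed throughout these examples (an ERC as a sequence of "W", "L", "e" entries); `erc_to_ewc` is an illustrative name, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

def erc_to_ewc(erc):
    """Derivation (29): L -> -1, e -> 0, and each of the w entries
    equal to W -> 1/w. Assumes the ERC contains at least one W."""
    w = erc.count("W")
    return [Fraction(-1) if x == "L" else
            Fraction(0) if x == "e" else
            Fraction(1, w)
            for x in erc]

unique_w = erc_to_ewc(["e", "W", "L"])   # as in (26): entries 0, 1, -1
double_w = erc_to_ewc(["W", "W", "L"])   # as in (27): entries 1/2, 1/2, -1
```

Applying the function to a whole set of ERCs, entry by entry, yields the derived set of EWCs in the sense just defined.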

Let us say that a ranking >> is *derived* from a weight vector **θ** = (*θ*_{1}, . . . , *θ*_{n}) provided it ‘‘respects’’ the order implicitly defined by the relative size of the weights, in the sense that condition (30) holds for every pair of constraints *C*_{h} and *C*_{k}. The idea of this correspondence between weight vectors and rankings is due to Boersma (1998, 2009).

(30)

*θ*_{h} > *θ*_{k} ⇒ *C*_{h} >> *C*_{k}

Let me unpack condition (30), by considering two cases in turn. If all the components of the weight vector **θ** are pairwise distinct, the vector **θ** admits a unique derived ranking—namely, the unique ranking that ranks a constraint *C*_{k} above a constraint *C*_{h} if and only if the weight *θ*_{k} of *C*_{k} is larger than the weight *θ*_{h} of *C*_{h}. This case is illustrated in (31a): the unique derived ranking assigns *C*_{1} to the top, *C*_{2} to the bottom, and *C*_{3} in between, respecting the relative sizes of their weights.

If instead the components of the weight vector **θ** are *not* all pairwise distinct, then **θ** admits multiple derived rankings, because a tie between two weights can be broken differently by different derived rankings. This case is illustrated in (31b): the weights of *C*_{2} and *C*_{3} tie, and thus this weight vector admits two derived rankings, which break the tie in two different ways.
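The two cases can be made concrete by brute-force enumeration. The weights below are illustrative stand-ins (the actual values of (31) are not reproduced in the text), and the exhaustive search over permutations is only feasible for small constraint sets.

```python
from itertools import permutations

def derived_rankings(theta):
    """All rankings derived from theta via (30), each encoded as a tuple of
    constraint indices, highest-ranked first: whenever theta[h] > theta[k],
    constraint h must precede constraint k."""
    n = len(theta)
    out = []
    for perm in permutations(range(n)):
        position = {c: i for i, c in enumerate(perm)}  # smaller index = higher-ranked
        if all(position[h] < position[k]
               for h in range(n) for k in range(n)
               if theta[h] > theta[k]):
            out.append(perm)
    return out

# Pairwise distinct weights, as in case (31a): a unique derived ranking.
print(derived_rankings((10, 2, 5)))   # [(0, 2, 1)]
# A tie, as in case (31b): two derived rankings, breaking the tie both ways.
print(derived_rankings((10, 5, 5)))   # [(0, 1, 2), (0, 2, 1)]
```

With all-equal weights the function returns every ranking, which matches the intuition that a fully tied vector constrains the derived ranking not at all.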

The main result of this section is theorem 1 below. Its proof is a straightforward generalization of the reasoning illustrated above with the three examples (26)–(28), as shown in appendix B of the online supplementary materials. This proof crucially relies on the restriction (4) that HG weights be nonnegative. Appendix C in the online supplementary materials presents a variant of this result where that restriction is waived, but the given ERCs are required to contain a unique loser-preferrer.

Given a set of ERCs **A** and the corresponding set of EWCs **Ā** derived through (29), consider the corresponding instance RP(**A**) of the ranking problem (23) and the corresponding instance WP(**Ā**) of the weighting problem (24). HG-compatibility of **Ā** entails OT-compatibility of **A**. Does the reverse hold too? That is indeed the case: OT-compatibility of **A** entails HG-compatibility of **Ā**. Thus, the ranking problem RP(**A**) admits a solution if and only if the corresponding weighting problem WP(**Ā**) admits one. Assume that they indeed admit a solution. If **θ** is a solution to the weighting problem WP(**Ā**), then any of its derived rankings is a solution to the ranking problem RP(**A**). In other words, we can obtain *some* of the solutions to the ranking problem RP(**A**) by looking for the derived rankings of solutions to the weighting problem WP(**Ā**). Does the reverse hold too? Namely, is it the case that, if a ranking is a solution to the ranking problem RP(**A**), then it is derived from some weight vector that solves the weighting problem WP(**Ā**)? In other words, is it the case that, by looking for derived rankings of solutions to the weighting problem WP(**Ā**), we obtain not just some but actually all of the solutions to the ranking problem RP(**A**)? That is indeed the case. In fact, consider a ranking >> that solves the ranking problem RP(**A**). Without loss of generality, assume that it is *C*_{n} >> *C*_{n−1} >> . . . >> *C*_{1} (otherwise, just relabel the constraints). Consider the weight vector **θ** defined from >> as in (18b), where Δ is the largest entry (ignoring sign) of the derived set of EWCs **Ā**. Note that >> is derived from **θ**; that is, it satisfies condition (30). Furthermore, **θ** solves the weighting problem WP(**Ā**). These facts together entail the following theorem:

THEOREM 1.

*Given a set of ERCs* **A**, *consider the corresponding set of EWCs* **Ā** *derived from* **A** *as in (29). Then,* **Ā** *is HG-compatible if and only if* **A** *is OT-compatible. Furthermore, a ranking solves the instance* RP(**A**) *of the ranking problem (23) if and only if it is derived via (30) from a (nonnegative) weight vector that solves the corresponding instance* WP(**Ā**) *of the weighting problem (24).*

This theorem says that the scheme (25) holds provided the mapping (25a) from sets of ERCs to sets of EWCs is defined as in (29) and the mapping (25c) from weight vectors to rankings is defined as in (30). In other words, the theorem says that it is possible to solve a given instance of the ranking problem (23) without solving it directly, but rather by solving a corresponding instance of the weighting problem (24). Thus, any algorithm for the weighting problem in HG can be turned into an algorithm for the ranking problem in OT. The conjecture of an alleged computational superiority of HG over OT, recalled from the literature in section 1, is thus wrong.

## 4 An Application to the GLA’s Convergence Problem

Computational OT has so far relied mainly on combinatorial algorithms specifically tailored to the framework of OT, developed with few connections to machine learning. As recalled in section 1, one of the reasons why various authors have started to entertain the alternative framework of HG is that it can straightforwardly make use of standard machine learning algorithms, thus bridging the gap between computational phonology and machine learning. Yet section 3 has shown that this alleged advantage of HG over OT is only apparent, as any algorithm for HG can be ported into OT, through theorem 1. This result thus allows the standard combinatorial toolkit of computational OT to be supplemented with a whole new set of algorithmic tools, obtained through a systematic translation into OT of well-known machine learning algorithms. In this section, I illustrate the fruitfulness of these new tools, by discussing in detail a specific application. Further applications left for future research are sketched in section 5.

In the strongest formulation of the OT framework, the constraint set is universal; it is shared by both children and adults and thus need not be learned. The acquisition of phonology thus consists of the problem of learning a constraint ranking that captures the target adult phonology. How could such a ranking be systematically inferred? Suppose that markedness constraints are initially ranked at the top and faithfulness constraints at the bottom, allowing only completely unmarked forms. Over time, the learner receives a stream of data from the target adult language.

Each time a piece of data is received, the learner checks whether its current constraint ranking accounts for the current piece of data. Whenever the learner makes an error, the relevant faithfulness constraints are slightly promoted and the relevant markedness constraints are slightly demoted. Reranking continues until the faithfulness and markedness constraints intersperse in a ranking that is consistent with the target adult language, so that the learner makes no more mistakes. This is the *error-driven ranking model* of the acquisition of phonology in OT. The intermediate rankings entertained by the model on its way to the final grammar correspond to intermediate acquisition stages, thus modeling the observed acquisition gradualness. Furthermore, the model is memoryless; that is, it does not keep track of previously seen forms and thus does not impose unrealistic memory requirements. Because of its cognitive plausibility, error-driven learning has been endorsed in the OT acquisition literature (see, e.g., Bernhardt and Stemberger 1998, Boersma and Levelt 2000, and Gnanadesikan 2004, as well as Tessier 2009 for critical discussion). How can this intuitively plausible learning scheme be formalized into a computationally sound learning algorithm?

Two main error-driven ranking algorithms have been developed in the OT computational literature, reviewed in section 4.1: one is Tesar and Smolensky’s (1998) *Error-Driven Constraint Demotion* (EDCD), or Boersma’s (1998:323–327) gradual reformulation thereof; the other is Boersma’s (1997) *Gradual Learning Algorithm* (GLA). The main difference between (gradual) EDCD and the GLA is that the former only performs constraint demotion while the latter performs both constraint promotion and demotion. Various studies have shown the good modeling capabilities of the GLA (see, e.g., Boersma and Levelt 2000, Boersma and Hayes 2001, Curtin and Zuraw 2002). Furthermore, various authors have argued that constraint promotion is needed from a *modeling* perspective, as reviewed in section 4.2. Yet, although a virtue from a modeling perspective, constraint promotion turns into a liability from a *computational* perspective. In fact, lack of constraint promotion allowed Tesar and Smolensky to prove that EDCD converges after a small number of updates. On the contrary, constraint promotion has been shown by Pater (2008) to prevent the GLA from converging in the general case. Against the background of this tension between the modeling and computational perspectives, one of the main open questions in computational OT is thus the following: is it possible to devise a variant of the GLA that performs promotion and yet provably converges, so as to retain its modeling virtues without sacrificing computational soundness?

This section offers the first positive solution to this question. In section 4.3, I introduce a variant of the GLA that differs from the original GLA because of a more careful calibration of constraint promotion. The analysis of this revised GLA developed in the rest of this section relies on the algorithmic portability from HG into OT established by theorem 1. In fact, in sections 4.4 and 4.5, I show that, once ERCs are mapped to derived EWCs as in (29) and weight vectors to derived rankings as in (30), this revised GLA can be reinterpreted as an instance of the *Perceptron* algorithm, a classical HG learning algorithm. In sections 4.6 and 4.8, I then use this reinterpretation in order to extend to the revised GLA the convergence properties of the Perceptron, known from the machine learning literature. I thus obtain the first convergence proof for an OT error-driven ranking algorithm that performs both constraint demotion and promotion. In section 4.7, I compare in more detail my revised implementation of the GLA with Boersma’s original formulation.

### 4.1 (Gradual) EDCD and the GLA

Here is a natural formalization of the informal error-driven learning scheme sketched above. The algorithm maintains a *current ranking*, which represents its current hypothesis about the target grammar. It initializes its current ranking to some predefined *initial ranking*. And it keeps updating its current ranking through the three steps in (32). At step (32a), the algorithm receives an ERC; at step (32b), the algorithm checks whether its current ranking is OT-compatible with this *current ERC*; if it isn’t, then the algorithm takes action at step (32c), by updating its current ranking to a ‘‘slightly’’ modified ranking.

I assume that the ERCs fed to the algorithm are sampled from a given OT-compatible set of ERCs **A**, called the *input set*. The algorithm *converges* provided it can only perform a finite number of updates for any input OT-compatible set of ERCs. If the algorithm converges, then its final ranking solves the instance RP(**A**) of the ranking problem (23) corresponding to the input set of ERCs **A**.

As noted in section 3, rankings can be represented through numerical weight vectors: we say that a ranking >> is *derived* from a given weight vector **θ** = (*θ*_{1}, . . . , *θ*_{n}) provided that any pair of constraints *C*_{h}, *C*_{k} satisfies condition (30), repeated in (33). This condition says that the ranking >> respects the ordering of the constraints that is implicit in the relative size of their weights *θ*_{1}, . . . , *θ*_{n}, in the sense that constraint *C*_{h} is ranked above constraint *C*_{k} whenever the weight *θ*_{h} of the former is strictly larger than the weight *θ*_{k} of the latter.

(33)

*θ*_{h} > *θ*_{k} ⇒ *C*_{h} >> *C*_{k}

Boersma (1997, 1998, 2009) suggests using this correspondence between rankings and weight vectors in order to restate the OT error-driven algorithm (32) in terms of weight vectors.^{6} This has proven to be a remarkably important idea in the development of OT error-driven algorithms.

Suppose that, at every time, the algorithm entertains a current weight vector, rather than a current ranking. At a certain iteration, the current weight vector might happen to have two or more identical components, thus admitting multiple derived rankings. How should we proceed in this case? Following Boersma (2009), I assume that the algorithm updates its current weight vector whenever even just one of the rankings derived from the current weight vector is not OT-compatible with the current ERC, as stated in (34).

In fact, suppose that we decided instead to be more conservative, and update the current weight vector just in case none of its derived rankings is OT-compatible with the current ERC. At convergence, the algorithm will thus return a weight vector with the property that *some* of its derived rankings are OT-compatible with the input set of ERCs. But that is not very useful: how do we decide for a given derived ranking whether it is what we want or not?

To complete the description of the algorithm, we need an *update rule* to use in step (34c). Boersma (1997, 1998) puts forward the following intuition. If the algorithm fails on the current ERC, then the winner-preferrers are plausibly currently ranked too low and the loser-preferrers too high. The algorithm should thus react to its failure by promoting winner-preferrers and by demoting loser-preferrers by a small amount—say, 1 as in (35). The OT error-driven ranking algorithm (34) with this update rule (35) is called the (nonstochastic) *Gradual Learning Algorithm* (GLA).

(35)

- a.
Demote each current loser-preferring constraint by 1.

- b.
Promote each current winner-preferring constraint by 1.

Another update rule considered in the literature is (36). It performs constraint demotion but no constraint promotion, as it pushes down the loser-preferrers but does not push up the winner-preferrers. Another difference between the two update rules (35) and (36) is that the former demotes all loser-preferrers while the latter demotes only those loser-preferrers that really need to be demoted—namely, those that are currently undominated. A loser-preferring constraint *C*_{l} is called *currently undominated* provided there is no winner-preferring constraint *C*_{k} that is currently ranked above *C*_{l} (in the sense that the current weight of *C*_{k} is strictly larger than that of *C*_{l}).

(36)

- a.
Demote by 1 each *currently undominated* loser-preferring constraint.

- b.
Do nothing to the winner-preferring constraints.

Boersma (1998:323–327) notes that the OT error-driven algorithm (34) with the demotion-only update rule (36) is a *gradual* version of Tesar and Smolensky’s (1998) EDCD.

Let me illustrate the GLA (35) with a concrete example; EDCD (36) works analogously. Suppose that the input set of ERCs fed to the GLA is (37a). The beginning of a possible run of the algorithm is provided in (37b).

Suppose that the algorithm starts from the null initial vector *θ*_{1}. The algorithm is then fed an input ERC from (37a)—say, ERC 1. Neither of the two winner-preferrers *C*_{1} and *C*_{2} is ranked above the loser-preferrer *C*_{3} according to the current weight vector *θ*_{1}, as all three constraints currently have the same weight. The algorithm thus takes action: the loser-preferrer *C*_{3} is demoted by 1 and the two winner-preferrers *C*_{1} and *C*_{2} are promoted by 1 each, giving *θ*_{2}. The algorithm is then fed another input ERC from (37a)—say, ERC 2. The winner-preferrer *C*_{3} is not ranked above the loser-preferrer *C*_{2} according to the current weight vector *θ*_{2}. The algorithm thus takes action: the loser-preferrer *C*_{2} is demoted by 1 and the winner-preferrer *C*_{3} is promoted by 1, giving *θ*_{3}. And so on.
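The run in (37b) can be replayed in a few lines. Since the actual content of (37a) is not reproduced here, the two ERCs below are a reconstruction from the prose: ERC 1 has winner-preferrers *C*_{1} and *C*_{2} and loser-preferrer *C*_{3}, and ERC 2 has winner-preferrer *C*_{3} and loser-preferrer *C*_{2}; the function names are illustrative.

```python
def gla_update(theta, erc):
    """Boersma's rule (35): demote each loser-preferrer by 1 and
    promote each winner-preferrer by 1."""
    return [t + (1 if x == "W" else -1 if x == "L" else 0)
            for t, x in zip(theta, erc)]

def needs_update(theta, erc):
    """Condition (34): act unless some winner-preferrer strictly outweighs
    every loser-preferrer, i.e., unless every ranking derived from theta
    is OT-compatible with the ERC."""
    winners = [t for t, x in zip(theta, erc) if x == "W"]
    losers = [t for t, x in zip(theta, erc) if x == "L"]
    return bool(losers) and max(winners) <= max(losers)

theta = [0, 0, 0]        # the null initial vector theta_1
erc1 = ["W", "W", "L"]   # reconstruction of ERC 1 in (37a)
erc2 = ["e", "L", "W"]   # reconstruction of ERC 2 in (37a)
for erc in (erc1, erc2):
    if needs_update(theta, erc):
        theta = gla_update(theta, erc)
print(theta)  # [1, 0, 0]: theta_2 = (1, 1, -1) after ERC 1, then theta_3 = (1, 0, 0)
```

Note that `needs_update` fires on ties, matching Boersma's (2009) non-conservative policy discussed above: when a winner-preferrer merely ties a loser-preferrer, some derived ranking is incompatible, and the algorithm updates.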

### 4.2 The Problem of the GLA’s Convergence

Gradual EDCD (36) performs no constraint promotion. It thus predicts a monotonic ranking dynamics, whereby constraints can only drop over time, but cannot rise or oscillate. This simple ranking dynamics allows for a straightforward analysis of the algorithm. Indeed, (gradual) EDCD has been shown to always converge after a number of updates that grows slowly (quadratically) with the number of constraints (Tesar and Smolensky 1998; see also Boersma 1998:323–327, 2009).^{7} The GLA (35) instead performs both constraint demotion and promotion, thus predicting a more complicated ranking dynamics, whereby constraints can drop, or rise, or oscillate. And convergence of the GLA remained an outstanding open issue for almost a decade. Indeed, shortly after the algorithm appeared in the literature, Keller and Asudeh (2002:237) lamented that ‘‘the convergence properties of the GLA are unknown,’’ as ‘‘this leaves open the possibility that there are data sets on which the GLA will not converge or not produce a meaningful set of constraint rankings. Convergence is a crucial property of a learning algorithm that should be investigated formally.’’ Finally, Pater (2008) settled the issue, with a counterexample that shows that the GLA does not converge in the general case (see Magri 2012a for a detailed explanation of Pater’s counterexample).^{8}

Overall, these results seem to show that lack of constraint promotion is a virtue from a *computational perspective*, as it predicts a simple monotonic ranking dynamics that allows for straightforward analyses and powerful guarantees on the behavior of the algorithm. Unfortunately, lack of constraint promotion turns into a liability from a *modeling perspective*, as the monotonic ranking dynamics predicted by demotion-only seems to be too simple to match the attested acquisition complexity. Indeed, various authors in the OT acquisition literature have suggested that demotion-only is not sufficient and that we do want update rules for the OT error-driven algorithm that perform promotion too (see, e.g., Bernhardt and Stemberger 1998, Stemberger and Bernhardt 1999, Stemberger, Bernhardt, and Johnson 1999, Gnanadesikan 2004). An explicit computational argument for constraint promotion is due to Boersma (1997): he argues that constraint promotion is needed in order for (a stochastic variant of ) the GLA to learn certain cases of language variation. In Magri 2012a, I develop another computational argument for constraint promotion (in a nonstochastic setting), based on the challenge raised by modeling the acquisition of phonotactics. Let me briefly sketch the latter argument.

In carefully controlled experimental conditions, 9-month-old infants react differently to licit and illicit sound combinations, thus already displaying knowledge of the target adult phonotactics (Jusczyk et al. 1993). As the acquisition of morphology is plausibly still lagging behind at this early developmental stage, the child is blind to phonological alternations (Hayes 2004). In conclusion, there is a stage of *pure phonotactic learning*, when the child manages to acquire substantial knowledge of the target phonotactics without being exposed to alternations. Many authors have argued that lack of alternations implies that the child can safely posit only fully faithful underlying forms (see, e.g., Gnanadesikan 2004, Hayes 2004, Prince and Tesar 2004). Suppose now that we try to model this early stage of the acquisition of phonotactics by means of the demotion-only EDCD (36). Because of the assumption of fully faithful underlying forms, the faithfulness constraints are never loser-preferrers. As the update rule (36) only demotes loser-preferring constraints, the faithfulness constraints are never reranked throughout this entire learning stage. In other words, their intermediate and final rankings will be identical to their initial ranking. This cannot be right, for at least two reasons. First, the algorithm is not able to learn the phonotactics of languages that require a specific relative ranking of two faithfulness constraints, such as those discussed by Hayes (2004) and Prince and Tesar (2004). Second, the algorithm is not able to model acquisition paths where the child’s repair strategy changes over time, such as the acquisition sequences documented in much of the literature on the acquisition of consonant clusters (McLeod, van Doorn, and Reed 2001).

Let me make these considerations more concrete, by illustrating the first issue with an example. Consider the OT typology in (38), which is a simplified version of an example considered in Prince and Tesar 2004, based on a phonological analysis developed in Lombardi 1999. Both features, [STOP-VOICING] and [FRICATIVE-VOICING], come with specific faithfulness constraints *F*_{1} and *F*_{2} and markedness constraints *M*_{1} and *M*_{2}. Finally, the markedness constraint *M*_{3} = AGREE, which requires adjacent obstruents to agree in voicing, lets the two features interact.

Among the languages in the OT typology (38) are the two in (39). A ranking generates these two languages, (39a) and (39b), provided it is a refinement of the partial orders in (40a) and (40b), respectively.^{9}

Because of the assumption of underlying forms fully faithful to the intended winners, the two faithfulness constraints *F*_{1} and *F*_{2} are never loser-preferrers throughout learning. As EDCD’s update rule (36) only demotes loser-preferrers, it will never rerank *F*_{1} and *F*_{2}, which will thus stay put at their initial ranking values. In other words, the final relative ranking of *F*_{1} and *F*_{2} will be the same as the initial one, no matter which of the two languages (39a) or (39b) the algorithm has been trained on. The algorithm will thus fail to acquire the correct phonotactics in at least one of the two cases in (39), as they require the opposite relative ranking of *F*_{1} and *F*_{2}, by (40). Some constraint promotion is needed, in order to move the faithfulness constraints around too, even though they are never loser-preferrers. In Magri 2012a, I show that the promotion component of the GLA (or, equivalently, of the revised GLA developed in section 4.3) indeed allows the algorithm to learn the correct relative ranking (40) of the two faithfulness constraints. And in Magri 2011b, I generalize this initial result, looking at all possible constraints of the type of *M*_{3} in (38), which is responsible for the interaction between the two features that define the typology. I show that constraint promotion allows the algorithm to always converge to the desired final ranking no matter how the input ERCs are sampled, but for phonologically implausible modes of feature interaction.

Let me take stock. (Gradual) EDCD only performs constraint demotion, while the GLA performs constraint promotion too. Demotion-only allows EDCD to converge quickly, while promotion prevents the GLA from converging. Although a liability from a computational perspective, constraint promotion turns into a virtue from a modeling perspective. Various studies have shown the GLA’s good modeling capabilities (Boersma and Levelt 2000, Boersma and Hayes 2001, Curtin and Zuraw 2002). Boersma (1997) argues that constraint demotion alone is not able to model certain cases of learning in the presence of variation. And in Magri 2012a, I argue that promotion is needed in order to model the early stage of the acquisition of phonotactics, as just sketched. Against this scenario, the following question thus stands out as one of the main open problems in computational OT: is it possible to devise a variant of the GLA that performs promotion too and yet provably converges? In other words, is it possible to retain the GLA’s good modeling capabilities without sacrificing computational soundness?

### 4.3 A Revised GLA

To start, suppose that the current ERC fed to the GLA has a unique L corresponding to some constraint *C*_{l} and a unique W corresponding to some constraint *C*_{k}, as in (41). This is a simple case: the unique winner-preferrer *C*_{k} needs to be ranked above the loser-preferrer *C*_{l} in the end, and can therefore be confidently promoted. In this case, it thus makes good sense, say, to promote *C*_{k} by 1 and demote *C*_{l} by 1, as prescribed by Boersma’s update rule (35).

Next, consider the case where the current ERC still has a unique L but now has multiple W’s. For concreteness, suppose it has two W’s corresponding to two constraints *C*_{h} and *C*_{k}, as in (42a). This ERC by itself does not say which one of the two winner-preferrers *C*_{h} or *C*_{k} should in the end be ranked above the loser-preferrer *C*_{l}. For instance, the set of input ERCs could contain another ERC like (42b), so that only *C*_{h} should be ranked above *C*_{l}, while *C*_{k} must be ranked below it.

Given only the current ERC (42a), there is no way to choose in a principled manner which one, *C*_{h} or *C*_{k}, should be promoted. Nor could we simply promote both, as shown by Pater’s counterexample against the GLA’s convergence.

Boersma’s promotion/demotion update rule (35) does not distinguish between the two cases (41) and (42): all winner-preferrers are promoted by the same amount (say, 1), no matter whether they appear in a simple ERC with a unique winner-preferrer as in (41) or in a challenging ERC with multiple winner-preferrers as in (42). This does not look like a good idea, though: a proper update rule should be sensitive to the crucial difference between the two cases (41) and (42), so as to match the intrinsic logic of OT. I suggest the following modification: in the simple case of ERC (41) with a unique winner-preferrer, we can promote that unique winner-preferrer by 1; but in the challenging case of ERC (42) with two winner-preferrers, we should split our confidence between the two winner-preferrers, by promoting each by just ½. In the general case, if the current ERC contains many W’s, then the total promotion amount of 1 should be split among the many winner-preferrers. I thus suggest the revised promotion/demotion update rule (43) for ERCs like (41) and (42) that contain a unique L: the loser-preferrer is demoted by 1 and each winner-preferrer is promoted by 1 divided by the total number *w* of winner-preferrers.

(43)

- a. Demote the loser-preferrer by 1.
- b. Promote each of the *w* winner-preferrers by 1/*w*.

So far, I have only considered current ERCs with a unique L. If such an ERC is not OT-compatible with the current ranking vector, then its unique loser-preferrer must be currently undominated; that is, it cannot already be ranked underneath a winner-preferrer (in the sense that its weight is larger than or equal to the weight of any winner-preferrer). If the current ERC has multiple L’s, then some of them might be currently undominated and others might not be. If only one loser-preferrer is currently undominated, then we can of course again use the very same update rule (43). What if there are two or more currently undominated loser-preferrers? For concreteness, suppose that there are two currently undominated loser-preferrers *C′* and *C″*, as in the ERC in (44a). Split up that ERC into two ERCs, each of which retains only one of the two L’s of the original ERC, while the other gets replaced by an *e*, as in (44b).

As Prince (2002) notes, the original ERC (44a) and the two ERCs (44b) are *OT-equivalent*, in the sense that they are compatible with exactly the same rankings.

Because of this OT-equivalence, it makes sense to construe the update triggered by the original ERC (44a) as the sequence of the two updates triggered by the two ERCs (44b). Furthermore, since the latter two ERCs contain a single L, we can update in response to each of them using the update rule (43). The two ERCs (44b) have the same number of W’s as the original ERC (44a); call it *w*. Upon update by the first of the two ERCs (44b), the loser-preferrer *C′* is demoted by 1 and each winner-preferrer is promoted by 1/*w*. And upon update by the second of the two ERCs (44b), the loser-preferrer *C″* is demoted by 1 too and each winner-preferrer is promoted once more by 1/*w*. In the end, each undominated loser-preferrer of the original ERC (44a) is demoted by 1, and each winner-preferrer is promoted by 1/*w* for as many times as there are undominated loser-preferrers. Equivalently, each winner-preferrer is promoted by *l*/*w*, as in (45), where *w* is the number of winner-preferrers and *l* is the number of undominated loser-preferrers.

(45)

- a. Demote each of the *l* undominated loser-preferrers by 1.
- b. Promote each of the *w* winner-preferrers by *l*/*w*.

I will refer to the OT error-driven ranking algorithm (34) with this new, better-calibrated promotion/demotion reranking rule (45) as the *revised GLA*.
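The update rules (43) and (45) are simple enough to state as code. The sketch below is my own illustration, not an implementation from this article: the function name, the encoding of ERCs as lists of `'W'`/`'L'`/`'e'` symbols, and the convention of returning `None` when no update is triggered are all assumptions made for the example.

```python
def revised_gla_update(theta, erc):
    """One update of the revised GLA rule (45) (sketch; assumed encoding:
    theta is a list of constraint weights, erc a list of 'W'/'L'/'e')."""
    winners = [i for i, a in enumerate(erc) if a == 'W']
    losers = [i for i, a in enumerate(erc) if a == 'L']
    top = max(theta[i] for i in winners)
    # A loser-preferrer is currently undominated if no winner-preferrer
    # has a strictly larger weight.
    undominated = [i for i in losers if theta[i] >= top]
    if not undominated:
        return None  # every derived ranking is OT-compatible: no update
    new = list(theta)
    for i in undominated:
        new[i] -= 1                                # (45a): demote by 1
    for i in winners:
        new[i] += len(undominated) / len(winners)  # (45b): promote by l/w
    return new
```

On the input set (46a), starting from the null vector, this reproduces the run (46b): ERC 1 yields (1/2, 1/2, −1), ERC 2 then yields (1/2, −1/2, 0), after which neither ERC triggers a further update.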

To illustrate, consider the beginning of a possible run of this revised GLA in (46b), again on the set of input ERCs (37a), repeated in (46a).

Suppose that the algorithm starts from the null initial vector *θ*_{1}. The algorithm is then fed an input ERC from (46a)—say, ERC 1. Neither of the two winner-preferrers *C*_{1} and *C*_{2} is ranked above the loser-preferrer *C*_{3} according to the current weight vector *θ*_{1}, as all three constraints currently have the same weight. The algorithm thus takes action: the loser-preferrer *C*_{3} is demoted by 1 and that same amount is split over the two winner-preferrers *C*_{1} and *C*_{2}, which are thus each promoted by 1/2, giving the new current weight vector *θ*_{2}. The algorithm is then fed another input ERC from (46a)—say, ERC 2. The winner-preferrer *C*_{3} is not ranked above the loser-preferrer *C*_{2} according to the current weight vector *θ*_{2}. The algorithm thus takes action: the loser-preferrer *C*_{2} is demoted by 1 and the winner-preferrer *C*_{3} is promoted by 1, giving the new current weight vector *θ*_{3}. This weight vector ranks *C*_{1} above *C*_{3} and *C*_{3} above *C*_{2}. As the latter ranking is OT-compatible with the data (46a), any ERC that will be fed to the algorithm from this moment on will not trigger any further update.

In the rest of this section, I offer an analysis of the revised GLA (45) along the following lines. If ERCs are paired with derived EWCs as in (29) and weight vectors with derived rankings as in (30), then HG error-driven algorithms can be translated into OT error-driven ranking algorithms. From this perspective, the revised GLA just developed can be reinterpreted as a well-known HG error-driven algorithm, the *Perceptron*. Machine learning results on convergence and robustness of the Perceptron thus translate to the revised GLA. In particular, I easily obtain a convergence guarantee for the revised GLA, which represents the first result on convergence for a ranking algorithm that performs constraint promotion too, besides demotion.

### 4.4 The Perceptron Algorithm

HG error-driven algorithms are analogous to OT error-driven algorithms, the only differences being that they are fed EWCs rather than ERCs and that they check for HG-compatibility rather than for OT-compatibility. Thus, an HG error-driven algorithm can be described as in (47), completely analogous to (34). At step (47a), the algorithm receives an EWC. At step (47b), the algorithm checks HG-compatibility between the current EWC and the current weight vector. At step (47c), the algorithm takes action, in case HG-compatibility does not hold. It is convenient for what follows to denote the current weight vector entertained by an HG error-driven algorithm as **θ̄** = (*θ̄*_{1}, . . . , *θ̄*_{n}), rather than as **θ** = (*θ*_{1}, . . . , *θ*_{n}).

I assume that the EWCs fed to the algorithm at step (47a) are sampled from a given, finite, HG-compatible set **Ā** of EWCs, called the *input set*. The algorithm *converges* provided it can only perform a finite number of updates for any HG-compatible input set. If the algorithm converges, then its final weight vector solves the instance WP(**Ā**) of the weighting problem (24) corresponding to the input set **Ā**.

Different HG error-driven algorithms differ because of the update rule used in step (47c). These update rules are very well studied in the field of *online linear classification* (Cesa-Bianchi and Lugosi 2006:chap. 12 offers a modern introduction). As an example, consider the classical update rule (48): the updated weight vector **θ̄**^{new} = (*θ̄*^{new}_{1}, . . . , *θ̄*^{new}_{n}) is obtained by adding component by component the current EWC **ā** = [*ā*_{1}, . . . , *ā*_{n}] to the current weight vector **θ̄**^{old} = (*θ̄*^{old}_{1}, . . . , *θ̄*^{old}_{n}).

The HG error-driven algorithm (47) with the update rule (48) is called the *Perceptron* algorithm in the machine learning literature (Rosenblatt 1958, 1962, Block 1962, Novikoff 1962, as well as Cristianini and Shawe-Taylor 2000:chap. 2).^{10}
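The additive update rule (48) takes only a couple of lines. The following is a minimal sketch, not the article’s own code: the function name, the list encoding, and the convention of returning `None` when the EWC is already HG-compatible (strictly positive dot product) are my assumptions.

```python
def perceptron_update(theta_bar, ewc):
    """One step of the Perceptron rule (48): if the current EWC is not
    HG-compatible with theta_bar, add it componentwise (sketch)."""
    if sum(t * a for t, a in zip(theta_bar, ewc)) > 0:
        return None  # already HG-compatible: no update
    return [t + a for t, a in zip(theta_bar, ewc)]
```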

To illustrate: Suppose that the set of input EWCs fed to the Perceptron is (49a). The beginning of a possible run of the algorithm is provided in (49b).

Suppose that the algorithm starts from the null initial weight vector *θ̄*_{1}. The algorithm is then fed an input EWC from (49a)—say, EWC 1. The current weight vector *θ̄*_{1} is not HG-compatible with this EWC (as 0 · 1/2 + 0 · 1/2 + 0 · (−1) = 0 ≯ 0). Thus, it is updated according to (48). The first and second components of the updated weight vector *θ̄*_{2} become 1/2, which is the corresponding component of the old weight vector *θ̄*_{1} (namely, 0) plus the corresponding component of the EWC 1 (namely, 1/2); the third component of the updated weight vector *θ̄*_{2} becomes −1, which is the third component of the old weight vector *θ̄*_{1} (namely, 0) plus the third component of the EWC 1 (namely, −1). The algorithm is then fed another EWC from (49a)—say, EWC 2. The current weight vector *θ̄*_{2} is not HG-compatible with this EWC (as 1/2 · 0 + 1/2 · (−1) + (−1) · 1 = −3/2 ≯ 0). Thus, it is updated to *θ̄*_{3} by adding EWC 2 to the current weight vector *θ̄*_{2} component by component. And so on.

### 4.5 A Connection between the Revised GLA and the Perceptron

To analyze the revised GLA means to investigate the properties of the sequence of weight vectors *θ*_{1}, *θ*_{2}, . . . , *θ*_{t}, . . . entertained in a run of the algorithm. To start, let’s look at a concrete example—say, the run of the revised GLA in (46b)—and let’s compare it with the run (49b) of the Perceptron. The two runs start from the same null initial weight vector: *θ*_{1} = *θ̄*_{1}. Crucially, the weight vector entertained at any time in the run of the revised GLA coincides with the weight vector entertained at that same time in the run of the Perceptron: *θ*_{2} = *θ̄*_{2}, *θ*_{3} = *θ̄*_{3}, and so on. Here, I will explain in three steps why this fact actually holds in full generality. In section 4.6, I will then use this fact to deduce convergence of the revised GLA from convergence of the Perceptron.

*Step 1*. Every time the revised GLA is fed ERC 1 or ERC 2 from (46a) in the run (46b), the Perceptron is fed the corresponding EWC 1 or EWC 2 from (49a) in the corresponding run (49b). Crucially, every time the current ERC prompts the revised GLA to perform an update in the run (46b), the corresponding current EWC prompts the Perceptron to perform an update as well in the run (49b). This is not a coincidence. The set of EWCs in (49a) is derived from the set of ERCs in (46a) according to the correspondence (29) considered in section 3.2: each L is replaced with −1 and each W is replaced with 1/*w*, where *w* is the total number of W’s in that ERC. Thus, the lemma from section 3 applies in this case.^{11} Recall that this lemma says that, if a derived EWC is HG-compatible with some weight vector, then the original ERC is OT-compatible with each one of the derived rankings of that weight vector. The contrapositive of this lemma can thus be stated as follows. Suppose that a weight vector admits derived rankings that are not OT-compatible with an ERC from (46a), prompting the revised GLA to perform an update. Then, that weight vector is also not HG-compatible with the corresponding derived EWC in (49a), prompting the Perceptron to perform an update too.
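The correspondence (29) invoked here is straightforward to spell out as code. This helper (name and symbol encoding my own) is just an illustration:

```python
def erc_to_ewc(erc):
    """Mapping (29): each L -> -1, each e -> 0, and each W -> 1/w,
    where w is the total number of W's in the ERC (sketch)."""
    w = erc.count('W')
    return [{'W': 1 / w, 'L': -1, 'e': 0}[a] for a in erc]
```

For instance, it maps the ERCs of (46a) to the EWCs of (49a): [W, W, L] becomes [1/2, 1/2, −1] and [e, L, W] becomes [0, −1, 1].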

*Step 2*. Suppose that an input ERC from (46a) triggers an update according to the revised GLA update rule (45). For concreteness, suppose it is ERC 1 that triggers the update. This means that the two winner-preferrers *C*_{1} and *C*_{2} are promoted by 1/2 and the loser-preferrer *C*_{3} is demoted by 1. Equivalently, the current weight vector is updated by adding 1/2 to its first and second components and by adding −1 to its third component. In other words again, the current weight vector is updated by adding component by component the corresponding EWC 1 in (49a), as prescribed by the Perceptron update rule (48). The same holds for ERC 2. In conclusion, an update triggered by an ERC in (46a) according to the revised GLA update rule (45) is equivalent to the update triggered by the corresponding EWC in (49a) according to the Perceptron update rule (48).

*Step 3*. Suppose that the run (46b) of the revised GLA is continued by feeding the algorithm with ERC 1, as in (50a). As the winner-preferrer *C*_{1} is ranked above the loser-preferrer *C*_{3} according to the current weight vector *θ*_{3}, the revised GLA does nothing and the current weight vector *θ*_{4} is identical to *θ*_{3}. At this point, the Perceptron fails at mimicking the revised GLA. In fact, suppose that the run (49b) of the Perceptron is likewise continued by feeding the algorithm with EWC 1, as in (50b). As the current weight vector *θ̄*_{3} is not HG-compatible with this EWC (because 1/2 · 1/2 + (−1/2) · 1/2 + 0 · (−1) = 0 ≯ 0), the Perceptron will perform an update, unlike the revised GLA. As a result, the weight vector *θ*_{4} entertained by the revised GLA at this iteration is different from the weight vector *θ̄*_{4} entertained by the Perceptron.

The difficulty just highlighted is nonetheless insubstantial. If the current weight vector entertained by the revised GLA admits some derived ranking that is not OT-compatible with the current input ERC, an update is performed. Otherwise, nothing happens and the algorithm just waits for more data. In the latter case, nothing would have been different if the current ERC had not been fed to the algorithm to start with. In other words, data that do not trigger an update are irrelevant and can therefore be ignored. We can thus restrict ourselves without loss of generality to runs of the revised GLA where an update is triggered at every iteration.

We are now ready to put these pieces together. The run (46b) of the revised GLA and the run (49b) of the Perceptron start from the same initial weight vector—namely, the null vector. By step 3, I can assume without loss of generality that the current weight vector is always updated by the revised GLA. Furthermore, every time the current weight vector is updated in the run (46b) of the revised GLA, it is also updated in the run (49b) of the Perceptron, by step 1. Moreover, they are updated in exactly the same way in the two runs, by step 2. It thus follows that the two runs entertain exactly the same sequence of weight vectors. As shown in appendix D of the online supplementary materials, this reasoning can be extended from the concrete example (46)/(49) considered here to the general case, obtaining lemma 3. This lemma says that a run of the revised GLA can be mimicked with a run of the Perceptron. The next section uses this conclusion in order to reduce the analysis of the revised GLA to the analysis of the Perceptron.

LEMMA 3.

*Consider a run (51a) of the revised GLA on an input set* **A** *of ERCs. Assume that an update is triggered at every time (i.e.,* *θ*_{t} ≠ *θ*_{t+1} *at every time* t*). Then, there exists a run (51b) of the Perceptron on an input set* **Ā** *of EWCs such that the sequences of weight vectors in the two runs (51a) and (51b) coincide (i.e.,* *θ*_{t} = *θ̄*_{t} *at every time* t*), provided that the two runs start from the same initial weight vector (i.e.,* *θ*_{1} = *θ̄*_{1}*).*

*This set* **Ā** *of input EWCs for the Perceptron run (51b) is finite; furthermore, it is HG-consistent whenever the set* **A** *of ERCs fed to the revised GLA in the run (51a) is OT-consistent.* ▓
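The lockstep behavior asserted by lemma 3 can be illustrated on the concrete input set (46a)/(49a). The following self-contained sketch (all names and encodings are my own; like the lemma, it assumes that every datum fed in triggers an update) runs the two update rules side by side and checks that they entertain the same weight vectors:

```python
def derived_ewc(erc):
    """Mapping (29): L -> -1, e -> 0, W -> 1/w (w = number of W's)."""
    w = erc.count('W')
    return [{'W': 1 / w, 'L': -1, 'e': 0}[a] for a in erc]

def gla_step(theta, erc):
    """Revised GLA update (45); assumes the ERC triggers an update."""
    winners = [i for i, a in enumerate(erc) if a == 'W']
    top = max(theta[i] for i in winners)
    und = [i for i, a in enumerate(erc) if a == 'L' and theta[i] >= top]
    new = list(theta)
    for i in und:
        new[i] -= 1
    for i in winners:
        new[i] += len(und) / len(winners)
    return new

def perceptron_step(theta_bar, ewc):
    """Perceptron update (48): add the EWC componentwise."""
    return [t + a for t, a in zip(theta_bar, ewc)]

# The input set (46a); both data trigger updates from the null vector.
ercs = [['W', 'W', 'L'], ['e', 'L', 'W']]
theta = theta_bar = [0.0, 0.0, 0.0]
for erc in ercs:
    theta = gla_step(theta, erc)
    theta_bar = perceptron_step(theta_bar, derived_ewc(erc))
    assert theta == theta_bar  # the two runs coincide, as in lemma 3
```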

### 4.6 Convergence of the Revised GLA

The OT acquisition literature has endorsed error-driven learning as a plausible model of children’s acquisition of phonology. Two implementations of error-driven learning have been devised so far within the OT computational literature: Boersma’s (1998) GLA (35) and Tesar and Smolensky’s (1998) (gradual) EDCD (36). The crucial difference between the two algorithms is that the GLA performs both constraint demotion and promotion while EDCD performs only constraint demotion. Lack of constraint promotion allowed Tesar and Smolensky to show that EDCD converges in the general case. Convergence of the GLA remained an open problem for almost a decade, as pointed out by Keller and Asudeh (2002), until Pater (2008) showed that constraint promotion prevents the GLA from converging in the general case. Yet constraint promotion seems to be needed from a modeling perspective, as argued in section 4.2. The following theorem solves this impasse: it says that convergence is not incompatible with constraint promotion, as long as the promotion component of the update rule is properly calibrated as in the revised GLA (45).

THEOREM 2.

*The revised GLA with the properly calibrated promotion/demotion update rule (45) converges for any OT-compatible set of input ERCs.* ▓

The convergence follows straightforwardly from the reinterpretation of the revised GLA in terms of the Perceptron, stated by lemma 3. In fact, suppose by contradiction that the revised GLA did not converge. This means that there exists an OT-compatible set **A** of input ERCs such that at every iteration we can pick from **A** an ERC that forces the revised GLA to perform an update of the current weight vector. Lemma 3 ensures that we can get the Perceptron to mimic that nonconvergent run of the revised GLA. In other words, the lemma ensures that there exists a finite and HG-compatible set **Ā** of EWCs such that at every iteration we can pick from **Ā** an EWC that forces the Perceptron to perform an update of the current weight vector. The latter conclusion contradicts the well-known Perceptron convergence theorem, recalled for completeness in appendix E of the online supplementary materials. In conclusion, the convergence for the revised GLA follows as a translation from HG into OT of the convergence theorem for the Perceptron, providing a first specific application of the algorithmic portability from HG into OT established in section 3. For a different approach to the GLA convergence problem, see Magri 2012a.

### 4.7 What Goes Wrong with the Original GLA

Now that we have seen why the revised GLA (45) converges, it is instructive to go back to the original GLA (35) and understand what goes wrong. As an illustration, consider the run (37b) of the original GLA on the set of input ERCs (37a), repeated in (52a). Consider the corresponding set of EWCs in (52b), obtained by replacing each W with 1, each L with −1, and each *e* with 0. Every time an ERC from (52a) triggers an update according to the original GLA update rule (35), winner-preferrers are promoted by 1 and loser-preferrers are demoted by 1. Equivalently, the current weight vector is updated by adding component by component the corresponding EWC in (52b), as prescribed by the Perceptron update rule (48).

Thus, step 2 of the reasoning outlined in section 4.5 for the revised GLA also holds for the original GLA: each update performed by the original GLA ( just like each update performed by the revised GLA) can be reinterpreted as a Perceptron update, provided that the corresponding EWCs are properly defined as in (52b).

What crucially fails in the case of the original GLA is step 1 of the reasoning outlined in section 4.5, which followed from the result established in section 3. Namely, it is not true that, whenever a weight vector admits a derived ranking not OT-compatible with an ERC in (52a), it is also not HG-compatible with the corresponding EWC in (52b). Here is a counterexample. The weight vector (53) admits derived rankings (such as *C*_{3} >> *C*_{1} >> *C*_{2}) that are not OT-compatible with ERC 1 in (52a). Yet this weight vector is HG-compatible with the corresponding EWC 1 in (52b): even though the weights of *C*_{1} and *C*_{2} are each smaller than the weight of *C*_{3}, they can gang up to overcome the latter.

Suppose now that at a certain point in the run of the original GLA, the current weight vector is indeed (53). If the original GLA is fed ERC 1 from (52a), an update is triggered. But this update cannot be mimicked in the corresponding run of the Perceptron: as the current weight vector (53) is HG-compatible with the corresponding EWC 1 in (52b), no update will be triggered in the Perceptron run. In general, the run of the original GLA can contain many more updates than the corresponding run of the Perceptron. Thus, the Perceptron convergence theorem does not entail convergence of the original GLA, even though updates by the original GLA can be described as Perceptron updates.
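The ganging-up effect behind this failure can be checked numerically. The weights below are hypothetical, chosen only to satisfy the property described in the text (each of the first two weights is smaller than the third, but their sum exceeds it); the actual values in (53) may differ:

```python
# Hypothetical weight vector with the (53)-style ganging-up property:
# theta1, theta2 < theta3 individually, but theta1 + theta2 > theta3.
theta = [2, 2, 3]
ewc1 = [1, 1, -1]  # EWC 1 of (52b): original-GLA mapping W -> 1, L -> -1
# HG-compatible: the two winner-preferrers gang up on the loser-preferrer.
assert sum(t * a for t, a in zip(theta, ewc1)) > 0
# Yet C3 outweighs C1 and C2 individually, so the derived ranking
# C3 >> C1 >> C2 is available and is not OT-compatible with ERC 1:
# the original GLA updates here while the Perceptron does not.
assert theta[2] > theta[0] and theta[2] > theta[1]
```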

### 4.8 Number of Updates

In section 4.6, I focused on the issue of convergence. An important related issue is that of the number of updates required for convergence. Here is a way to address the latter issue. Lemma 3 guarantees that a run of the revised GLA can be mimicked with a run of the Perceptron. Thus, the number of updates performed by the revised GLA on a set of input ERCs is always smaller than or at most equal to the number of updates performed by the Perceptron on the corresponding set of derived EWCs, and bounds on the number of updates performed by the Perceptron yield bounds on the number of updates performed by the revised GLA.

Let’s look at a couple of concrete examples. Following Riggle (2009), let the *diagonal* set of ERCs corresponding to *n* constraints be the set **A** consisting of *n* − 1 ERCs such that the *k*th ERC has all entries equal to *e* except for the *k*th entry, which is a W, and the following entry, which is an L. To illustrate, I give in (54a) the diagonal sets of ERCs for *n* = 4, 5 constraints.

Assume that we keep feeding the revised GLA input ERCs sampled from **A**. What is the largest number of updates that we can force the algorithm to perform, before it converges to the final ranking *C*_{1} >> *C*_{2} >> . . . >> *C*_{n} consistent with **A**?^{12} Lemma 3 ensures that the number of updates performed by the revised GLA is at most the number of updates performed by the Perceptron on the corresponding derived set **Ā** of input EWCs constructed according to the correspondence (29) considered in section 3.2: each L is replaced with −1 and each W is replaced with +1 (as each ERC contains a unique W). To illustrate, I provide in (54b) the sets of EWCs corresponding to the two diagonal sets of ERCs in (54a). The Perceptron convergence theorem recalled in appendix E of the online supplementary materials offers a bound on the number of updates performed by the Perceptron on the derived set **Ā** of input EWCs in terms of parameters that quantify certain geometric properties of **Ā** (namely, its *radius* and *margin*). And these parameters can be computed explicitly as a function of the number *n* of constraints. I thus obtain the bound on the worst-case number of updates performed by the revised GLA on the diagonal set **A** of input ERCs stated in the following corollary. The proof is presented in appendix F of the online supplementary materials.

COROLLARY 1.

*The revised GLA (45) performs no more than* n(n^{2} − 1)/6 *updates on the diagonal set* **A** *of input ERCs corresponding to* n *constraints (starting from null initial weights).* ▓
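Corollary 1 can be probed empirically. The sketch below is my own illustration (function names, the symbol encoding of ERCs, and the round-robin feeding order are assumptions; any valid run is bounded by the worst case): it builds the diagonal set, runs the revised GLA update rule (45) from null weights to convergence, and compares the number of updates against the bound n(n² − 1)/6.

```python
def diagonal_ercs(n):
    """Riggle's diagonal set (54a) for n constraints: the k-th of the
    n - 1 ERCs has a W in position k and an L in position k + 1
    (0-based indices)."""
    ercs = []
    for k in range(n - 1):
        erc = ['e'] * n
        erc[k], erc[k + 1] = 'W', 'L'
        ercs.append(erc)
    return ercs

def gla_run(ercs, n):
    """Round-robin run of the revised GLA update rule (45) from null
    weights; stops after a full pass with no update. Returns the final
    weight vector and the number of updates performed."""
    theta, updates, changed = [0] * n, 0, True
    while changed:
        changed = False
        for erc in ercs:
            winners = [i for i, a in enumerate(erc) if a == 'W']
            top = max(theta[i] for i in winners)
            und = [i for i, a in enumerate(erc)
                   if a == 'L' and theta[i] >= top]
            if und:
                for i in und:
                    theta[i] -= 1
                for i in winners:
                    theta[i] += len(und) / len(winners)
                updates += 1
                changed = True
    return theta, updates
```

For n = 4, for instance, this round-robin run converges to weights that strictly decrease from *C*_{1} to *C*_{4}, well within the corollary’s bound of 10 updates.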

Here is another example. Let *Pater’s set* of ERCs corresponding to *n* constraints be the set **A** of *n* − 1 ERCs obtained from the diagonal set by ‘‘adding’’ a W at the right of every L (except in the last ERC). To illustrate, I give in (55a) Pater’s sets of ERCs corresponding to *n* = 4, 5 constraints.

Assume that we keep feeding the revised GLA input ERCs sampled from **A**. What is the largest number of updates that we can force the algorithm to perform, before it converges to the final ranking *C*_{1} >> *C*_{2} >> . . . >> *C*_{n} consistent with **A**? Again, we consider the corresponding set **Ā** of EWCs obtained according to the correspondence (29) in section 3.2: every L is replaced with −1, every *e* with 0, and every W with 1 or 1/2, depending on whether that W belongs to an ERC with 1 or 2 winner-preferrers. To illustrate, I provide in (55b) the sets of EWCs corresponding to the two sets of ERCs in (55a). Again, the number of updates performed by the revised GLA on Pater’s set **A** of input ERCs is bounded by the number of updates performed by the Perceptron on the corresponding set **Ā** of EWCs, leading to the following corollary. The proof is presented in appendix F of the online supplementary materials.

COROLLARY 2.

*The worst-case number of updates performed by the revised GLA (45) on Pater’s set* **A** *of input ERCs grows with the number* n *of constraints at most as* n^{5} *(starting from null initial weights).* ▓

The mistake bounds for the revised GLA obtained through this strategy are admittedly worse than the quadratic mistake bound obtained by Tesar and Smolensky (1998) for EDCD. Furthermore, my mistake bounds are not general, as they are tailored to specific sets of input ERCs, while Tesar and Smolensky’s bound holds for any set of ERCs corresponding to *n* constraints (see also Heinz and Riggle 2011:71 for relevant discussion). Yet the analysis presented here has two advantages over Tesar and Smolensky’s classical analysis of EDCD. First, it is compatible with constraint promotion, which was argued to be necessary from a modeling perspective in section 4.2. Second, even though both analyses rest on the idealization of OT-compatible data, my analysis based on importing machine learning results into OT might allow us to take advantage of the extensive body of literature on the Perceptron robustness to noise, as I elaborate in the next section.

## 5 Conclusions

The peculiar notion of OT-compatibility (7) enforces *strict domination*, according to which the highest-ranked relevant constraint ‘‘takes it all.’’ Because of this property, OT does not seem to have any close correspondent within core machine learning. The current toolkit for computational OT thus consists mainly of combinatorial algorithms tailored to OT, developed with few connections to methods and results from machine learning. This classical approach to computational OT corresponds to the top horizontal arrow in the scheme (56).

In order to bridge this gap between computational OT and machine learning, various scholars have started to explore the alternative framework of HG, since HG can make use of well-established algorithms from machine learning, namely, algorithms for *linear classification*. Theorem 1 from section 3 now enters the scene. It says that machine learning algorithms for HG can actually be systematically translated into algorithms for OT according to the scheme (56). In step (56a), the set of ERCs **A** given with an instance RP(**A**) of the ranking problem is translated into a properly defined set **Ā** of derived EWCs, according to (29). In step (56b), machine learning algorithms for HG are used to solve the corresponding weighting problem WP(**Ā**)—namely, to find a weight vector **θ** HG-compatible with the data **Ā** or else to determine that the data are not HG-compatible. Finally, in step (56c), the solution to the derived weighting problem WP(**Ā**) is translated into a provably correct solution to the original ranking problem RP(**A**): if the former admits no solutions, then the latter is declared to admit no solutions either; otherwise, the weight vector **θ** that solves the former is translated through (30) into derived rankings that are guaranteed to solve the latter. This is the new algorithmic strategy anticipated in section 1. Thus, theorem 1 provides computational OT with a new toolkit.

To illustrate the fruitfulness of this new toolkit for computational OT, I presented a detailed application in section 4. Various modeling studies lend support to Boersma’s GLA implementation of the classical error-driven learning scheme (e.g., Boersma and Levelt 2000, Boersma and Hayes 2001, Curtin and Zuraw 2002). Yet Pater (2008) has shown that the GLA does not converge in the general case. Thus, one of the main open questions in computational OT is this: how could the GLA be modified in order to guarantee convergence, so as to retain its modeling virtues without sacrificing computational soundness? I have introduced a revised GLA, which differs from the original GLA because of a more careful calibration of the promotion component of the update rule. And I have offered a proof of the convergence of this revised GLA. The core idea of the proof is represented in (57).

Once ERCs are mapped to derived EWCs as in (29) and weight vectors to derived rankings as in (30), the revised GLA can be reinterpreted as an instance of the Perceptron algorithm. Convergence results for the Perceptron thus translate into convergence results for the revised GLA. The scheme (57) is thus a concrete illustration of the general approach (56). The potential of this new approach is further illustrated in the rest of this section, with a number of possible further applications left for future research.

As depicted in (57), the Perceptron can be translated into an error-driven OT ranking algorithm. But there is nothing special about the Perceptron. The reasoning developed in section 4 is completely general: any error-driven algorithm for HG can be translated into a corresponding error-driven ranking algorithm for OT. This new computational perspective thus greatly enriches the algorithmic tools at our disposal for implementing the error-driven learning scheme within OT. Let me make this point concrete with an example. All OT update rules considered so far in the literature, such as the original GLA’s update rule (35), the (gradual) EDCD’s update rule (36), and my revised GLA’s update rule (45), are *additive*, in the sense that winner-preferrers are promoted by adding to their current weight a small positive amount, and (undominated) loser-preferrers are demoted by adding to their current weight a small negative amount. These various implementations of the GLA thus correspond to the additive update rule (48) of the Perceptron: the weight vector is updated by adding component by component the EWC that is triggering the update. Another important class of HG error-driven algorithms has a multiplicative update rule (Kivinen, Warmuth, and Auer 1997). Here is an example. Suppose that the current weight vector **θ̄**^{old} = (*θ̄*^{old}_{1}, . . . , *θ̄*^{old}_{n}) is not HG-compatible with the current EWC **ā** = [*ā*_{1}, . . . , *ā*_{n}]. Then let the updated weight vector **θ̄**^{new} = (*θ̄*^{new}_{1}, . . . , *θ̄*^{new}_{n}) be defined component by component as in (58), where *Z* is the normalization coefficient and *η* is a properly chosen positive constant (called *plasticity* or *stepsize*). Modulo normalization, the updated weight *θ̄*^{new}_{k} is obtained by multiplying the current weight *θ̄*^{old}_{k} by the (exponential of the *η*-rescaled) corresponding entry *ā*_{k} in the current EWC.

The HG error-driven algorithm (47) with this multiplicative update rule (58) is called the *Winnow algorithm*. It is known to converge for a properly chosen stepsize *η* (Littlestone 1988).^{13}
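A multiplicative update in the style of (58) can be sketched as follows (the function name, the list encoding, and the illustrative stepsize η = 0.5 are my own choices; Winnow standardly assumes positive initial weights, e.g. the uniform vector):

```python
import math

def winnow_update(theta_bar, ewc, eta=0.5):
    """Multiplicative update in the style of (58): multiply each weight
    by exp(eta * a_k), then renormalize by Z so the weights sum to 1.
    The stepsize eta = 0.5 is only an illustrative choice (sketch)."""
    if sum(t * a for t, a in zip(theta_bar, ewc)) > 0:
        return None  # already HG-compatible: no update
    scaled = [t * math.exp(eta * a) for t, a in zip(theta_bar, ewc)]
    Z = sum(scaled)  # normalization coefficient
    return [s / Z for s in scaled]
```

Starting from the uniform vector (1/3, 1/3, 1/3) and feeding the EWC [1/2, 1/2, −1], the winner-preferring components grow and the loser-preferring component shrinks, with the weights still summing to 1 after normalization.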

The Winnow algorithm can now be ported from HG into OT in exactly the same way I adapted the Perceptron into the revised GLA in section 4. Suppose that the current weight vector **θ**^{old} = (*θ*^{old}_{1}, . . . , *θ*^{old}_{n}) entertained by the OT error-driven ranking algorithm (34) is not OT-compatible with the current ERC **a** = [*a*_{1}, . . . , *a*_{n}]. We map this ERC into a derived EWC **ā** = [*ā*_{1}, . . . , *ā*_{n}] using the mapping (29) devised in section 3: each L is replaced with −1 and each W is replaced with 1/*w*, where *w* is the total number of winner-preferrers. We then update the current weight vector in response to the ERC **a** according to the Winnow update rule (58) in response to the corresponding derived EWC **ā**—namely, as in (59). Modulo normalization, the weight corresponding to winner-preferrers is multiplied by the (exponential of the *η*-rescaled) promotion coefficient 1/*w*; and the weight of loser-preferrers is multiplied by the (exponential of the *η*-rescaled) demotion coefficient −1. Let me dub the OT error-driven algorithm (34) with this update rule (59) the *multiplicative* GLA.

By reasoning as in section 4, I conclude that a run of the multiplicative GLA on a set of input ERCs can be mimicked by a run of the Winnow algorithm on the set of corresponding EWCs. Convergence of the Winnow algorithm thus translates into convergence of the multiplicative GLA. The reasoning presented so far can be summarized with the diagram in (60), which is one more instance of the new algorithmic strategy (56).

The Perceptron additive update rule (48) and the Winnow multiplicative update rule (58) have been compared extensively in the machine learning literature, with the two update rules outperforming each other on different types of data sets (Kivinen, Warmuth, and Auer 1997). Even though the additive update rule of the GLA has become very popular in the computational OT literature, substantial computational and modeling work will be needed in order to determine whether ranking algorithms based on the additive Perceptron fare better than ranking algorithms based on the multiplicative Winnow or vice versa. These new computational developments might lead to new, improved tools for modeling the acquisition of sound patterns in OT.

Most of the OT computational literature has focused so far on an idealized learning scenario, whereby the data are assumed to be consistent (no *noise*) and learning is assumed to be categorical (no *variation*). Tesar and Smolensky’s (1998) analysis of EDCD crucially relies on the assumption that the data are compatible with an OT grammar—that is, that the data have not been corrupted by noise and do not display variation. And the convergence proof for the revised GLA presented in section 4 crucially relies on the idealized assumption of consistent data too. Boersma’s (1997, 1998) GLA is purported in the literature to be robust to noise and to be able to model variation (in its stochastic version), but no analytical results are currently available. In conclusion, very little is currently known concerning the robustness of the error-driven learning model in more realistic learning scenarios. The algorithmic tools developed in this article have the potential to lead to substantial progress in this direction, as sketched in (61).

As seen in section 4, a revised version of the GLA can be interpreted as the classical Perceptron algorithm for HG. A very large body of literature has studied the computational properties of the Perceptron algorithm (see, e.g., Cesa-Bianchi and Lugosi 2006:chap. 12). In particular, robustness to noise on the part of the Perceptron as well as a number of variants thereof has been thoroughly investigated (see, e.g., Freund and Schapire 1999 (building on Klasner and Simon 1995) and Shalev-Shwartz and Singer 2005, as well as Khardon and Wachman 2007 for a review and experimental results). The reinterpretation of the GLA in terms of the Perceptron might thus allow current machine learning results on the Perceptron’s robustness to be translated into the first analytical results on the GLA’s robustness.

Throughout the second part of this article, I have focused on *error-driven* ranking algorithms. These are algorithms for the ranking problem RP(**A**) that ‘‘work by row’’—in other words, look at **A** one ERC at a time. On the contrary, *batch*-ranking algorithms ‘‘work by column’’ and thus need to look at the entire set of ERCs **A** at once. Tesar (1995) and Tesar and Smolensky (1998) develop an efficient batch-ranking algorithm, called *Constraint Demotion* (CD). In Magri 2012b, I note that CD for the ranking problem in OT ‘‘corresponds’’ to the classical *Fourier-Motzkin Elimination Algorithm* (FMEA) for the weighting problem in HG (see, e.g., Bertsimas and Tsitsiklis 1997:70–74), in the sense that the scheme in (62) holds: if we map ERCs into derived EWCs as in (29) and weight vectors into derived rankings as in (30), then CD turns into a special application of the FMEA.

This reinterpretation of CD in terms of the FMEA might turn out to be useful from the following perspective. CD works roughly as follows: it starts building the ranking from the top, and it assigns to the highest available slot in the ranking a constraint that is never loser-preferring among those ERCs that are not already accounted for by constraints already assigned to a higher position in the unfolding ranking. At the first iteration, CD thus assigns to the top slot a constraint that is never loser-preferring; at the next iteration, it assigns to the next highest slot a constraint that is loser-preferring only in ERCs where the first constraint is winner-preferring; and so on. At a certain iteration, the algorithm might have to choose among different constraints, each of which could be assigned to the highest currently available slot. Hayes (2004) and Prince and Tesar (2004) have suggested that in such cases, the algorithm should not choose at random, but instead should choose according to specific principles heuristically designed in order to *bias* CD toward a final ranking that ranks the faithfulness constraints as low as possible. Reinterpreting CD as a special case of the FMEA allows us to obtain a more general variant of CD, whereby the algorithm is no longer required to loop through the constraints in the order dictated by how they become available for ranking. In other words, it allows us to develop variants of CD, say, where the first constraint ranked by the algorithm is not necessarily one that is never loser-preferring, so that we do not necessarily have to start constructing the ranking from the top. These developments might lead to a more efficient implementation of Hayes's (2004) and Prince and Tesar's (2004) heuristics for restrictiveness.
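The top-down procedure just described can be sketched in a few lines of Python. This is an illustrative rendering, with ERCs encoded in the standard W/L/e notation as dictionaries from constraint names to 'W' (winner-preferring), 'L' (loser-preferring), or 'e' (even); the constraint names are hypothetical.

```python
def rcd(ercs, constraints):
    """Sketch of Constraint Demotion: build a stratified ranking top-down.
    Returns a list of strata (sets of constraint names), or None if no
    OT ranking is consistent with the input ERCs."""
    strata, remaining, unranked = [], list(ercs), set(constraints)
    while unranked:
        # Constraints never loser-preferring in the ERCs still unaccounted for.
        rankable = {c for c in unranked
                    if all(erc[c] != 'L' for erc in remaining)}
        if not rankable:
            return None  # inconsistent data: no constraint can be placed next
        strata.append(rankable)
        unranked -= rankable
        # ERCs with a W in the newly placed stratum are now accounted for.
        remaining = [erc for erc in remaining
                     if not any(erc[c] == 'W' for c in rankable)]
    return strata

# Toy data: C1 must dominate C2, and C2 must dominate C3.
ercs = [{'C1': 'W', 'C2': 'L', 'C3': 'e'},
        {'C1': 'e', 'C2': 'W', 'C3': 'L'}]
strata = rcd(ercs, ['C1', 'C2', 'C3'])
```

The choice point mentioned in the text arises inside the `rankable` set: when it contains several constraints, this sketch places them all in one stratum, whereas the biased variants of Hayes (2004) and Prince and Tesar (2004) would select among them according to their restrictiveness heuristics.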

## Notes

I wish to thank Adam Albright for lots of help. The article has greatly benefited from comments by an anonymous *LI* reviewer. I also wish to thank Paul Boersma, Bruce Tesar, and Paul Smolensky for useful conversations on the material presented here. Earlier versions of this article have been presented at NECPhon 2 (Yale University, 15 November 2008), at DGfS 31 (Osnabrück, 6 March 2009), and at WCCFL 29 (University of Arizona, 23 April 2011); I wish to thank the audiences at those venues for useful discussion. Some of the material presented here has appeared as Magri 2011a. This work was supported in part by a ‘‘Euryi’’ grant from the European Science Foundation (‘‘Presupposition: A Formal Pragmatic Approach’’ to Philippe Schlenker).

^{1} A framework close in spirit to OT was popular in the *operations research* literature in the 1970s (Fishburn 1974). Indeed, Tesar and Smolensky’s (1998) Constraint Demotion algorithm was recently rediscovered within this literature (Dombi, Imreh, and Vincze 2007). The OT framework might also have connections with the machine learning field of *preference learning* (Fürnkranz and Hüllermeier 2010), although these possible connections have yet to be explored.

^{2} The supplementary online materials for this article are available at http://www.mitpressjournals.org/doi/suppl/10.1162/ling_a_00139.

^{3} I have chosen this name because it parallels *elementary ranking condition*, which has become the standard name in the literature for the analogous notion within OT; see (15) below. Bane, Riggle, and Sonderegger (2010) call EWCs *difference vectors*.

^{5} As discussed in detail in Tesar 2007, only holds as long as we consider a *finite* set of data triplets. Indeed, if the set is infinite, there might not exist any bound on the number of constraint violations, so that the constant Δ might not be defined.

^{6} The idea of a numerical representation of rankings is actually implicitly already present in Tesar and Smolensky’s (1998) notion of the *offset* of a constraint with respect to a ranking, defined as the number of strata above that constraint in that ranking.

^{7} The variant of the demotion-only update rule (36) that demotes all loser-preferrers (both the currently undominated and the dominated ones) converges too, but can require a very large number of updates (exponential in the number of constraints); see Magri 2009 for details. This suggests that it is always a good idea to restrict demotion to only the currently *undominated* loser-preferrers, rather than demoting all loser-preferrers.

^{8} Pater’s (2008) counterexample consists of the second set of ERCs in (55a) below.

^{9} Consider for instance the case of language (39a). In order for /ba/ not to be neutralized to [pa], the ranking (40ai) is needed. In order for /za/ to be neutralized to [sa], the ranking (40aii) is needed. Given ranking (40aii), in order for /abza/ not to be neutralized to [apsa], the ranking (40aiii) is needed. Furthermore, given the ranking (40aii), in order for /abza/ not to be neutralized to [absa], the ranking (40aiv) is needed too.

^{10} A remark on the issue of the nonnegativity of the weights is in order here. In the description of HG in section 2, I introduced the restriction (4) that the weights be nonnegative, in order to prevent undesired typological predictions. This restriction requires algorithms for HG to enforce nonnegativity of the weights. This is not trivial in the case of the Perceptron, as nothing in the design of the update rule (48) enforces nonnegativity of the weights. The recent HG computational literature usually tries to get around this problem by starting out with large positive weights (see, e.g., Jesney and Tessier 2009). But this trick does not guarantee that the weights will stay nonnegative at every iteration until convergence, as the number of updates depends on the size of the initial weights. Furthermore, cutting off the weights at zero jeopardizes the Perceptron convergence theorem, recalled in appendix E of the online supplementary materials. Strictly speaking, the Perceptron is thus not an algorithm for standard HG; rather, it is an algorithm for a variant of HG without the nonnegativity restriction (4). See footnote 13 for a better error-driven algorithm for standard HG that maintains the nonnegativity of the weights in a principled way.

^{11} Strictly speaking, this is not completely correct. In fact, in its current formulation requires the weights to be nonnegative, which is not necessarily the case for the current weights entertained by the (original or revised) GLA or by the Perceptron; see footnote 10. Yet, as noted in appendix C of the online supplementary materials, holds also without the nonnegativity restriction on the weights, provided that the input ERCs all have a unique L. And that is indeed the case for the set of input ERCs in (46a). The case of input ERCs with an arbitrary number of L’s is addressed in appendix D.2 of the online supplementary materials, where the reasoning sketched in this section is formalized.

^{12} More precisely, I am using here the formulation of provided in appendix D.1 of the online supplementary materials, which applies here because each input diagonal ERC has a unique L.

^{13} As noted in footnote 10, the Perceptron is strictly speaking not an algorithm for HG, as it does not ensure that the current weights stay nonnegative, even if the initial weights are large and positive. The Winnow algorithm instead maintains the nonnegativity of the weights at any iteration, provided that it is initialized with nonnegative weights. Winnow is thus a better algorithm for HG than the Perceptron, even though the Perceptron is more widely used in the HG computational literature.
