## Abstract

The consistency problem models language learning as the problem of finding a grammar consistent with finite linguistic data. The subset problem refines that formulation, asking for a consistent grammar that generates a smallest language. This article reviews results concerning the tractability of the consistency problem within Optimality Theory (OT) and shows that the OT subset problem is instead intractable. The subset problem thus needs to be restricted to plausible typologies, and solution algorithms need to take advantage of the additional structure brought about by these typological restrictions. These implications are illustrated with a discussion of the choice between batch and error-driven models of the child’s acquisition of phonotactics.

## 1 Introduction

This section introduces informally the main formulations of the language-learning problem considered in this article, previews the main results concerning their complexity, and outlines the implications of these results for modeling child language acquisition.

### 1.1 Results on the Complexity of the OT Consistency Problem

Generative linguistics assumes that the learner is provided with a typology E of possible grammars G1, G2, . . . . Each grammar G is a formal device that generates a corresponding language L(G). The language-learning task consists of reconstructing the target adult grammar within the typology, on the basis of a finite set of data generated by that grammar. To start, it makes sense to look for the target grammar among those grammars that are at least consistent with the data, as captured by the consistency problem informally stated in (1). Various formulations of the problem considered in the literature differ with respect to the structural properties of the underlying typology E provided as input (1ai), the type of data provided as input (1aii), and the notion of consistency between grammars and data used in the statement (1b) of the goal of the problem.

(1)

• a. Given:

  • i. complete information about a typology E,

  • ii. a finite data set D generated by a target grammar in the typology E;

• b. Return: a grammar G in the typology E consistent with the data set D.

Section 2 formalizes the consistency problem (1) within Optimality Theory (OT; Kager 1999, Prince and Smolensky 2004) and reviews what is currently known concerning its computational complexity. Here is a preview of the results presented.

The linguistic data generated by the target grammar and provided as the input (1aii) of the consistency problem could have been corrupted by noise or transmission error. Because of this, no grammar in the typology might actually be consistent with each single piece of data. In this case, a plausible formalization of the goal (1b) of the consistency problem asks for a grammar consistent with most of the data. Let me call this formulation of the problem strong—to distinguish it from an alternative weak formulation that I will introduce below. As stated in theorem 1, this strong formulation of the OT consistency problem is intractable. This means that no algorithm is able to solve an arbitrary instance of the problem efficiently—namely, in a number of computational steps that grows slowly (i.e., polynomially) with the complexity of that instance, properly defined.

Theorem 1.

The strong OT consistency problem (14) in section 2 is intractable.

This theorem is a restatement of results from the machine learning literature on preference learning (e.g., Galil and Megiddo 1977, Cohen, Schapire, and Singer 1999).

Prompted by this intractability result, I thus turn to an alternative weak formulation of the goal (1b) of the OT consistency problem. In the case of consistent data, both formulations ask for a grammar consistent with each single piece of data. The two formulations part ways in the case of inconsistent data: the strong formulation asks for a grammar consistent with most of the data, while the weak formulation only requires the learner to detect the inconsistency, without returning any grammar. Tesar (1995), Tesar and Smolensky (1998), and Tesar and Smolensky (2000:chap. 7) (henceforth T&S) show that such a weakening of the problem statement changes its complexity and makes it tractable, as stated in theorem 2. This means that the weak OT consistency problem admits an efficient solution algorithm, namely, an algorithm that solves any instance of the problem in a number of steps that grows slowly (i.e., polynomially) with the complexity of that instance, so that the algorithm remains feasible even for instances of high complexity (say, with a large number of constraints, a large number of candidates, and a large number of data points).

Theorem 2.

The weak OT consistency problem (17) in section 2 is tractable.

This result is remarkable, because it establishes tractability for a formulation of the OT consistency problem that does not impose any restriction on the underlying OT typologies provided as input (1ai). A solution algorithm thus has to work efficiently both in the case of typologies that are linguistically well-motivated and in the case of pathological typologies that have no linguistic plausibility. This means in turn that a solution algorithm is allowed to avail itself only of the ‘‘structure’’ provided by the core ranking logic of OT (transitivity, domination, etc.), not of any additional structure that might be provided by whatever properties distinguish between plausible and implausible OT typologies.

### 1.2 Results on the Complexity of the OT Subset Problem

The consistency problem informally stated in (1) represents a crucial component of the language-learning task. Indeed, the learned grammar should be consistent with the data. Yet a large literature has argued that it does not exhaust the language-learning task. In fact, many alternative grammars might be consistent with the data, so that some heuristics are needed in order to choose among them. One such heuristic is informally stated in the following passage from Fodor and Sakas 2005:516–517:

(2) ‘‘Choosing too broad a generalization can be fatal, since by . . . assumption . . . the learning mechanism lacks sufficient negative data to guide retreat from an incorrect superset choice. . . . In particular, there is little evidence that learners get trapped in superset hypotheses. . . . The conclusion must be that children have some means of either avoiding or curing superset errors. Many learning models invoke [the Subset Principle] for this purpose: it is intended to prevent superset errors by prohibiting the learning mechanism from hypothesizing a more inclusive language than is warranted by the evidence. . . . An informal version . . . is that the learning mechanism must never hypothesize a language which is a proper superset of another language that is equally compatible with the available data. . . . The effect of this is to guarantee that if a wrong language is hypothesized, that language is either a subset of the target or intersects with the target, and in either case there will be at least one target language sentence that the learner could encounter which is not licensed by the wrong grammar and so could serve as a trigger for change to an alternative hypothesis.’’

The argument illustrated in (2) is usually called the Subset Principle. Its relevance for the acquisition of syntax has been widely acknowledged in the literature (see, e.g., Angluin 1980, Berwick 1985, Manzini and Wexler 1987, Safir 1987, Wexler and Manzini 1987, Clark 1992, as well as Fodor and Sakas 2005 for review). Dell (1981) notes that the case of phonology is rather different (but see Hale and Reiss 2003). In fact, phonological alternations implicitly provide the negative evidence that is missing in the case of syntax. The bare consistency problem thus provides a plausible formalization of the learning task, as long as the learner is provided with a sufficiently rich set of alternations. In order to motivate the subset problem in the case of phonology, one thus needs to look at special learning circumstances. Section 4 discusses one such circumstance, related to the early stage of the child’s acquisition of phonotactics (Hayes 2004, Prince and Tesar 2004).

The logic of the Subset Principle requires the consistency problem (1) to be strengthened through the addition of the subset condition (3bii): no other consistent grammar G′ generates a subset language. The problem informally stated in (3) is called the subset problem: induce a grammar in the typology consistent with a given set of linguistic data that furthermore generates a smallest language. Various formulations of the problem considered in the literature differ with respect to the structural properties of the underlying typology E provided as input (3ai), the type of data provided as input (3aii), the notion of consistency between grammars and data used in the statement of the consistency condition (3bi), and the generative mechanism used to associate a grammar G with its language L(G), as required in the statement of the subset condition (3bii).

(3)

• a. Given:

  • i. complete information about a typology E,

  • ii. a finite data set D generated by a target grammar in the typology E;

• b. Return: a grammar G in the typology E such that

  • i. G is consistent with the data set D,

  • ii. there is no other consistent grammar G′ in the typology such that L(G′) ⫋ L(G).
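The statement in (3) can be read as an exhaustive search, sketched below in Python. This is only an illustration under strong assumptions that are not part of the text: the typology is given as an explicit finite table from hypothetical grammar names (`G1`, `G2`, `G3`) to the finite languages they generate, and a grammar counts as consistent when every datum belongs to its language. The search inspects every grammar in the table, which is the kind of enumeration that stops being feasible once the typology is large.

```python
from typing import Dict, FrozenSet, Iterable, List

# Assumed encoding: grammar name -> language L(G) as a finite set of forms.
Typology = Dict[str, FrozenSet[str]]


def consistent(language: FrozenSet[str], data: Iterable[str]) -> bool:
    # Assumed notion of consistency: every datum belongs to the language.
    return all(d in language for d in data)


def subset_solutions(typology: Typology, data: List[str]) -> List[str]:
    """Return every grammar that satisfies both (3bi) and (3bii)."""
    # (3bi): keep only the grammars consistent with the data.
    candidates = [g for g, lang in typology.items() if consistent(lang, data)]
    return [
        g for g in candidates
        # (3bii): no other consistent grammar generates a proper subset.
        # On frozensets, `<` is the proper-subset test.
        if not any(typology[h] < typology[g] for h in candidates)
    ]


# Hypothetical three-grammar typology, for illustration only.
TOY = {
    "G1": frozenset({"ta"}),
    "G2": frozenset({"ta", "da"}),
    "G3": frozenset({"ta", "da", "rad"}),
}

print(subset_solutions(TOY, ["da"]))
```

In this toy table both `G2` and `G3` are consistent with the datum `da`, but only `G2` survives the subset condition, because its language is properly included in that of `G3`.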

Section 3 formulates the subset problem within OT and presents new results on its computational complexity, which can be informally summarized as follows.

Of course, the subset problem (3) is at least as hard as the consistency problem (1), as it enforces the additional subset condition (3bii). But how much harder is it? Suppose the data provided as input (3aii) are inconsistent. By theorem 1, we cannot interpret the consistency condition (3bi) as requiring consistency with the largest subset of consistent data. By theorem 2, we need to settle for a less demanding task: the simple detection of the inconsistency. And in this case, the additional subset condition (3bii) has no bite. Thus, I am only interested in comparing the complexity of the consistency and subset problems when the data are consistent. Theorem 3 shows that the consistency and subset problems in OT have a completely different complexity. T&S’s theorem 2 says that the weak OT consistency problem is tractable: there exists a solution algorithm that solves an arbitrary instance of the problem in a number of steps that grows slowly with the complexity of the instances. On the contrary, theorem 3 says that, even with the restriction to consistent data, the OT subset problem is intractable: any algorithm that runs in a number of steps that grows slowly with the complexity of the task will fail on some instances of the problem.

Theorem 3.

The OT subset problem (33) in section 3 is intractable.

A small (large) language is likely to arise when the faithfulness (markedness) constraints are ranked low. Prince and Tesar (2004) thus consider an alternative formulation of the subset problem, which asks for a consistent grammar that ranks the faithfulness constraints as low as possible. I show that the latter formulation is intractable as well, as stated in theorem 4. The latter two theorems, 3 and 4, are the main results developed in the article.

Theorem 4.

Prince and Tesar’s (2004) alternative formulation of the OT subset problem (34) in section 3 is intractable.

The latter two theorems are proven by showing that the strong formulation of the OT consistency problem can be reduced to the OT subset problem, in the sense that, for each instance of the strong OT consistency problem, I can devise a derived instance of the OT subset problem in such a way that, if I knew how to solve the derived instance of the OT subset problem, then I would straightforwardly get a solution for the original instance of the strong OT consistency problem. This means in turn that, if the OT subset problem were tractable, then the strong OT consistency problem would be tractable too. As the latter problem is intractable by theorem 1, I conclude that the OT subset problem is intractable. This line of reasoning is formalized with elementary tools from Complexity Theory, which are recalled for completeness in appendix A of the supplementary online materials for this article.

### 1.3 Implications for Modeling Child Language Acquisition

What is the proper interpretation of the intractability of the OT subset problem established by theorems 3 and 4? What are the implications for OT modeling of child language acquisition? Intractability results within OT have often been interpreted as challenges against the OT framework itself. I submit that the intractability of the OT subset problem should not be interpreted this way, as it plausibly has nothing to do with the choice of the grammatical framework, in the sense that an analogous result plausibly holds for the corresponding problem within alternative frameworks, such as derivational frameworks or Harmonic Grammar (HG; see Legendre, Miyata, and Smolensky 1998a,b). Indeed, the intractability of the OT subset problem will be shown to follow from the intractability of the strong OT consistency problem, as anticipated above. And it is well-known that the intractability of the strong consistency problem carries over to alternative frameworks such as HG (Johnson and Preparata 1978), suggesting that the intractability argument for the subset problem should carry over to alternative frameworks as well. I thus submit that the intractability result provided by theorems 3 and 4 captures the intrinsic computational difficulty posed by the subset condition, independently of the grammatical framework considered. From this perspective, this intractability result explains Fodor and Sakas’s (2005:518) anecdotal observation that ‘‘to the best of our knowledge, there is no extant learning model which succeeds in implementing the Subset Principle while remaining within reasonable human resource limits. Either . . . problems [related to the Subset Principle] are glossed over . . . , or the solutions proposed would not be realistic within a psychological model.’’

I submit that the relevance of the intractability result provided by theorems 3 and 4 lies not with its implications for framework selection (OT vs. HG vs. derivational frameworks); rather, it lies with its implications for computational modeling of child language acquisition. The core idea is nicely stated as follows by Barton, Berwick, and Ristad (1987:4):

(4) ‘‘[Complexity Theory] can help pinpoint the exact way in which our formalized systems seem to allow too much latitude. . . . Especially deserving of closer scrutiny are formal devices that can express problems requiring blind, exhaustive, and computationally intractable search for their solution. Informally, such computationally difficult problems don’t have any special structure that would support an efficient solution algorithm, so there’s little choice but brute force, trying every possible answer combination until we find one that works. Thus, it’s particularly important to examine features of a framework that allow such problems to be encoded—making sure there’s not some special structure to the natural problem that’s been missed in the formalism. In fact problems that require combinatorial search might well be characterized as unnaturally hard problems. . . . There is every reason to believe that natural language has an intricate computational structure that is not reflected in combinatorial search methods. Thus, a formalized problem that requires such search probably leaves unmentioned some constraints of the natural problem.’’

Indeed, it is not hard to guess what is missing in the formulation of the subset problem sketched in (3) and formalized below in section 3. According to that formulation, an alleged solution algorithm is required to work for an arbitrary typology provided as input (3ai) to the problem. In the case of OT, that means (essentially) that a solution algorithm is required to work for an arbitrary constraint set. This is a common feature of the computational problems considered in the OT computational literature. For instance, Eisner (2000:23) writes, ‘‘[W]e follow Tesar and Smolensky (2000) in supposing that the learner already knows the correct set of constraints C. The assumption follows from the OT philosophy that C is universal across languages, and only the [rankings] of constraints differ. The algorithms for learning a ranking, however, are designed to be general for any C, so they take C as an input. That is, these methods are not tailored (as others might be) to exploit the structure of some specific, putatively universal [constraint set] C.’’ This strategy of letting the constraint set (or, equivalently, the underlying typology) vary arbitrarily as an input to the problem is justified in certain cases. The weak OT consistency problem is one such case, as shown by T&S’s theorem 2: the task is easy enough that the structure provided by the bare OT framework (transitivity, domination, etc.) is sufficient to support efficient solution algorithms. Theorems 3 and 4 say that the task is much more demanding in the case of the OT subset problem and that the structure provided by OT’s bare ranking logic is not sufficient. Further structure needs to be introduced into the problem, by restricting the typologies provided as input (3ai) to those that are linguistically plausible. Moreover, efficient solution algorithms need to take advantage of the additional structure brought about through these linguistically motivated restrictions.
This intractability result thus motivates a rather new approach to computational OT, focused on distilling the algorithmic implications of broad linguistic generalizations.

Section 4 illustrates these considerations, by discussing an application to modeling the child’s acquisition of phonotactics at an early stage (around nine months), when morphology is still lagging behind. The subset problem has been argued to offer a proper formalization of the learning challenge faced by a child in this early developmental stage (Hayes 2004, Prince and Tesar 2004). Two competing algorithmic schemes have been considered for modeling this early acquisition stage within OT. One approach is based on error-driven ranking algorithms, such as T&S’s Error-Driven Constraint Demotion and Boersma’s (1998) Gradual Learning Algorithm. Informally, these algorithms start from a restrictive initial grammar, are trained on a stream of data from the target adult language, and keep slightly reranking the constraints whenever they make a mistake on the current piece of data. Error-driven learning has been endorsed by the OT acquisition literature, especially because the predicted sequences of intermediate rankings can be matched with attested acquisition paths, thus modeling the gradualness of child acquisition. Yet the only provision toward restrictiveness of error-driven learning consists of the choice of a restrictive initial grammar. One hopes that throughout learning, this grammar will be enlarged just as much as needed, so that the final grammar will be restrictive enough to solve the instance of the subset problem posed by this early child acquisition stage. But many authors in the OT computational literature have doubted that this could indeed be the case. They have suggested that the behavior of an error-driven learning algorithm is mainly determined by the stream of data, so that the algorithm feels like a leaf in the wind of data, with few guarantees about the quality of its final grammar. 
The OT computational literature has thus developed an alternative modeling approach based on batch ranking algorithms, such as Prince and Tesar’s (2004) Biased Constraint Demotion and Hayes’s (2004) Low Faithfulness Constraint Demotion. These algorithms are more powerful because they are allowed to glimpse the entire set of data at once, unlike error-driven algorithms. And they have powerful built-in provisions toward restrictiveness, which are enforced at each iteration of the algorithm.

There is thus a tension between the OT acquisition and computational literature. The former has focused on phenomenological properties of child language acquisition such as acquisition paths, and has thus endorsed error-driven learning. The latter has focused instead on learnability issues such as the Subset Principle, and has dismissed error-driven learning as algorithmically too weak. The intractability result provided by theorems 3 and 4 now enters the scene, and it might provide a way to reconcile these two opposite perspectives. This result says that it is unfair to hold against error-driven learning the fact that it is too weak to enforce restrictiveness in the general case. In fact, the intractability result shows that this is indeed the case for any algorithm, no matter whether it is batch or error-driven. It is just impossible to construct an efficient algorithm powerful enough to enforce restrictiveness in the general case. No matter which algorithmic scheme we choose, it will have to crucially rely on the specific, special structure introduced by carefully stated restrictions on phonologically plausible constraint sets. It could indeed turn out that those restrictions introduce structure of a very special sort, which error-driven ranking algorithms are particularly well-suited to exploit, thus boosting their algorithmic strength. I review from Magri 2011, 2012b some initial evidence that this might indeed be the case.

## 2 On the Complexity of the Consistency Problem in OT

A computational problem in language learning is a mapping of an input (5a) consisting of certain linguistic information into an output (5b) consisting of a grammar that satisfies certain conditions (see appendix A in the online supplementary materials for more details).

(5)

• a. Given: linguistic information (typological information, linguistic data, . . . )

• b. Return: a grammar satisfying certain conditions (consistency, Subset Principle, . . . )

• c. Size: a number that quantifies the complexity of the task, depending on the size of the typology, the number of data points, . . .

The complexity of a computational problem is measured in terms of the smallest number of computational steps it takes to solve an instance of the problem. Of course, the number of steps depends on the ‘‘size’’ of each instance: more steps should be allowed for instances of larger size. If the problem admits an algorithm that solves any instance of the problem in a number of steps that grows slowly (i.e., polynomially) with the size of the instances, then the problem is called tractable and that solution algorithm is called efficient. If instead the problem admits no efficient solution algorithm, then it is called intractable. Of course, the complexity of a problem depends on how the size of its instances is measured. Thus, the statement of a computational problem needs to be completed with the definition of the size (5c) of its instances (that was omitted in the informal discussion of section 1).
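To get a feel for the difference between slow (polynomial) growth and the growth of an exhaustive search, the sketch below compares a polynomial step count with the number of linear orders over n constraints, which is n!. The choice of n² as the polynomial is arbitrary and only for illustration; the function names are likewise hypothetical.

```python
import math


def polynomial_steps(n: int) -> int:
    # Arbitrary illustrative polynomial bound (an assumption, not from the text).
    return n ** 2


def number_of_rankings(n: int) -> int:
    # There are n! linear orders over a set of n constraints.
    return math.factorial(n)


for n in (3, 10, 20):
    print(n, polynomial_steps(n), number_of_rankings(n))
```

Already with 20 constraints, an algorithm that tries every linear order would face more than 10^18 of them, while the polynomial bound stays at 400.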

This section formulates the consistency problem within OT as an explicit computational problem (5). It then reviews what is currently known concerning the computational complexity of that problem. In particular, section 2.1 defines the input (1a) to the OT consistency problem. Section 2.2 defines the size (5c) of its instances. And sections 2.3 and 2.4 consider two possible definitions of the output condition (1b), which yield the strong and weak formulations of the problem. The complexity of these two formulations is characterized by theorems 1 and 2, anticipated in section 1. The proof of the former theorem is postponed to appendix B of the online supplementary materials. T&S’s proof of the latter theorem is presented in detail in sections 2.5 and 2.6.

### 2.1 Input to the Consistency Problem

An OT typology is defined in terms of a triplet (X, Gen, C) of universal specifications. The first element of the triplet is a set X of underlying forms. The second element is a generating function Gen that maps an underlying form x ∈ X into a set Gen(x) of candidates. The third element is a set C of constraints, which assign (nonnegative) numbers of violations to pairs of an underlying form and a corresponding candidate. The union of all candidate sets is also called the set of surface forms. An example is provided in (6). The set of underlying forms consists of voiced and voiceless obstruents in onset and coda position. The generating function modifies obstruent voicing. And the constraint set contains faithfulness and markedness constraints related to obstruent voicing.

(6)

• a.

• b. Gen(/ta/) = Gen(/da/) = {[ta], [da]}

• c. C = {Ident[voice]/onset, Ident[voice], *[+voice, −sonorant]}

I assume that the typological information provided as input (1ai) to the OT consistency problem consists of a triplet (X, Gen, C) of universal specifications, as just described.

Given the universal specifications (X, Gen, C), an OT grammar in the corresponding typology is a function that maps an underlying form x into a corresponding candidate y ∈ Gen(x). The linguistic data provided as input (1aii) to the OT consistency problem thus consist of a set D containing a finite number of pairs (x, y) of an underlying form x and a corresponding candidate y ∈ Gen(x). Two examples of data sets corresponding to the typological specifications in (6) are provided in (7).

(7)

• a. D = {(/da/, [da]), (/rat/, [rat])}

• b. D = {(/da/, [da]), (/rad/, [rat])}

Being exposed to instances of the surface forms [da] and [rat], the learner will conclude that these forms are realized faithfully and thus will posit the data set (7a). Or perhaps the learner had access to some alternations and has thus noticed that the target grammar enforces final devoicing, leading to the data set (7b).

In conclusion, the typological and linguistic information provided as input to the consistency problem can be formalized within the OT framework as in (8a).

(8)

• a. Given:

  • i. an OT typology, described through universal specifications (X, Gen, C),

  • ii. a finite data set D of underlying forms and corresponding candidates;

• b. Return: . . .

• c. Size: . . .

Following Barton, Berwick, and Ristad (1987:secs. 1.4.4, 2.3) and Heinz, Kobele, and Riggle (2009), the formulation of the OT consistency problem developed in this section is universal, because no restrictions are imposed on the universal specifications provided as input (8ai). This formulation encompasses linguistically plausible typologies as well as implausible ones. Such an unrestrictive formulation of the problem means in turn that a solution algorithm can rely only on the structure provided by the bare OT framework (transitivity, domination, etc.), not on the structure that would have been provided by the restriction to special classes of universal specifications. This assumption will turn out to be warranted, as the structure provided by the bare OT framework is sufficient (at least for the weak formulation of the consistency problem), without any need for further structure provided by typological restrictions. The main result of this article is that the situation is rather different in the case of the OT subset problem: the subset condition makes the problem far more demanding, requiring further structure besides that provided by the bare OT framework.

### 2.2 Size of the Consistency Problem

I have explicitly assumed finiteness of the data set D provided as input (8aii) to the OT consistency problem. Let me now assume that the OT typology (X, Gen, C) provided as input (8ai) is finite as well. For the case of the consistency problem, I formalize the latter finiteness assumption as in (9); a stronger formulation will be needed later on in the case of the subset problem.

(9)

• a. The constraint set C is finite.

• b. The candidate set Gen(x) is finite, for every underlying form x that appears in a pair (x, y) in the data set D.

Assumption (9a), that there is only a finite number of constraints, is fairly standard in the OT computational literature. Assumption (9b), that candidate sets are finite, is not standard, but it can be easily motivated as follows. To start, if the constraint set is finite, then any OT grammar can distinguish only among a finite number of candidates for any given underlying form. Thus, even if the candidate set is infinite, the candidates will behave according to a finite number of equivalence classes. Furthermore, my discussion here is geared toward the claim that the OT subset problem is intractable. As the latter is a negative result, it can only be strengthened by adding restrictions on the underlying typology.

Given assumption (9b), that candidate sets are finite, let the cardinality of the generating function Gen on a data set D be the number |Gen(D)| defined in (10) as the cardinality of the largest candidate set over all underlying forms that appear in D.

(10) $|\mathrm{Gen}(D)| = \max_{(x, y) \in D} |\mathrm{Gen}(x)|$

For instance, the cardinality of the generating function defined in (6b) relative to either of the data sets in (7) is 2, as the candidate sets corresponding to the underlying forms /da/, /rad/, and /rat/ all have cardinality 2.

I assume in (11) that the size of an instance of the OT consistency problem depends on three parameters: the cardinality |C| of the constraint set, the cardinality |D| of the data set, and the cardinality |Gen(D)| of the generating function (on the data set D).

(11)

• a. Given:

  • i. an OT typology (X, Gen, C) that satisfies the finiteness assumption (9),

  • ii. a finite data set D of underlying forms and corresponding candidates;

• b. Return: . . .

• c. Size: the maximum among |C|, |D|, and |Gen(D)|.

It is uncontroversial that the size of a given instance of the problem should depend on the cardinality |C| of the constraint set and on the cardinality |D| of the data set. It is more delicate to let it also depend on the number of candidates |Gen(D)|. This means that a solution algorithm is allowed to take the time to list and inspect all candidates. The potential difficulty with this assumption is that |Gen(D)| could be very large, potentially exponential in the number of constraints |C|; thus, letting the size of an instance of the problem depend on |Gen(D)| might make the problem too easy, by loosening up too much the tight dependence on the number of constraints |C|. But this difficulty is only apparent: the universal formulation of the problem requires a solution algorithm to work for any constraint set and any generating function, and thus also for cases where the number of constraints is large but the cardinality of the generating function is small.
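The size measure in (11c) can be computed directly from an instance. The sketch below assumes a dictionary-based encoding of Gen and a list of (underlying form, candidate) pairs for the data set; these encodings and the function names are illustrative choices, not part of the article. The candidate sets for /rat/ and /rad/ are also assumed, since (6b) only lists those for /ta/ and /da/ (the text does state that they have cardinality 2).

```python
def gen_cardinality(gen, data):
    """|Gen(D)|: size of the largest candidate set among the underlying forms in D."""
    return max(len(gen[x]) for x, _ in data)


def instance_size(constraints, gen, data):
    """Size as in (11c): the maximum among |C|, |D|, and |Gen(D)|."""
    return max(len(constraints), len(data), gen_cardinality(gen, data))


GEN = {
    "ta": {"ta", "da"},
    "da": {"ta", "da"},
    "rat": {"rat", "rad"},  # assumed candidate set
    "rad": {"rat", "rad"},  # assumed candidate set
}
CONSTRAINTS = ["Ident[voice]/onset", "Ident[voice]", "*[+voice,-sonorant]"]
D_7A = [("da", "da"), ("rat", "rat")]  # data set (7a)
D_7B = [("da", "da"), ("rad", "rat")]  # data set (7b)

print(gen_cardinality(GEN, D_7A), gen_cardinality(GEN, D_7B))  # 2 2
print(instance_size(CONSTRAINTS, GEN, D_7B))                    # 3
```

With three constraints, two data pairs, and candidate sets of cardinality 2, the size of the instance is driven by the constraint set. An instance with many constraints and the same small candidate sets would have its size determined by |C| alone.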

### 2.3 Output of the Strong OT Consistency Problem and Its Complexity

A ranking is a linear order over the constraint set, usually denoted by ≫ and variants thereof. An example of a ranking of the constraints in (6c) is provided in (12a): it sandwiches the markedness constraint in between the two faithfulness constraints, with the positional faithfulness constraint at the top.

(12)

• a. Ident[voice]/onset ≫ *[+voice] ≫ Ident[voice]

• b. [Diagram of the ranking in (12a), with higher-ranked constraints placed at the top]

I will also adopt the standard representation in (12b), where higher-ranked constraints are placed at the top of the diagram.

Let me denote by OT_≫ the OT grammar corresponding to a ranking ≫, as defined in Prince and Smolensky 2004. A ranking ≫ is consistent with a data pair (x, y) of an underlying form x and a corresponding winner candidate y ∈ Gen(x) provided the corresponding grammar OT_≫ maps that underlying form x into that candidate y, namely, OT_≫(x) = y. The latter condition can be made more explicit as in (13). A ranking is consistent with a data set D provided it is consistent with every pair in the data set. And a data set D is consistent provided it is consistent with at least one ranking.

(13) For every loser candidate zGen(x) (I adopt the convention of striking out loser candidates), there exists a constraint that prefers the winner (i.e., it assigns fewer violations to the winner mapping xy than to the loser mapping xz) and is ≫-ranked above every constraint that instead prefers the loser (i.e., it assigns more violations to the winner mapping xy than to the loser mapping xz).

For instance, the data set D in (7b) is consistent only with the ranking (12), as only the corresponding grammar neutralizes voicing in codas but not in onsets.
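Condition (13) can be checked mechanically. The following is a minimal sketch, assuming that each candidate mapping is represented by a dictionary of violation counts. The function names and the specific violation counts are illustrative assumptions (the counts treat *[+voice] as penalizing voiced obstruents only); they are not taken from (6) or (7).

```python
from itertools import permutations
from typing import Dict, List, Sequence, Tuple

Violations = Dict[str, int]  # constraint name -> number of violations

CONSTRAINTS = ["Ident[voice]/Onset", "*[+voice]", "Ident[voice]"]


def profile(onset_faith: int, marked: int, faith: int) -> Violations:
    """Hypothetical helper: bundle three violation counts into a profile."""
    return dict(zip(CONSTRAINTS, (onset_faith, marked, faith)))


def pair_is_consistent(ranking: Sequence[str], winner: Violations,
                       losers: List[Violations]) -> bool:
    """Condition (13): for every loser, the highest-ranked constraint that
    distinguishes winner and loser must prefer the winner."""
    for loser in losers:
        for constraint in ranking:  # scan from top-ranked to bottom-ranked
            if winner[constraint] < loser[constraint]:
                break  # the highest deciding constraint prefers the winner
            if winner[constraint] > loser[constraint]:
                return False  # a loser-preferring constraint is ranked too high
        else:
            return False  # no constraint prefers the winner at all
    return True


def data_is_consistent(ranking: Sequence[str],
                       data: List[Tuple[Violations, List[Violations]]]) -> bool:
    """A ranking is consistent with a data set iff it is consistent with every pair."""
    return all(pair_is_consistent(ranking, w, ls) for w, ls in data)


# Assumed violation counts for the two data pairs discussed in the text.
DATA = [
    (profile(0, 0, 1), [profile(0, 1, 0)]),  # (/rad/, [rat]) against loser [rad]
    (profile(0, 1, 0), [profile(1, 0, 1)]),  # (/da/, [da]) against loser [ta]
]

# Enumerate all rankings; feasible only because there are three constraints.
consistent_rankings = [r for r in permutations(CONSTRAINTS)
                       if data_is_consistent(r, DATA)]
```

Under these assumed counts, the enumeration leaves a single consistent ranking, in line with the claim that only the ranking (12) is consistent with the data.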

The intuition behind the consistency problem is that the data set D has been generated by a target grammar in the typology that we would like to reconstruct on the basis of those data. Let’s distinguish two cases. In an ideal case, the data generated by the target grammar have not been corrupted by noise or transmission error. Hence, the data set is consistent—that is, consistent with the target grammar as well as with possibly many other grammars. In this case, the consistency problem asks for a grammar consistent with the data, as any such grammar could be the target grammar that has generated the data. In a more realistic scenario, some of the data might have been corrupted by noise, resulting in a data set D that might be inconsistent. Plausibly, though, these corrupted data are sparse, while the large majority of the data are correctly transmitted and thus consistent with the target grammar. In this case, the consistency problem thus asks for a grammar consistent with most of the data, as stated in the strong formulation (14) of the problem.8

(14)

• a.

Given:

• i.

an OT typology (X, Gen, C) that satisfies the finiteness assumption (9),

• ii.

a finite data set D of underlying forms and corresponding candidates;

• b.

Return: a ranking ≫ of the constraint set that is consistent with a largest consistent subset of D;

• c.

Size: the maximum among |C|, |D|, and |Gen(D)|.

By requiring a solution to maximize consistency, condition (14b) effectively requires the learner to be insensitive to noise.

This strong formulation of the OT consistency problem is intractable, as stated in theorem 1. This means that there is no algorithm that solves an arbitrary instance of the problem efficiently, namely, in a number of computational steps that grows slowly (i.e., polynomially) with the size of the problem.

Theorem 1.

The strong OT consistency problem (14) in section 2 is intractable.

This result is essentially due to Galil and Megiddo (1977); see also Cohen, Schapire, and Singer 1999. The detailed proof is postponed to appendix B of the online supplementary materials. Here is an outline of the reasoning. Given an arbitrary set A = {a, b, . . . } with finite cardinality |A|, consider a set T of triplets of elements of A. The set T is called linearly cyclically compatible provided there exists a linear order < on A such that for every triplet (a, b, c) ∈ T either a < b < c or b < c < a or c < a < b. To illustrate, the set T in (15a) is linearly cyclically compatible; the one in (15b) is not.

(15)

• a.

T = {(a, b, c), (b, c, d)}

• b.

T = {(a, b, c), (a, c, b)}
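The two examples in (15) can be verified by exhaustive search. The sketch below is a brute-force check with illustrative function names; it enumerates all |A|! linear orders and is therefore usable only on tiny instances, which is in keeping with the intractability of the general problem.

```python
from itertools import permutations
from typing import Iterable, Optional, Sequence, Tuple

Triple = Tuple[str, str, str]


def satisfies(order: Sequence[str], triple: Triple) -> bool:
    """True iff the linear order places the triple (a, b, c) in one of the
    three cyclic arrangements a < b < c, b < c < a, or c < a < b."""
    position = {element: i for i, element in enumerate(order)}
    a, b, c = (position[element] for element in triple)
    return a < b < c or b < c < a or c < a < b


def compatible_order(elements: Sequence[str],
                     triples: Iterable[Triple]) -> Optional[Tuple[str, ...]]:
    """Return some linear order compatible with every triple, or None."""
    triples = list(triples)
    for order in permutations(elements):
        if all(satisfies(order, t) for t in triples):
            return order
    return None


order_15a = compatible_order("abcd", [("a", "b", "c"), ("b", "c", "d")])
order_15b = compatible_order("abc", [("a", "b", "c"), ("a", "c", "b")])
```

For (15a) the search succeeds with the order a < b < c < d, while for (15b) no order exists, since the two triplets require opposite cyclic orientations of the same three elements.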

Galil and Megiddo (1977) prove that the cyclic ordering problem in (16) is intractable: any solution algorithm needs a very large number of computational steps on at least some of the instances of the problem.

(16)

• a.

Given:

• i.

a finite set A,

• ii.

a collection T ⊆ A × A × A of triplets of elements of A;

• b.

Return: a linear order < on A that is linearly cyclically compatible with a largest subset of T;

• c.

Size: the cardinality |A| of A.

The cyclic ordering problem (16) can be straightforwardly reduced to the strong OT consistency problem (14), in the sense that, for each instance of the cyclic ordering problem, there exists a corresponding instance of the strong OT consistency problem such that a solution to the latter can be straightforwardly turned into a solution to the former. If the strong OT consistency problem were tractable, then the cyclic ordering problem would therefore be tractable too. As the latter is intractable, the former is intractable as well.

### 2.4 Output of the Weak OT Consistency Problem and Its Complexity

Prompted by the intractability theorem 1, I thus explore a less demanding formulation of the OT consistency problem. In the case of consistent data, the learner is required to return a grammar consistent with each piece of data, as any such grammar could have generated the data. In the case of inconsistent data, the learning problem is effectively declared unsolvable, and the learner is just required to detect the inconsistency, without committing to any grammar. This proposal is formalized as the alternative weak formulation (17) of the OT consistency problem.

(17)

• a.

Given:

• i.

an OT typology (X, Gen, C) that satisfies the finiteness assumption (9),

• ii.

a finite data set D of underlying forms and corresponding candidates;

• b.

Return: a ranking ≫ of the constraint set that is consistent with the data set D, if any such ranking exists; otherwise, output ⊥;

• c.

Size: the maximum among |C|, |D|, and |Gen(D)|.

The switch from the strong to the weak formulation of the OT consistency problem has a drastic effect on its computational complexity: while the former formulation is intractable by theorem 1, the latter admits efficient solution algorithms, as stated in T&S’s theorem 2. This result has had a profound impact on the field. In fact, it represents the first explicit learnability result in OT. Furthermore, it provides a concrete case where no harm comes from adopting the universal formulation of a learning problem, whereby no restrictive assumptions are made on the typology provided as input (17ai).

Theorem 2.

The weak OT consistency problem (17) in section 2 is tractable.

In the rest of this section, I present T&S’s proof of theorem 2, split into two steps. The first step reduces the consistency problem to an abstract combinatorial problem (lemma 1 in section 2.5), exploiting Prince’s (2002) elementary ranking condition (ERC) notation. The second step provides an efficient solution algorithm for the latter problem (lemma 2 in section 2.6). The restatement of the consistency problem in ERC notation will play a crucial role in my proof of the intractability of the OT subset problem.

### 2.5 First Step: Reduction to ERC Notation

Consider the underlying/winner form pair (/rad/, [rat]) provided with the data set (7b). The Gen function in (6b) provides only one corresponding loser candidate, namely, [rad] (recall that I strike out loser candidates). We usually represent the relevant information concerning this data item as the tableau (18a).

(18)

This representation (18a) encodes the actual numbers of constraint violations. Yet these numbers are not really needed in order to determine OT consistency (13). All the information that is needed is just (18b): for every constraint, we just need to know whether it prefers the winner (namely, it assigns more violations to the loser [rad] than to the winner [rat]), prefers the loser (namely, it assigns more violations to the winner than to the loser), or is even (namely, it assigns the same number of violations to the winner and to the loser). Let me abbreviate (18b) as in (18c), marking each constraint with a W, an L, or an e depending on whether it is winner- or loser-preferring or even.

If we adopt the same representation for the other pair (/da/, [da]), then we end up representing the data set D in (7b) with the matrix in (19). Its elements are all W’s, L’s, and e’s; it has as many columns as there are constraints; it has as many rows as there are relevant triplets of an underlying form, the intended winner, and a corresponding loser.

(19)

As noted above, this data set D in (7b) is only consistent with the ranking Ident[voice]/On ≫ *[+voice] ≫ Ident[voice] in (12). If the columns of the matrix (19) are ordered according to that ranking from left to right in decreasing order, then the leftmost entry different from e is a W in both rows. No other ordering of the columns of the matrix (19) has this property. In other words, the set of rankings that solve the instance of the consistency problem corresponding to this data set D coincides with the set of reorderings of the columns of the matrix (19) that place a W to the left of each L. These simple considerations can be straightforwardly generalized as follows.

Following Prince (2002), an elementary ranking condition (ERC) is a tuple whose entries are taken from the three symbols W, L, and e. An ERC is usually represented as a row, as illustrated in (18c). A certain number of ERCs can be organized one on top of another into an ERC matrix, as illustrated in (19). I will denote by A an arbitrary ERC matrix and by a an arbitrary ERC. I will often omit e’s for the sake of readability. I assume that the columns of A are labeled by constraint names C1, C2, . . . . A ranking ≫ over these constraints is called consistent with the ERC matrix provided condition (20) holds, and an ERC matrix A is called consistent provided it is consistent with at least one ranking.

(20) Once the columns of the ERC matrix A are reordered from left to right in decreasing order according to ≫, the leftmost non-e entry of every row is a W.
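Condition (20) translates directly into a short check. The sketch below assumes an ERC matrix stored as a list of rows over the symbols "W", "L", and "e". The matrix used for illustration is my reconstruction of (19) from the surrounding description, not a copy of the original display, and the function name is hypothetical.

```python
from itertools import permutations
from typing import List, Sequence

COLUMNS = ["Ident[voice]/Onset", "*[+voice]", "Ident[voice]"]

# Assumed reconstruction of the ERC matrix (19), one row per winner/loser pair.
MATRIX = [
    ("e", "W", "L"),  # (/rad/, [rat]) against loser [rad]
    ("W", "L", "W"),  # (/da/, [da]) against loser [ta]
]


def erc_consistent(ranking: Sequence[str], columns: List[str],
                   matrix: Sequence[Sequence[str]]) -> bool:
    """Condition (20): after reordering the columns by the ranking,
    the leftmost non-e entry of every row must be a W."""
    order = [columns.index(c) for c in ranking]
    for row in matrix:
        leftmost = next((row[i] for i in order if row[i] != "e"), None)
        # A row with only e's has no W at all; it is treated as inconsistent,
        # matching the existential requirement in (13).
        if leftmost != "W":
            return False
    return True


consistent = [r for r in permutations(COLUMNS)
              if erc_consistent(r, COLUMNS, MATRIX)]
```

With the reconstructed matrix, exactly one of the six column orderings passes the check, which matches the observation that no other ordering of the columns of (19) has the required property.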

With these preliminaries in place, I now introduce the purely combinatorial problem (21).

(21)

• a.

Given: an ERC matrix A;

• b.

Return: a ranking ≫ that is OT-consistent with the ERC matrix A according to (20), if the matrix is consistent; otherwise, return ⊥;

• c.

Size: the maximum between the number of columns and of rows of A.

The input to the problem is an ERC matrix; the task is to find a ranking consistent with that matrix; the size of an instance of the problem depends on the size of the ERC matrix.

The relationship between the weak OT consistency problem (17) and the combinatorial problem (21) can be brought out as follows. The data set D given with an instance (17) of the consistency problem can be paired up with its corresponding ERC matrix AD defined as in (22), which generalizes the procedure illustrated in (19).

(22)

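• For every underlying/winner form pair (x, y) in the data set D and for every loser candidate z ∈ Gen(x), construct the corresponding ERC a = (a1, . . . , an) as follows: the kth entry ak is W if the constraint Ck assigns more violations to the loser mapping (x, z) than to the winner mapping (x, y); it is L if Ck assigns more violations to the winner mapping than to the loser mapping; and it is e if Ck assigns the same number of violations to both.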

• Organize all these ERCs one underneath the other into an ERC matrix AD
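The construction in (22) can be sketched as follows, assuming that violation counts are available as dictionaries keyed by constraint name. The counts below are illustrative assumptions for the two data pairs discussed in the text, and the function names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Violations = Dict[str, int]

CONSTRAINTS = ["Ident[voice]/Onset", "*[+voice]", "Ident[voice]"]


def erc(winner: Violations, loser: Violations,
        constraints: Sequence[str]) -> Tuple[str, ...]:
    """Build one ERC: W if the constraint prefers the winner,
    L if it prefers the loser, e if it is even."""
    row = []
    for c in constraints:
        if winner[c] < loser[c]:
            row.append("W")
        elif winner[c] > loser[c]:
            row.append("L")
        else:
            row.append("e")
    return tuple(row)


def erc_matrix(data: List[Tuple[Violations, List[Violations]]],
               constraints: Sequence[str]) -> List[Tuple[str, ...]]:
    """Stack one ERC per (winner, loser) combination, as in (22)."""
    return [erc(winner, loser, constraints)
            for winner, losers in data
            for loser in losers]


def profile(*counts: int) -> Violations:
    return dict(zip(CONSTRAINTS, counts))


# Assumed violation counts (not taken from (6c)).
DATA = [
    (profile(0, 0, 1), [profile(0, 1, 0)]),  # (/rad/, [rat]) against [rad]
    (profile(0, 1, 0), [profile(1, 0, 1)]),  # (/da/, [da]) against [ta]
]

A_D = erc_matrix(DATA, CONSTRAINTS)
```

The resulting matrix has one column per constraint and one row per winner/loser combination, so its dimensions stay within the bounds that the text goes on to state.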

The number of columns of the ERC matrix AD constructed in (22) coincides with the number |C| of constraints, as stated in (23a), and the number of its rows can be bounded as in (23b). In fact, we get an ERC for each one of the |D| pairs in the data set and for each one of the corresponding loser candidates, where the number of candidates is bounded above by the cardinality |Gen(D)| of the largest candidate set.

(23)

• a.

The number of columns of AD is equal to |C|.

• b.

The number of rows of AD is at most |D||Gen(D)|.

By (23), an instance of the OT consistency problem (17) can be transformed through (22) into an instance of the ERC problem (21) with comparable size. Comparability of the sizes of the two problems hinges on the assumption (11c) that the size of an instance of the OT consistency problem generously depends not only on |C| and |D| but also on |Gen(D)|.

A straightforward generalization of the reasoning below (19) shows that a ranking ≫ is OT-consistent with a data set D according to (13) if and only if that ranking ≫ is consistent with the corresponding ERC matrix AD according to (20). And a data set D is OT-consistent if and only if the corresponding ERC matrix AD is consistent. In other words, (20) is a graphic description of the original notion of OT consistency (13). And the ERC problem (21) extracts the combinatorial core of the OT consistency problem (17), abstracting away from phonological details. Any algorithm that efficiently solves the former problem can thus be turned into an algorithm that efficiently solves the latter, as stated in the following lemma:

Lemma 1.

Given an algorithm Solve(21) that solves the ERC problem (21) efficiently, we can construct an algorithm Solve(17) that solves the OT consistency problem (17) efficiently as follows:

(24) Solve(17)(D) = Solve(21)(AD)

namely, by running Solve(21) on the ERC matrix AD corresponding to the data set D provided with an instance of problem (17).

This lemma represents the first step of T&S’s proof of theorem 2: it says that, in order to prove that the original OT consistency problem (17) is tractable, it is sufficient to prove that the combinatorial reformulation (21) in terms of ERC matrices is tractable.

### 2.6 Second Step: Recursive Constraint Demotion

T&S next develop a simple solution algorithm for the combinatorial ERC problem (21). Let me illustrate the idea with an example, as in (25).

(25)

Suppose that the ERC matrix is the one constructed in (19). Our goal is to come up with a ranking ≫ consistent with it according to (20)—that is, such that the ≫-reordered matrix has the property that the leftmost non-e entry of each row is a W. The top-ranked constraint must thus head a column that does not contain a single L. In our case, the only such constraint is Ident[voice]/Onset, which thus is assigned to the top stratum at the first iteration (25a). The constraint that can be assigned to the next stratum must head a column whose only L’s belong to rows where the top-ranked constraint Ident[voice]/Onset has a W. In other words, it must head a column that does not contain a single L once we strike out the rows where the top-ranked constraint Ident[voice]/Onset has a W. In our case, the only such constraint is *[+voice], which thus is assigned to the second stratum at the next iteration (25b). The constraint that can be assigned to the next stratum must head a column that does not contain a single L once we strike out rows where at least one of the two top-ranked constraints Ident[voice]/Onset and *[+voice] has a W. In our case, the only such constraint is Ident[voice], which thus is assigned to the bottom (third) stratum at iteration (25c). As all constraints have been ranked, the algorithm stops.

The procedure just illustrated extends to a general ERC matrix A as indicated in (26), which is T&S’s Recursive Constraint Demotion (RCD) algorithm. Step (26a) corresponds to the vertical arrows in (25) and step (26b) corresponds to the horizontal arrows.

(26) Repeat steps (a) and (b) until the if-condition in step (a) is no longer met.

• a.

If there are yet un-struck-out constraints whose columns in A do not contain any un-struck-out L, assign an arbitrary such constraint to the highest available rank.

• b.

Strike out every row of A that has a W corresponding to the constraint just picked in step (a) and then strike out its entire column.

If all constraints are struck out, return the ranking assembled; otherwise, return ⊥.

If the input ERC matrix A is consistent, then RCD returns a ranking consistent with A. And all rankings consistent with A belong to the search space of RCD.9 If instead the input ERC matrix A is not consistent, then the loop (26a)–(26b) ends before all constraints are ranked and RCD outputs ⊥. If the input ERC matrix has m rows and n columns, then RCD repeats steps (26a)–(26b) at most n times, and each repetition takes at most nm time (the algorithm needs to scan n columns with m entries each), for a total of at most n²m steps, which is polynomial in the size of the instance. We have thus proven the following:

Lemma 2.

RCD (26) is an efficient solution algorithm for the combinatorial ERC problem (21), which is therefore tractable.
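The following is a minimal sketch of RCD as described in (26), assuming an ERC matrix stored as rows over "W", "L", and "e". It always picks the first eligible constraint, which is one arbitrary choice among those that (26a) allows, and it returns `None` in place of ⊥. The final guard for rows containing only e's is my addition, so that the result agrees with (13) on such degenerate rows. The matrix is an assumed reconstruction of (19).

```python
from typing import List, Optional, Sequence


def rcd(columns: Sequence[str],
        matrix: Sequence[Sequence[str]]) -> Optional[List[str]]:
    """Recursive Constraint Demotion over an ERC matrix.
    Returns a consistent ranking (top-ranked first) or None for ⊥."""
    col = {c: i for i, c in enumerate(columns)}
    rows = [tuple(r) for r in matrix]  # rows not yet struck out
    unranked = list(columns)           # columns not yet struck out
    ranking: List[str] = []
    while unranked:
        # Step (26a): constraints whose column has no L among remaining rows.
        eligible = [c for c in unranked
                    if all(r[col[c]] != "L" for r in rows)]
        if not eligible:
            return None  # some constraints cannot be ranked: inconsistent
        pick = eligible[0]  # arbitrary choice among eligible constraints
        ranking.append(pick)
        # Step (26b): strike out the column and every row where it has a W.
        unranked.remove(pick)
        rows = [r for r in rows if r[col[pick]] != "W"]
    # Added guard: any surviving row consists of e's only and has no W.
    return None if rows else ranking


COLUMNS = ["Ident[voice]/Onset", "*[+voice]", "Ident[voice]"]
MATRIX = [("e", "W", "L"), ("W", "L", "W")]  # assumed reconstruction of (19)

ranking_19 = rcd(COLUMNS, MATRIX)
inconsistent = rcd(["C1", "C2"], [("W", "L"), ("L", "W")])
```

On the reconstructed matrix the three iterations pick Ident[voice]/Onset, then *[+voice], then Ident[voice], mirroring the walk-through in (25). The loop body scans at most n columns of m entries and runs at most n times.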

Lemma 1 guarantees that tractability of the ERC combinatorial problem (21) ensures tractability of the weak OT consistency problem (17). Lemma 2 then ensures that the former combinatorial problem is indeed tractable. Theorem 2 concerning tractability of the weak OT consistency problem (17) thus follows from these two lemmas.

## 3 On the Complexity of the Subset Problem in OT

In this section, I turn to the subset problem, which asks for a consistent grammar generating a smallest language, as informally stated in (3). I carefully formulate the problem within OT and present new results concerning its computational complexity.

### 3.1 Input to the OT Subset Problem

As stated in (27a), the input provided with an instance of the OT subset problem is the same as in the case of the OT consistency problem: it consists of complete information concerning the OT typology together with a finite set of linguistic data.

(27)

• a.

Given:

• i.

an OT typology, described through universal specifications (X, Gen, C);

• ii.

a finite consistent data set D of underlying/winner form pairs;

• b.

Return: . . .

• c.

Size: . . .

Again as in the case of the consistency problem, the typological information is provided in (27ai) through a triplet (X, Gen, C) consisting of a set X of underlying forms, a generating function Gen, and a constraint set C. No restrictions are placed on the underlying OT typology, so that again I am focusing on the universal formulation of the problem. In the case of the consistency problem, the universality of the formulation turned out not to impede efficient solution algorithms. This section will show that the situation is very different in the case of the subset problem.

The linguistic information is provided in (27aii) through a finite set D of pairs of an underlying form x ∈ X and a corresponding candidate y ∈ Gen(x). In principle, two cases need to be considered. One case is that the data set D is consistent with some grammar in the typology: for each pair (x, y) in D, that grammar predicts the candidate y to be the winner for the underlying form x. Alternatively, a portion of the data set D might have been corrupted by noise and transmission error, with the result that the whole data set D is inconsistent. In the latter case, theorem 1 says that finding a ranking that is consistent with most of the data yields an intractable problem. The complexity of the task can only get worse if we add to the consistency condition further conditions, such as the subset condition. Indeed, in the case of inconsistent data, the only feasible request is just to detect the inconsistency, by theorem 2. But in this case, the additional subset condition has no bite, and the consistency and subset problems thus collapse. For this reason, in my analysis of the complexity of the OT subset problem I will assume that the data set D is always consistent, as explicitly stated in (27aii). As I am aiming for an intractability result, restricting the problem to consistent data can only strengthen my point.

### 3.2 Output of the OT Subset Problem

The target grammar that has generated the linguistic data is thereby consistent with those data, as long as they have not been corrupted by noise. In order to reconstruct the target grammar, it thus makes sense to look among the grammars consistent with the data, as enforced by the consistency problem. Yet there might be many consistent grammars, and we thus need further heuristics in order to select among those. As recalled in (2), a large literature has suggested the following Subset Principle: whenever two languages L and L′ are both consistent with the data and are in a subset relation as in (28a), the subset language L represents a better guess than the superset language L′.

(28)

Note that the Subset Principle does not discriminate between the two languages L and L′ in (28b), as they are not in a subset relation, even though L′ is more inclusive than L and thus should in some sense be dispreferred. As Fodor and Sakas (2005:540) put it, “[A]ll smallest [consistent] languages . . . stand on equal footing from the perspective of the Subset Principle. They do not stand in subset relations to each other, so differences between them, such as their size, are of no concern to the Subset Principle. Since the Subset Principle deals only with subset relations, it cannot favor one such language over another even if one is very large and the other is very small.” For discussion of cases such as (28b) within the linguistic literature, see Jarosz 2006.

In order to formalize the Subset Principle, let me denote by L≫ the language corresponding to a ranking ≫, namely, the set of those candidate surface forms y that are attainable through ≫, in the sense that there exists at least one underlying form x ∈ X such that the OT grammar OT≫ maps that underlying form x into that surface form y.10 To illustrate, the ranking Ident[voice]/Onset ≫ *[+voice] ≫ Ident[voice] in (12), which allows for voicing contrasts in onsets but not in codas, generates the language L≫ = {[da], [ta], [rat]}.
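The definition of the language of a ranking can be made concrete on a toy typology. The sketch below assumes four underlying forms, two candidates each, and hand-assigned violation counts; all of these are illustrative assumptions rather than the specifications in (6). The OT winner is computed by lexicographic comparison of violation profiles in ranking order.

```python
from typing import Dict, Sequence, Set, Tuple

CONSTRAINTS = ["Ident[voice]/Onset", "*[+voice]", "Ident[voice]"]

# GEN[x][y] lists the assumed violations of the mapping (x, y),
# in the order given by CONSTRAINTS.
GEN: Dict[str, Dict[str, Tuple[int, int, int]]] = {
    "/da/":  {"[da]":  (0, 1, 0), "[ta]":  (1, 0, 1)},
    "/ta/":  {"[ta]":  (0, 0, 0), "[da]":  (1, 1, 1)},
    "/rad/": {"[rad]": (0, 1, 0), "[rat]": (0, 0, 1)},
    "/rat/": {"[rat]": (0, 0, 0), "[rad]": (0, 1, 1)},
}


def ot_winner(ranking: Sequence[str], underlying: str) -> str:
    """Return the candidate whose violation profile, read in ranking
    order, is lexicographically smallest."""
    order = [CONSTRAINTS.index(c) for c in ranking]
    candidates = GEN[underlying]
    return min(candidates,
               key=lambda y: tuple(candidates[y][i] for i in order))


def language(ranking: Sequence[str]) -> Set[str]:
    """The set of surface forms attained from at least one underlying form."""
    return {ot_winner(ranking, x) for x in GEN}


positional = language(["Ident[voice]/Onset", "*[+voice]", "Ident[voice]"])
neutralizing = language(["*[+voice]", "Ident[voice]/Onset", "Ident[voice]"])
faithful = language(["Ident[voice]", "Ident[voice]/Onset", "*[+voice]"])
```

Under these assumptions the ranking (12) yields {[da], [ta], [rat]}, the markedness-dominant ranking yields the proper subset {[ta], [rat]}, and the faithfulness-dominant ranking yields all four surface forms. Note that computing a language requires visiting every underlying form in the typology, not just those in a data set.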

The subset condition can now be explicitly stated as in (29bii). The learner needs to find a ranking that not only is OT-consistent with the data set D, by the consistency condition (29bi), but also generates a language that is as small as possible (with respect to set inclusion) among consistent rankings, by the subset condition (29bii).

(29)

• a.

Given:

• i.

an OT typology, described through universal specifications (X, Gen, C),

• ii.

a finite consistent data set D of underlying/winner form pairs;

• b.

Return: a ranking ≫ over the constraint set such that

• i.

≫ is consistent with the data set D,

• ii.

there is no other consistent ranking ≫′ such that L≫′ ⊊ L≫;

• c.

Size: . . .

For discussion of this subset condition (29bii) within a variety of theoretical frameworks, see for instance Angluin 1980, Dell 1981, Berwick 1985, Manzini and Wexler 1987, Safir 1987, Wexler and Manzini 1987, Clark 1992, Hale and Reiss 2003, Hayes 2004, and Prince and Tesar 2004, as well as Fodor and Sakas 2005 and Heinz and Riggle 2011 for a review. In section 4, I will illustrate a concrete modeling application of the subset problem (29), as it properly formalizes the learning challenge faced by the child acquiring the target adult phonotactics at an early developmental stage.

### 3.3 Output of the Subset Problem in Terms of Restrictiveness Measures

Fodor and Sakas (2005:519) note that “in order to make decisions with respect to the Subset Principle [(29bii)], LM [i.e., the learning mechanism] must be able to recognize when a subset/superset choice presents itself. That LM has access to this information is often assumed without discussion, but it is far from obvious how it can be achieved. At worst, it could require LM to know (innately or by computation) . . . all the subset relations. . . . Is this feasible? There seem to be three broad alternatives. (i) LM might directly compare . . . the candidate languages. Or (ii) LM might be innately equipped with a specification of all language pairs that stand in subset relations. Or (iii) LM might be able to compare the grammars of the candidate languages, and choose between them on the basis of some general formal criterion.” Expanding on option (iii), they ask (p. 521), “Could LM examine the competing grammars, i.e., make intensional rather than extensional comparisons? Is there a formal property of grammars that would reveal which stand in subset relations? This is an attractive possibility which holds promise of eliminating the workload excesses of alternative (i), while minimizing the extent of innate programming needed for alternative (ii). It amounts to the postulation of a [restrictiveness] measure” (R-measure) μ that pairs a grammar G with a number μ(G) that provides a relative measure of the size of the corresponding language L(G) and thus can be used to compute subset relations.

Prince and Tesar (2004) make the following concrete proposal concerning how such a restrictiveness measure could be defined within OT. An OT constraint set C = ℱ ∪ ℳ is usually split up into the subset ℱ of faithfulness constraints and the subset ℳ of markedness constraints. Prince and Tesar suggest the R-measure μ defined in (30): it maps a ranking ≫ into the number μ(≫) of pairs of a faithfulness and a markedness constraint such that the former is ≫-ranked above the latter. To illustrate, the measure μ(≫) of the ranking Ident[voice]/Onset ≫ *[+voice] ≫ Ident[voice] in (12) is just 1, as there is one markedness constraint ranked underneath the faithfulness constraint Ident[voice]/Onset and none underneath the faithfulness constraint Ident[voice].

(30) μ(≫) = |{(F, M) ∈ ℱ × ℳ | F ≫ M}|
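The R-measure in (30) is a simple pair count. The sketch below assumes a ranking given as a list from top-ranked to bottom-ranked, with the faithfulness and markedness constraints listed separately; the function name is illustrative.

```python
from typing import Iterable, Sequence


def r_measure(ranking: Sequence[str],
              faithfulness: Iterable[str],
              markedness: Iterable[str]) -> int:
    """Count the pairs (F, M) such that the faithfulness constraint F
    is ranked above the markedness constraint M, as in (30)."""
    position = {c: i for i, c in enumerate(ranking)}  # 0 = top-ranked
    markedness = list(markedness)
    return sum(1
               for f in faithfulness
               for m in markedness
               if position[f] < position[m])


FAITHFULNESS = ["Ident[voice]/Onset", "Ident[voice]"]
MARKEDNESS = ["*[+voice]"]

mu_12 = r_measure(["Ident[voice]/Onset", "*[+voice]", "Ident[voice]"],
                  FAITHFULNESS, MARKEDNESS)
mu_faithful = r_measure(["Ident[voice]/Onset", "Ident[voice]", "*[+voice]"],
                        FAITHFULNESS, MARKEDNESS)
mu_marked = r_measure(["*[+voice]", "Ident[voice]/Onset", "Ident[voice]"],
                      FAITHFULNESS, MARKEDNESS)
```

For the ranking (12) the count is 1, as stated in the text. Ranking both faithfulness constraints on top gives 2, and ranking the markedness constraint on top gives 0.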

Suppose that a language L≫′ corresponding to a ranking ≫′ is a proper subset of a language L≫ corresponding to a ranking ≫. This means that the OT grammar OT≫′ that generates the subset language neutralizes more forms than the grammar OT≫ that generates the superset language. Faithfulness constraints work toward the preservation of underlying contrasts while markedness constraints work against it and in favor of contrast neutralization. We thus expect the ranking ≫′ that neutralizes more forms to have a smaller R-measure μ(≫′) than the measure μ(≫) of the ranking ≫ that preserves more of the underlying contrasts. On the basis of these heuristic considerations, Prince and Tesar (2004) suggest restating the subset condition (29bii) in terms of the R-measure μ as in (31bii).11

(31)

• a.

Given:

• i.

an OT typology, described through universal specifications (X, Gen, C),

• ii.

a finite consistent data set D of underlying/winner form pairs;

• b.

Return: a ranking ≫ over the constraint set C such that

• i.

≫ is consistent with the data set D,

• ii.

there is no other consistent ranking ≫′ such that μ(≫′) is strictly smaller than μ(≫);

• c.

Size: . . .

According to this alternative formulation (31), the OT subset problem is the problem of minimizing the R-measure μ defined in (30) over all rankings consistent with the data.12
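To make the optimization in (31) concrete, the sketch below minimizes μ by exhaustively enumerating rankings. This is deliberately naive: it inspects all n! rankings of n constraints and is meant only to pin down what is being asked, not to suggest a practical method. The constraint names and the one-row ERC matrix are hypothetical.

```python
from itertools import permutations
from typing import List, Optional, Sequence, Tuple


def erc_consistent(ranking: Sequence[str], columns: List[str],
                   matrix: Sequence[Sequence[str]]) -> bool:
    """Condition (20): the leftmost non-e entry of every row is a W."""
    order = [columns.index(c) for c in ranking]
    for row in matrix:
        leftmost = next((row[i] for i in order if row[i] != "e"), None)
        if leftmost != "W":
            return False
    return True


def r_measure(ranking: Sequence[str], faithfulness: Sequence[str],
              markedness: Sequence[str]) -> int:
    """The pair count μ defined in (30)."""
    position = {c: i for i, c in enumerate(ranking)}
    return sum(1 for f in faithfulness for m in markedness
               if position[f] < position[m])


def min_mu_ranking(columns: List[str], matrix: Sequence[Sequence[str]],
                   faithfulness: Sequence[str],
                   markedness: Sequence[str]) -> Optional[Tuple[str, ...]]:
    """Brute force: among all consistent rankings, return one with minimal μ."""
    consistent = [r for r in permutations(columns)
                  if erc_consistent(r, columns, matrix)]
    if not consistent:
        return None  # cannot happen when the data set is consistent
    return min(consistent,
               key=lambda r: r_measure(r, faithfulness, markedness))


# Hypothetical instance: one faithfulness and two markedness constraints,
# and a single ERC requiring F to dominate M1.
COLUMNS = ["F", "M1", "M2"]
MATRIX = [("W", "L", "e")]

best = min_mu_ranking(COLUMNS, MATRIX, ["F"], ["M1", "M2"])
best_mu = r_measure(best, ["F"], ["M1", "M2"])
```

In this instance three rankings are consistent, and the minimizer is M2 ≫ F ≫ M1 with μ = 1: the data force F above M1, while M2 is free to be ranked at the top.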

### 3.4 Size of the OT Subset Problem

The subset problem (29) makes reference to the language L≫ generated by a ranking ≫. The number of steps needed to compute this language depends on the size of the constraint set and on the size of the candidate sets, as well as on the size of the set of underlying forms. Thus, let me assume that these sets are all finite, as stated in (32). The latter finiteness assumption is substantially stronger than the finiteness assumption (9) needed for the consistency problem. In fact, (32c) requires the set of underlying forms to be finite too, and furthermore (32b) requires all candidate sets to be finite, not just those corresponding to underlying forms that appear in the data set D. Of course, once this strong finiteness assumption is in place, the requirement that the data set D be finite is automatically satisfied and can thus be dropped from the formulation of the problem.

(32)

• a.

The constraint set is finite.

• b.

The candidate set Gen(x) is finite, for every underlying form x.

• c.

The set of underlying forms is finite.

The size of an instance of the OT subset problem thus depends on |X| and |Gen(X)| as in (33c), rather than on |D| and |Gen(D)| as in the case of the consistency problem (11c).13

(33)

• a.

Given:

• i.

an OT typology (X, Gen, C) that satisfies the finiteness assumption (32),

• ii.

a consistent data set D of underlying/winner form pairs;

• b.

Return: a ranking ≫ over the constraint set such that

• i.

≫ is consistent with the data set D,

• ii.

there is no other consistent ranking ≫′ such that L≫′ ⊊ L≫;

• c.

Size: the maximum among |C|, |X|, and |Gen(X)|.

Assumption (32c), that the set of underlying forms is finite, is overly restrictive. But such an overly restrictive assumption, together with the generous definition of the size in (33c), strengthens (rather than weakens!) the intractability result stated below in theorem 3.

The case of Prince and Tesar’s alternative formulation of the subset condition considered in section 3.3 is rather different. A proper R-measure μ makes it possible to determine the relative restrictiveness of two rankings ≫ and ≫′ by comparing just the two corresponding values μ(≫) and μ(≫′), without having to compute and compare the two corresponding languages L≫ and L≫′. In this case, a solution algorithm thus does not have to go through every single underlying form in X in order to compute the language corresponding to a certain ranking. It therefore makes sense to let the size of an instance of the problem depend only on the cardinality |D| of the data set and on the cardinality |Gen(D)| of the largest candidate set over underlying forms in D, as in (34c). This definition of the size is more restrictive than (33c), which depends instead on the cardinality |X| of the entire set of forms and on the cardinality |Gen(X)| of the largest candidate set over all underlying forms in X. In the case of Prince and Tesar’s (2004) problem, the original finiteness assumption (9) thus suffices, with no need for the set of underlying forms to be finite too.

(34)

• a.

Given:

• i.

an OT typology (X, Gen, C) that satisfies the finiteness assumption (9),

• ii.

a finite consistent data set D of underlying/winner form pairs;

• b.

Return: a ranking ≫ over the constraint set such that

• i.

≫ is consistent with the data set D,

• ii.

there is no other consistent ranking ≫′ such that μ(≫′) < μ(≫);

• c.

Size: the maximum among |D|, |C|, and |Gen(D)|.

As in the case of the consistency problem, letting the size (34c) of an instance of the problem depend on the cardinality |Gen(D)| of the largest candidate set means that a solution algorithm can afford the time to list and inspect all candidates. Again, this generous definition of the size makes the intractability result that follows stronger.14

### 3.5 Complexity of the OT Subset Problem

The main contribution of this article consists of theorems 3 and 4, which characterize the computational complexity of the subset problem in OT.

Theorem 3.

The OT subset problem (33) in section 3 is intractable.

Theorem 4.

Prince and Tesar’s (2004) alternative formulation of the OT subset problem (34) in section 3 is intractable.

Throughout this article, I have focused on the universal formulation of learning problems—that is, formulations that impose no restrictions on the underlying typologies. Exploring the computational complexity of these universal formulations means addressing the following question: does the bare ranking logic of OT (transitivity, dominance, etc.) provide enough structure in order for a solution algorithm to succeed? Or is it instead the case that additional structure needs to be introduced, by carefully restricting the typologies to those that are linguistically plausible? In section 2, we have seen that the consistency problem is easy enough that a solution algorithm can solve it by using just the structure provided by the bare OT ranking logic: T&S’s theorem 2 guarantees that the consistency problem is tractable, even in its universal formulation (leaving aside the issue of inconsistent data, as shown by theorem 1). Theorems 3 and 4 say that the situation is very different for the case of the subset problem. The additional subset condition drastically changes the computational complexity of the problem. And the structure provided by the bare OT ranking logic no longer suffices. Additional structure needs to be introduced, by restricting the formulation of the subset problem to linguistically plausible typologies. Section 4 will illustrate a modeling implication of this result.

Proofs of theorems 3 and 4 are provided in appendices C and D of the online supplementary materials. Here is an outline of the reasoning. Again, the proofs hinge on the fact that the cyclic ordering problem (16) already considered in section 2 is intractable, as shown by Galil and Megiddo (1977). Appendix C shows that an alleged solution algorithm for Prince and Tesar’s (2004) problem (34) could be turned into an efficient solution algorithm for the cyclic ordering problem. As the latter problem admits no efficient solution algorithm, Prince and Tesar’s (2004) problem cannot admit any either and is therefore intractable, as stated in theorem 4. Appendix D then shows that an alleged solution algorithm for the original formulation (33) of the OT subset problem could be turned into an efficient solution algorithm for Prince and Tesar’s (2004) problem (34). As the latter problem admits no efficient solution algorithm, the OT subset problem (33) cannot admit any either and is therefore intractable, as stated in theorem 3.

Because of the generous dependence (33c) of the size of the subset problem on the number of candidates, intractability holds even though an alleged solution algorithm can afford the time to list all candidates. In other words, this intractability result is orthogonal to other intractability results in OT available in the literature; see Eisner 1997, 2000, Wareham 1998, and Idsardi 2006, as well as Heinz, Kobele, and Riggle 2009 for discussion. Furthermore, a close look at the proofs reveals that the OT subset problem remains intractable even when restricted to data with the simplest ‘‘disjunctive structure,’’ in the sense that for each underlying/winner/loser form triplet there are at most two winner-preferring constraints.15

## 4 Implications for Modeling the Early Acquisition of Phonotactics

From a review of the psycholinguistic and linguistic literature, Hayes (2004:161) concludes that ‘‘at more or less [eight to ten months], infants start to acquire knowledge of the legal . . . sequences of their language’’ as shown by the fact that ‘‘in carefully monitored experimental situations, [they] come to react differently to legal phoneme sequences in their native language than to illegal . . . ones.’’ On the other hand, ‘‘certainly we can say that there are at least some morphological processes which are acquired long after the system of contrasts and phonotactics is firmly in place’’ (p. 165). In conclusion, ‘‘it seems a reasonable guess that in general, the learning of patterns of alternation [which only comes with knowledge of morphology] lags the learning of the contrast and phonotactic systems’’ (p. 165). There is therefore an early stage of the acquisition of phonotactics characterized by the two properties in (35).

(35)

• a.

Properties of the input: Throughout the early stage, morphology lags behind and the child is thus blind to alternations.

• b.

Properties of the output: By the end of the early stage, the child is able to distinguish legal versus illegal structures; in other words, the child has acquired the adult phonotactics.

Of course, (35) is the informal statement of a computational problem, namely, a mapping from an input (35a) to an output (35b). How should this problem be properly formalized? What is its complexity? What should a proper model of this early acquisition stage look like? This section discusses how theorems 1–4 bear on these questions, thus concretely illustrating their modeling implications for child language acquisition.

### 4.1 Modeling the Learning Task as a Subset Problem

Suppose that the underlying OT typology is the one defined in (36), based on Lombardi 1999 and Prince and Tesar 2004. The set of phonological forms (36a) consists of stops and fricatives that differ in voicing, both in isolation and in a stop + fricative sequence.

(36)

• a.

{pa, ba, sa, za, apsa, apza, absa, abza}

• b.

The constraint set (36b) consists of dedicated faithfulness and markedness constraints for the two features [stop-voicing] and [fricative-voicing] (namely, F1, M1 and F2, M2, respectively), plus a markedness constraint M that makes these two features interact, requiring two adjacent obstruents to agree in voicing.

The learner has access to some data generated by some target adult OT grammar in this typology. For instance, the data set could look like the one in (37). The learner might have been exposed to some alternations that show that the target adult grammar devoices underlying fricatives, thus providing evidence for the data pair (/za/, [sa]). Also, the learner might have been exposed to instances of [ba] and [abza], thus providing evidence that they are not neutralized, as encoded in the data pairs (/ba/, [ba]) and (/abza/, [abza]).

(37) D = {(/za/, [sa]), (/ba/, [ba]), (/abza/, [abza])}

The learner’s task is to reconstruct the target grammar that might have generated those data. Assuming that the data have not been corrupted, the target grammar is consistent with the data. A straightforward learning approach is thus to look for a consistent grammar—namely, to solve the corresponding instance of the consistency problem. A ranking is consistent with the data set (37) provided it enforces the ranking conditions (38).16

(38)

Crucially, the rankings that satisfy these ranking conditions (38) all generate the same grammar, namely, the grammar that enforces the mappings in (37) and furthermore reduces /absa/ and /apza/ to [abza] and [apsa], respectively (/pa/, /sa/, and /apsa/ are of course mapped faithfully to themselves, as they are unmarked). This unique grammar thus has got to be the target grammar. And the consistency problem thus offers a proper formalization of the learning task in the case considered. As has long been noted, the problem of learning the target adult phonology plausibly reduces to the consistency problem, as long as the learner is provided with a sufficiently rich set of alternations (see, e.g., Dell 1981).

Yet the input property (35a) of the early stage says that the child is blind to alternations throughout this stage, as morphology is lagging behind. Thus, the child’s evidence consists just of surface forms, without any information on the corresponding underlying forms, which could only come through alternations. Being exposed to the surface form [sa], the child has to pick the corresponding underlying form, thus constructing either the faithful data pair (/sa/, [sa]) or the unfaithful pair (/za/, [sa]). The choice of the unfaithful data pair is equivalent to the assumption that the target phonology enforces fricative devoicing. But this assumption might be dangerous. As the child is still blind to alternations, he is not in a position to evaluate such an assumption. The choice of the nonfaithful data pair (/za/, [sa]) might thus turn out to fool the child into positing an inconsistent data set. And theorem 1 says that even the bare consistency problem becomes intractable in this case. A large literature has thus assumed that the child posits underlying forms faithful to the corresponding winner, as illustrated in (39). Indeed, Tesar (2008) shows that assuming faithful underlying forms ensures consistent data sets, under only mild conditions on the constraint set.

(39) D = {(/sa/, [sa]), (/ba/, [ba]), (/abza/, [abza])}

This assumption of faithful underlying forms models property (35a) of the early stage, namely, the fact that the child is blind to alternations (Hayes 2004, Prince and Tesar 2004).

Any ranking that enforces the previous ranking conditions (38), repeated in (40a), is consistent with the new data set (39). Yet there are a number of further rankings that are consistent with the latter data set, such as the two in (40b) and (40c). These rankings generate different grammars, which in turn correspond to different languages. As noted above, ranking (40a) neutralizes /za/, /apza/, and /absa/ and thus generates language (41a). Ranking (40b) only neutralizes /absa/ and /apza/ and thus generates language (41b). Finally, ranking (40c) neutralizes nothing and thus generates the entire language (41c).

(40)

(41)

In other words, the three rankings (40a–c) are all solutions to the consistency problem, even though they correspond to very different grammars and thus generate very different languages (41a–c). Crucially, these languages are in a subset relation.
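The subset relation among the languages generated by consistent rankings can be checked mechanically on this small typology. The following Python sketch is only an illustration under stated assumptions. The constraint definitions (F1 and F2 as identity constraints for stop and fricative voicing, M1 and M2 as bans on voiced stops and voiced fricatives, M as voicing agreement), the candidate sets (all forms sharing the obstruent skeleton of the underlying form), and the three concrete total rankings labeled `a`, `b`, and `c` are hypothetical choices made to reproduce the behavior that the text describes for (40a–c); they are not taken from the tableaux themselves.

```python
from itertools import permutations

FORMS = ["pa", "ba", "sa", "za", "apsa", "apza", "absa", "abza"]
STOPS, FRICATIVES, VOICED = set("pb"), set("sz"), set("bz")
CONSTRAINTS = ["F1", "M1", "F2", "M2", "M"]

def skeleton(form):
    """Replace each obstruent by its manner class, ignoring voicing."""
    return "".join("T" if c in STOPS else "S" if c in FRICATIVES else c for c in form)

def gen(underlying):
    """Assumed candidate set: forms that differ from the input only in voicing."""
    return [f for f in FORMS if skeleton(f) == skeleton(underlying)]

def violations(constraint, underlying, candidate):
    """Assumed constraint definitions for the typology in (36)."""
    pairs = list(zip(underlying, candidate))
    if constraint == "F1":  # faithfulness to stop voicing
        return sum(u != c for u, c in pairs if u in STOPS)
    if constraint == "F2":  # faithfulness to fricative voicing
        return sum(u != c for u, c in pairs if u in FRICATIVES)
    if constraint == "M1":  # no voiced stops
        return sum(c in STOPS and c in VOICED for c in candidate)
    if constraint == "M2":  # no voiced fricatives
        return sum(c in FRICATIVES and c in VOICED for c in candidate)
    if constraint == "M":   # adjacent obstruents agree in voicing
        obs = [c for c in candidate if c in STOPS or c in FRICATIVES]
        return sum((a in VOICED) != (b in VOICED) for a, b in zip(obs, obs[1:]))
    raise ValueError(constraint)

def winner(ranking, underlying):
    """OT optimum: violation profiles are compared lexicographically, top constraint first."""
    return min(gen(underlying),
               key=lambda cand: [violations(k, underlying, cand) for k in ranking])

def language(ranking):
    """Set of surface forms produced by the grammar from all underlying forms."""
    return {winner(ranking, u) for u in FORMS}

def consistent(ranking, data):
    return all(winner(ranking, u) == w for u, w in data)

D37 = [("za", "sa"), ("ba", "ba"), ("abza", "abza")]   # data set (37)
D39 = [("sa", "sa"), ("ba", "ba"), ("abza", "abza")]   # data set (39)

# Hypothetical total rankings standing in for (40a-c).
RANKINGS = {
    "a": ["M", "F1", "M1", "M2", "F2"],
    "b": ["M", "F1", "F2", "M1", "M2"],
    "c": ["F1", "F2", "M", "M1", "M2"],
}
languages = {name: language(r) for name, r in RANKINGS.items()}

# Distinct grammars (input-output mappings) among all rankings consistent with (37).
grammars_37 = {tuple(winner(r, u) for u in FORMS)
               for r in permutations(CONSTRAINTS) if consistent(r, D37)}

for name in "abc":
    print(name, consistent(RANKINGS[name], D39), sorted(languages[name]))
print("distinct grammars consistent with (37):", len(grammars_37))
```

Under these assumptions, all three rankings are consistent with the faithful data set (39), yet their languages are properly nested, while the data set (37) with its unfaithful pair pins down a single grammar.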

So far, I have looked at the input property (35a) of the early stage of the acquisition of phonotactics: I have modeled the child’s blindness to alternations through the assumption that the data set consists of faithful mappings, as in (39). Let me now turn to the output property (35b) of the early stage, according to which the child manages to learn the target adult phonotactics, despite lack of alternations. Knowledge of phonotactics is twofold: the child needs to learn to rule in licit forms and to rule out illicit ones.

Suppose the target language is (41c) and the child incorrectly entertains the hypothesis that it is the subset language (41a) instead; that is, the child incorrectly assumes the forms [apza], [absa], and [za] to be illicit. In this case, the child will be able to withdraw from his faulty hypothesis. In fact, he will likely be provided, say, with the form [apza], posit the faithful data pair (/apza/, [apza]), and realize that his current grammar (40a) is inconsistent with this piece of data. Incorrect subset hypotheses thus do not pose a learning threat because they can be corrected on the basis of the type of evidence the child has available at this early stage.

Next, consider the reverse case. Suppose that the target language is (41a) and the child incorrectly entertains the hypothesis that it is the superset language (41c) instead; that is, the child incorrectly assumes the forms [apza], [absa], and [za] to be licit. In this case, the child will not be able to withdraw from his faulty hypothesis. In fact, withdrawing from the hypothesis would require a data pair such as (/apza/, [apsa]), which provides evidence that /apza/ is neutralized rather than produced faithfully. But such data are unavailable at this early developmental stage: as morphology is lagging behind, the child is blind to alternations and therefore only posits faithful data pairs. Incorrect superset hypotheses thus do pose a learning threat because they cannot be corrected on the basis of the type of evidence the child has available at this early developmental stage.

These considerations concretely illustrate the Subset Principle (2). They show that the consistency problem does not provide a proper formalization of the early stage (35) of the acquisition of phonotactics, as it would allow for superset solutions, which would prevent the child from acquiring the target phonotactics. The subset problem provides a better formulation, as it is designed to protect the learner from the threat of superset solutions.
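The asymmetry between subset and superset errors can be stated in a few lines of Python. This is a minimal sketch, assuming that the learner only ever observes licit surface forms of the target language (that is, faithful pairs) and that the languages (41a) and (41c) are as described in the text: (41c) contains all eight forms and (41a) lacks [za], [apza], and [absa]. The function name `counterexamples` is hypothetical.

```python
ALL_FORMS = {"pa", "ba", "sa", "za", "apsa", "apza", "absa", "abza"}
L41A = ALL_FORMS - {"za", "apza", "absa"}   # restrictive language, as described for (41a)
L41C = set(ALL_FORMS)                       # superset language, as described for (41c)

def counterexamples(hypothesis, target):
    """Surface forms of the target that the hypothesized language wrongly rules out.

    With positive evidence only, these are the only data that can
    reveal that the current hypothesis is wrong.
    """
    return sorted(target - hypothesis)

# Subset error: the target is (41c) but the child assumes (41a).
print(counterexamples(L41A, L41C))   # forms that will eventually expose the error

# Superset error: the target is (41a) but the child assumes (41c).
print(counterexamples(L41C, L41A))   # empty: no faithful datum can expose the error
```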

### 4.2 Error-Driven or Batch Models of the Early Stage?

Having characterized the computational nature of the learning task raised by the early stage (35) of the acquisition of phonotactics, I now turn to the issue of its proper modeling. Two modeling schemes have been considered in the OT literature. One scheme is based on batch ranking algorithms, such as Prince and Tesar’s (2004) Biased Constraint Demotion (BCD) and Hayes’s (2004) Low Faithfulness Constraint Demotion (LFCD). These algorithms are based on the following common intuition. T&S’s RCD algorithm described in section 2.6 solves the consistency problem iteratively. At each iteration, it picks a constraint that is currently available for ranking, assigns that constraint to the highest available rank, and gets rid of the data that are accounted for by that constraint. If at every iteration there is only one constraint that is currently available for ranking, then the data are consistent with a single ranking, and the two corresponding instances of the consistency and subset problems collapse. Otherwise, the algorithm needs to choose one among the possibly many constraints that are available for ranking at the current iteration. According to T&S’s original formulation (26), RCD chooses at random. Indeed, any choice is just as good as any other from the perspective of the consistency problem. But that is not the case from the perspective of the subset problem, which is the problem that properly formalizes the early stage (35) we want to model. In fact, some choices will lead to a superset ranking, others to a restrictive one. LFCD and BCD are obtained from RCD by adding dedicated subroutines for the choice of the constraint at each iteration, in order to bias the choice toward a restrictive consistent ranking.

An alternative modeling scheme for the early stage (35) of the acquisition of phonotactics is based on error-driven ranking algorithms, such as T&S’s Error-Driven Constraint Demotion (EDCD), Boersma’s (1998) Gradual Learning Algorithm (GLA), or my (2012a) calibrated error-driven ranking algorithm (CEDRA). According to this algorithmic scheme, the learner entertains a current ranking, which represents its current hypothesis about the target adult phonotactics. This current hypothesis is initialized to a restrictive initial ranking (say, one that ranks all markedness constraints above all faithfulness constraints and thus predicts only unmarked forms to be licit). Data come in a stream, one piece of data at a time. The learner checks whether its current ranking is consistent with the current piece of data. If that is not the case, the current ranking is slightly modified. As a result of these rerankings, the initial strictest language is progressively enlarged, hopefully only as much as needed.

Let me illustrate these two algorithmic approaches with a concrete example. Suppose the data set fed to the learner is again (39), consisting of faithful underlying/winner form pairs, in compliance with property (35a) of the early stage. It is useful to construct the corresponding set of ERCs, according to (22). For the data pair (/ba/, [ba]), we need to consider only the loser [pa]. The ERC corresponding to the underlying/winner/loser form triplet (/ba/, [ba], [pa]) is ERC 1 in (42a).

(42)

For the data pair (/abza/, [abza]), we need to consider the three losers [apsa], [apza], and [absa]. The ERCs corresponding to the three underlying/winner/loser form triplets (/abza/, [abza], [apsa]), (/abza/, [abza], [apza]), and (/abza/, [abza], [absa]) are ERCs 2–4 in (42a).17
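The four ERCs just described can be recomputed from violation profiles. The Python sketch below is an illustration only: the constraint definitions are assumptions (identity constraints F1 and F2 for stop and fricative voicing, bans M1 and M2 on voiced stops and voiced fricatives, and agreement constraint M), and an ERC is encoded as a dictionary mapping each constraint to `W` (winner-preferring), `L` (loser-preferring), or `e` (even). The helper names `profile` and `erc` are hypothetical.

```python
STOPS, FRICATIVES, VOICED = set("pb"), set("sz"), set("bz")
CONSTRAINTS = ["F1", "F2", "M1", "M2", "M"]

def profile(underlying, candidate):
    """Violation counts under the assumed constraint definitions."""
    pairs = list(zip(underlying, candidate))
    obs = [c for c in candidate if c in STOPS or c in FRICATIVES]
    return {
        "F1": sum(u != c for u, c in pairs if u in STOPS),
        "F2": sum(u != c for u, c in pairs if u in FRICATIVES),
        "M1": sum(c in STOPS and c in VOICED for c in candidate),
        "M2": sum(c in FRICATIVES and c in VOICED for c in candidate),
        "M": sum((a in VOICED) != (b in VOICED) for a, b in zip(obs, obs[1:])),
    }

def erc(underlying, winner, loser):
    """Compare winner and loser constraint by constraint."""
    w, l = profile(underlying, winner), profile(underlying, loser)
    return {c: "W" if l[c] > w[c] else "L" if l[c] < w[c] else "e"
            for c in CONSTRAINTS}

TRIPLETS = [
    ("ba", "ba", "pa"),         # ERC 1
    ("abza", "abza", "apsa"),   # ERC 2
    ("abza", "abza", "apza"),   # ERC 3
    ("abza", "abza", "absa"),   # ERC 4
]
ERCS = [erc(*t) for t in TRIPLETS]
for number, row in enumerate(ERCS, start=1):
    print(number, row)
```

Under these assumptions, only M1 and M2 head columns containing an L, and M has a W exactly in ERCs 3 and 4.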

Here is how the batch approach works. At the first iteration, the constraints F1, F2, and M are all available to be assigned to the top (first) stratum, as they all head columns that contain no L’s. RCD (26) would choose one at random, as the choice has no effects for the consistency problem. However, the choice does have an effect for the subset problem. On the basis of Prince and Tesar’s (2004) intuition recalled in section 3.3, markedness (faithfulness) constraints should be ranked high (low) in order to achieve restrictiveness. Both BCD and LFCD thus have subroutines that select M over F1 and F2 in this case, and assign it to the top (first) stratum. The ERC matrix can thus be simplified as in (42b), striking out the two bottom ERCs, where the top-ranked constraint M has a W. At the second iteration, only F1 and F2 are available for ranking, as they are the only constraints heading columns that contain no L’s. Again, RCD (26) would choose at random between the two. But the choice of F1 is better than F2, as the former ‘‘frees up’’ both M1 and M2, while the latter would ‘‘free up’’ only M2. Both BCD and LFCD thus have subroutines that select F1 over F2 in this case and assign it to the second stratum. All ERCs can now be struck out, as in (42c). RCD would rank the remaining constraints in an arbitrary order. BCD and LFCD instead are geared toward ranking markedness (faithfulness) constraints as high (low) as possible; thus, they first rank M1 and M2 and then finally rank F2 at the bottom. The final ranking M ≫ F1 ≫ M1 ≫ M2 ≫ F2 satisfies the restrictive ranking conditions in (40a), as desired.
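The iterations just described can be traced with a small Python sketch. It is a simplified illustration and does not reproduce the actual subroutines of BCD or LFCD: it prefers any available markedness constraint, and otherwise picks the available faithfulness constraint that frees up the most markedness constraints. The ERC entries are reconstructed from the description in the text (an absent key stands for an even entry), ties are broken by list order rather than at random, and the function names are hypothetical.

```python
# ERCs 1-4, reconstructed from the prose description of (42a).
ERCS = [
    {"F1": "W", "M1": "L"},
    {"F1": "W", "F2": "W", "M1": "L", "M2": "L"},
    {"F1": "W", "M": "W", "M1": "L"},
    {"F2": "W", "M": "W", "M2": "L"},
]
CONSTRAINTS = ["F1", "F2", "M1", "M2", "M"]
MARKEDNESS = {"M1", "M2", "M"}

def available(unranked, ercs):
    """Constraints whose column contains no L in the remaining ERCs."""
    return [c for c in unranked if all(e.get(c) != "L" for e in ercs)]

def strike_out(constraint, ercs):
    """Remove the ERCs accounted for by a W of the newly ranked constraint."""
    return [e for e in ercs if e.get(constraint) != "W"]

def freed_markedness(constraint, unranked, ercs):
    """Number of markedness constraints available once `constraint` is ranked."""
    rest = [c for c in unranked if c != constraint]
    return sum(c in MARKEDNESS for c in available(rest, strike_out(constraint, ercs)))

def biased_rcd(constraints, ercs):
    unranked, ranking = list(constraints), []
    while unranked:
        options = available(unranked, ercs)
        if not options:
            raise ValueError("inconsistent ERCs: no constraint can be ranked")
        marked = [c for c in options if c in MARKEDNESS]
        if marked:
            choice = marked[0]   # markedness first, to stay restrictive
        else:
            choice = max(options, key=lambda c: freed_markedness(c, unranked, ercs))
        ranking.append(choice)
        unranked.remove(choice)
        ercs = strike_out(choice, ercs)
    return ranking

print(" >> ".join(biased_rcd(CONSTRAINTS, ERCS)))
```

On these ERCs the sketch ranks M first, then F1, then M1 and M2, and finally F2, matching the order described above.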

Here is how the error-driven approach works. The data come in a stream. For concreteness, assume that the data are sampled uniformly from the data set (39). Each data pair is completed with a corresponding loser form into an underlying/winner/loser form triplet, or equivalently its corresponding ERC. For concreteness, assume that the loser is set equal to the candidate predicted by the current ranking entertained by the algorithm. Following Boersma (1997, 1998), this current ranking is represented by assigning to each constraint a numerical ranking value, with the understanding that a constraint is ranked above another constraint provided the ranking value of the former is larger than the ranking value of the latter (for details, see Magri 2012a:sec. 2). The ranking values are initialized by assigning a small initial ranking value to the faithfulness constraints F1 and F2 (say, 0 for concreteness) and a large initial ranking value to the markedness constraints M1, M2, and M (say, 100 for concreteness), so that the initial grammar only allows for unmarked forms. Whenever the ranking represented by the current ranking values is inconsistent with the current ERC, the ranking values are slightly modified. More precisely, the algorithm slightly demotes the loser-preferring constraints that are responsible for the current failure, namely, the undominated ones that are not already ranked underneath a winner-preferring constraint. And it furthermore promotes the winner-preferring constraints. For concreteness, I assume here that the undominated loser-preferrers are demoted by 1 while each of the w winner-preferrers is promoted by a calibrated amount that depends on w, as in CEDRA. The dynamics of the current ranking values of the five constraints over time can be plotted as in figure 1, with time on the horizontal axis and ranking values on the vertical axis. The final ranking values satisfy the restrictive ranking conditions in (40a), as desired. 
It turns out that error-driven learning succeeds on every language in the typology (36) under very mild assumptions on the frequencies with which the data are sampled and fed to the algorithm; see Magri 2011 for details.
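The error-driven dynamics can be simulated in a few lines of Python. The sketch below rests on several assumptions that are not fixed by the text: the constraint definitions are the hypothetical ones used to interpret (36b), the promotion amount is set to 1/(w + 1) for each of the w winner-preferrers, ties between ranking values are broken by list order, and the data set (39) is presented in a fixed cyclic order rather than sampled uniformly, so that the run is reproducible. It is not the implementation behind figure 1.

```python
from fractions import Fraction

FORMS = ["pa", "ba", "sa", "za", "apsa", "apza", "absa", "abza"]
STOPS, FRICATIVES, VOICED = set("pb"), set("sz"), set("bz")
CONSTRAINTS = ["F1", "F2", "M1", "M2", "M"]
D39 = [("sa", "sa"), ("ba", "ba"), ("abza", "abza")]

def skeleton(form):
    return "".join("T" if c in STOPS else "S" if c in FRICATIVES else c for c in form)

def gen(underlying):
    """Assumed candidates: forms differing from the input only in voicing."""
    return [f for f in FORMS if skeleton(f) == skeleton(underlying)]

def profile(underlying, candidate):
    """Violation counts under the assumed constraint definitions."""
    pairs = list(zip(underlying, candidate))
    obs = [c for c in candidate if c in STOPS or c in FRICATIVES]
    return {
        "F1": sum(u != c for u, c in pairs if u in STOPS),
        "F2": sum(u != c for u, c in pairs if u in FRICATIVES),
        "M1": sum(c in STOPS and c in VOICED for c in candidate),
        "M2": sum(c in FRICATIVES and c in VOICED for c in candidate),
        "M": sum((a in VOICED) != (b in VOICED) for a, b in zip(obs, obs[1:])),
    }

def predict(theta, underlying):
    """Winner under the ranking induced by theta (stable sort breaks ties by list order)."""
    order = sorted(CONSTRAINTS, key=lambda c: -theta[c])
    return min(gen(underlying),
               key=lambda cand: [profile(underlying, cand)[c] for c in order])

def update(theta, underlying, winner, loser):
    w, l = profile(underlying, winner), profile(underlying, loser)
    winner_pref = [c for c in CONSTRAINTS if l[c] > w[c]]
    loser_pref = [c for c in CONSTRAINTS if l[c] < w[c]]
    top_w = max(theta[c] for c in winner_pref)
    for c in loser_pref:
        if theta[c] >= top_w:          # undominated loser-preferrer: demote by 1
            theta[c] -= 1
    for c in winner_pref:              # assumed promotion amount: 1 / (w + 1)
        theta[c] += Fraction(1, len(winner_pref) + 1)

def learn(data, max_passes=1000):
    theta = {c: Fraction(100 if c.startswith("M") else 0) for c in CONSTRAINTS}
    updates = 0
    for _ in range(max_passes):
        errors = 0
        for underlying, winner in data:
            loser = predict(theta, underlying)
            if loser != winner:        # current ranking is inconsistent with this datum
                update(theta, underlying, winner, loser)
                errors += 1
        updates += errors
        if errors == 0:
            return theta, updates
    raise RuntimeError("no convergence within max_passes")

theta, updates = learn(D39)
print(sorted(CONSTRAINTS, key=lambda c: -theta[c]), updates)
```

In this run M is never demoted, F1 ends above both M1 and M2, and M2 ends above F2, so that /za/ is still devoiced by the final grammar even though no unfaithful pair was ever presented.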

Figure 1.

Dynamics of the ranking values of the five constraints in (36b) in a run of the calibrated error-driven ranking algorithm on the data set (39)


Although the mechanics of the two approaches are quite different, it turns out that, at least in the case considered, the error-driven approach manages to implement in an automatic fashion the subroutines that are hardwired into the batch approach. Let me bring out this parallelism. At the first iteration (42a), the batch approach assigns the markedness constraint M at the top of the ranking. This is due to a dedicated subroutine that picks markedness over faithfulness constraints whenever possible. Also in the case of the error-driven approach, M ends up at the top of the ranking (i.e., ends up with the largest final ranking value). This follows automatically from the fact that it starts at the top and is never demoted, as it is never loser-preferring. The batch approach then deletes the two ERCs 3 and 4 right at the first iteration, so that they play no role in the construction of the rest of the ranking. This is due to RCD’s algorithmic logic, which simplifies the data already accounted for by the constraints already ranked (in this case, by the top-ranked constraint M). Also in the case of the error-driven approach, these two ERCs play no role; that is, they trigger no updates in the run depicted in figure 1. This follows automatically from the fact that M sits at the top of the ranking throughout learning, ensuring that the current ranking values are always consistent with ERCs 3 and 4. At the second iteration (42b), the batch approach recognizes that F1 needs to be ranked higher than F2. This is due to a dedicated subroutine that is sensitive to the fact that F1 accounts for both ERCs 1 and 2, while F2 accounts only for ERC 2. Also in the case of the error-driven approach, F1 outranks F2 (i.e., the final ranking value of F1 is larger than that of F2). This follows automatically from the fact that F1 is promoted by both ERCs 1 and 2 while F2 is promoted only by ERC 2, so that a few updates by ERC 1 are sufficient to ensure that F1 will always outrank F2. 
Finally, the batch approach sandwiches the two markedness constraints M1 and M2 in between the two faithfulness constraints F1 and F2. This is due again to a dedicated subroutine that picks markedness constraints whenever possible and thus picks M1 and M2 before F2. Also in the case of the error-driven approach, M1 and M2 end up in between F1 and F2. This follows automatically from the facts that F1 is ranked above F2 (as just discussed), that M1 and M2 need to be ranked underneath F1 for consistency, and that they will stop right before dropping below F2 as well, because by the time they cross F1 the current ranking has become consistent with the data and thus no more updates are performed.

The error-driven approach has been endorsed by the OT acquisition literature. In fact, the model describes a sequence of intermediate rankings that can be matched with child acquisition paths, thus modeling the observed gradualness of child acquisition. Yet error-driven learning has been dismissed with suspicion by the OT computational literature, which has instead focused on the batch approach. In fact, error-driven learning is unlikely to be able to enforce restrictiveness in the general case, beyond cases with some special structure, such as the typology (36) discussed above. The only provision toward restrictiveness of error-driven learning consists of the choice of a restrictive initial ranking, namely, one that ranks faithfulness (markedness) constraints low (high), yielding a smallest language. Restrictiveness of the final grammar only holds as long as this initial bias exerts an effect throughout the entire learning process, so that the initial restrictive language is enlarged only as much as needed. But the sequence of rerankings performed by the algorithm crucially depends on the sequence of data. As a result, the model feels like a leaf in the wind of data, with few guarantees that its final grammar will be restrictive. A malicious stream of data might thus fool the algorithm into superset languages. For instance, Prince and Tesar (2004:251) write, ‘‘Learning takes place over time as more data accumulate. The choice between faithfulness and markedness solutions recurs at each stage of the process. It is not enough to set up an initial state in which [markedness constraints outrank faithfulness ones]; rather, this must be enforced throughout learning, at each step.’’ In conclusion, error-driven learning might be able to mimic the restrictiveness biases of the batch approach in certain cases with special structure, such as the typology (36) discussed above. 
But these biases need to be hardwired into the learner as in the batch approach, in order for the learner to succeed beyond these special cases.

This is where the intractability result provided by theorems 3 and 4 enters the scene. In this article, I have focused on the universal formulation of the learning problems, whereby the OT typology is not constrained in any way and figures as an input to the problem. This means that a solution algorithm is required to work efficiently both in the case of phonologically plausible typologies and in the case of phonologically implausible and bizarre typologies. Theorems 3 and 4 say that such an unconstrained formulation of the subset problem is intractable. In other words, the structure provided by the bare ranking logic of OT (transitivity, dominance, etc.), which suffices for the case of the easier consistency problem, does not suffice in the case of the more demanding subset problem. Additional structure needs to be provided in order for solution algorithms to work efficiently. This additional structure should plausibly be provided in the form of restrictions on phonologically plausible typologies. These theorems thus say that any algorithm needs to take advantage of this additional structure in order to tackle the subset problem and thus enforce the restrictiveness needed for a proper model of the early stage (35) of the child’s acquisition of phonotactics. It is thus unfair to hold against error-driven learning the fact that it is unable to enforce restrictiveness in the general case and needs instead to rely on special typological structure, as that is indeed the case for any algorithmic scheme, including batch algorithms. No matter which algorithmic scheme we choose, it will have to crucially rely on the special structure introduced by carefully stated restrictions to phonologically plausible typologies. This conclusion motivates the following conjecture: is it the case that those phonologically plausible restrictions suffice in particular to ensure restrictiveness of the error-driven ranking model? 
In other words, is it the case that phonologically plausible OT typologies happen to have the property just observed for the typology (36), namely, that every language in the typology is able to train the error-driven ranking algorithm toward a restrictive ranking? If this conjecture turns out to be correct, it will provide formidable support for the hypothesis that error-driven learning is a proper model of the child’s acquisition of phonotactics. And theorems 3 and 4 guarantee that the heavy reliance on special typological properties is not due to the algorithmic weakness of error-driven learning, but is an unavoidable condition for the success of any algorithmic scheme.

In Magri 2012b, I have started to pursue this conjecture. The starting point is the intuition that the relative ranking of the faithfulness constraints is mainly relevant for the way that illicit structures are repaired. Only rarely does it turn out to be crucial for the distinction between licit and illicit structures. That is, it only rarely matters for phonotactics. Thus, let me informally say that a language is F-irrelevant provided its phonotactics does not require any specific relative ranking of the faithfulness constraints. I then prove that properly designed error-driven ranking algorithms are restrictive on every such language; that is, they are restrictive on the vast majority of languages.18 The upshot of this result is that restrictiveness of error-driven ranking algorithms now only needs to be investigated for those few special cases that require a special relative ranking of the faithfulness constraints. Language (41a) is one such case, as it crucially requires F1 to be ranked above F2 as in (40a). As seen above, that case has a special structure, which makes the necessary relative ranking of F1 above F2 transparent to error-driven learning, despite the lack of any specific bias. Do other phonologically plausible cases display a similar structure? In Magri 2011, I report an encouraging preliminary result in this direction. I consider all possible constraints of the type of M in (36b), which is responsible for the interaction between the two features that define the typology. And I show that error-driven ranking algorithms (that perform both constraint demotion and promotion) manage to learn the relative ranking of the two faithfulness constraints whenever it is needed, but for phonologically implausible models of feature interaction.

## 5 Conclusions

Language learning is the problem of reconstructing the target adult grammar within the typology of possible grammars, on the basis of a finite set of data generated by that grammar. In this article, I have investigated the complexity of various formulations of this problem within OT.

Of course, the target grammar that has generated the data is consistent with those data. Yet a small portion of the data might have been corrupted by noise or transmission error, resulting in an inconsistent data set. As only a small portion of the data is likely to have been corrupted, it makes sense to look for the target adult grammar among the grammars that are consistent with the large majority of the data. These considerations led to a first formalization of the language-learning problem, namely, the strong formulation (14) of the OT consistency problem. Theorem 1 (which is a restatement of results from the machine learning literature on preference learning; Galil and Megiddo 1977, Cohen, Schapire, and Singer 1999) says that this problem is intractable: no algorithm can solve an arbitrary instance of the problem efficiently. This intractability result is well known to extend to alternative frameworks such as HG (Johnson and Preparata 1978). Indeed, intractability plausibly does not depend on the choice of the framework and captures instead the intrinsic difficulty of the learning task, namely, of maximizing consistency over the data set.

Prompted by this intractability result, I have thus looked at the weaker formulation (17) of the OT consistency problem. In the case of consistent data, the problem asks for a grammar consistent with each piece of data. But in the case of inconsistent data, the problem just asks for detection of the inconsistency. In section 2, I have presented T&S’s theorem 2, which ensures that this formulation of the language-learning problem is tractable, as any instance is solved efficiently by their RCD algorithm.

The consistency problem is intrinsically ill-posed: there are in general multiple grammars that are consistent with a data set and thus count as a solution. Further heuristics are thus needed in order to choose among those grammars. The Subset Principle articulated in (2) provides one such heuristic: the child needs to avoid superset generalizations, from which it is hard to withdraw owing to lack of negative evidence. The formulation of the language-learning problem is thus refined into the OT subset problem (33), which compounds the consistency condition (33bi) with the additional subset condition (33bii). I have also considered Prince and Tesar’s (2004) alternative formulation (34) of the problem, which approximates the relative size of the language generated by a ranking with the relative height of the faithfulness constraints according to that ranking. In section 3, I have presented the main results of this article: theorems 3 and 4 say that the subset problem is intractable, both in its original formulation and in Prince and Tesar’s (2004) alternative formulation in terms of R-measures. As for the strong consistency problem, I have conjectured that this intractability result extends to alternative frameworks (such as HG or derivational frameworks) and thus captures the intrinsic computational difficulty of the subset condition.

All problem formulations considered in the article are universal (Barton, Berwick, and Ristad 1987:secs. 1.4.4 and 2.3, Heinz, Kobele, and Riggle 2009): no restrictions are posed on the underlying typology of grammars. A solution algorithm is therefore required to work efficiently both for linguistically plausible typologies and for completely implausible ones. T&S’s theorem 2 ensures that the OT consistency problem is solvable even in its universal formulation. In other words, the bare ranking logic of OT provides enough structure to support efficient solution algorithms. Theorems 3 and 4 say that this structure is instead insufficient for the more demanding subset problem. Further structure needs to be introduced into the problem through explicit restrictions on OT typological specifications (i.e., on the generating function, on the constraint set, etc.) in order to support efficient solution algorithms. In section 4, I have discussed the implications of this finding for the choice between batch and error-driven models for the instance of the subset problem posed by the early stage of the child’s acquisition of phonotactics.

## Notes

I wish to thank Adam Albright for lots of help and discussion. I also wish to thank Alan Prince for detailed comments on an earlier draft of this article, as well as Mark Johnson, Jason Riggle, and Donca Steriade for useful discussion. Parts of the material have been presented at NECPhon 2 (Yale University; 15 November 2008) and at SIGMorPhon 11 (University of Uppsala, Sweden; 15 July 2010; see also Magri 2010), whose audiences provided valuable feedback. I wish to thank the anonymous reviewers of SIGMorPhon 11 and Linguistic Inquiry for detailed comments that greatly improved the article. This work was supported in part by a ‘‘Euryi’’ grant from the European Science Foundation (‘‘Presupposition: A Formal Pragmatic Approach’’ to Philippe Schlenker).

2 Dell (1981) motivates the Subset Principle in phonology by looking at another special learning circumstance: learning optional rules.

3 The supplementary online materials for this article are available at http://www.mitpressjournals.org/doi/suppl/10.1162/ling_a_00134.

4 For instance, Eisner (2000:32–33) writes, “Our main conclusion is a warning that OT carries large computational burdens. When formulating the OT learning problem, even small nods in the direction of realism quickly drive the complexity . . . up . . . into the higher complexity classes. . . . Hence all OT generation and learning algorithms should be suspect.” Echoing that conclusion, Idsardi (2006:273) writes, “I have offered a simple proof that the generation problem for OT is NP-hard. . . . [T]his makes OT in general computationally intractable. In contrast, rule-based derivational systems are easily computable, belonging to the class of polynomial-time algorithms.”

5 The set of underlying forms actually does not play any role in the OT consistency problem (see footnotes 6 and 8). The input (8ai) to the problem could thus have been defined (more precisely) as just the pair $(Gen, \mathcal{C})$ of the generating function and the constraint set, without any mention of the set of underlying forms. Furthermore, it is sufficient to specify the constraints on those underlying forms that appear in the data set $D$. Let me make this point explicit. A constraint $C$ is a function from pairs of an underlying form $x$ and a candidate $y \in Gen(x)$ into the corresponding number $\xi = C(x, y)$ of violations. Equivalently, a constraint is a set of triplets of the form $(x, y, \xi)$. Let me denote by $C_D$ the restriction of the constraint $C$ to the underlying forms in $D$, namely, the subset of those triplets $(x, y, \xi)$ such that $x$ appears in some pair in the data set $D$. And let me denote by $\mathcal{C}_D$ the set of restricted constraints. The input (8ai) to the consistency problem could thus have been defined (more precisely) as $(Gen, \mathcal{C}_D)$, just in terms of restricted constraints.

6 According to the finiteness assumption (9), the constraint set and the generating function need to be finite, but the set of underlying forms need not be finite. And indeed, the size (11c) of an instance of the OT consistency problem does not depend on the cardinality of the set of underlying forms. As anticipated in footnote 5, the set of underlying forms does not play any role in the consistency problem.

7 Letting the size of an instance of the OT consistency problem depend on |Gen(D)|, as well as on |C| and |D|, ensures that the corresponding decision problem (provided in (68) in appendix B of the online supplementary materials) belongs to NP, namely, that it admits a polynomial-time verification algorithm. See appendix A of the online supplementary materials for details.

8 The consistency condition (13) only looks at the underlying forms that appear in the data pairs in D and at their candidates; it does not in any way depend on the remaining underlying forms. As anticipated in footnote 5, the set of underlying forms does not play any role in the definition of the output condition (14b) of the consistency problem.

9 The definition of RCD given in (26) is slightly different from T&S’s original definition. The difference shows up for input ERC matrices that have more than one constraint with no un-struck-out L’s. According to definition (26), RCD arbitrarily chooses one such constraint and assigns it to the highest available rank. According to T&S’s original definition, RCD assigns all such constraints to the highest available stratum and then outputs a total ranking that is an arbitrary refinement of the nontotal hierarchy thus constructed. To see how the two definitions differ, consider the ERC matrix in (i).

(i)

T&S’s original RCD first computes the nontotal hierarchy {C1, C2} ≫ {C3} and then outputs one of its two refinements, either C1 ≫ C2 ≫ C3 or C2 ≫ C1 ≫ C3. Thus, the ranking C1 ≫ C3 ≫ C2 lies outside the search space of the original RCD, even though this ranking too is consistent with the given ERC matrix (i). Instead, the version of RCD defined in (26) might output such a ranking, provided that the algorithm chooses C1 at the first iteration, C3 at the next iteration, and C2 at the last iteration. More generally, the version of RCD defined in (26) imposes no artificial restrictions on the search space of the algorithm, which indeed contains any ranking consistent with the given ERC matrix.
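The contrast between the two versions of RCD can be illustrated with a minimal Python sketch. It rests on several assumptions that are not part of the article: ERCs are encoded as dictionaries from constraint names to "W", "L", or "e"; the function names `rcd_single` and `rcd_strata` are hypothetical; and the one-row ERC matrix used below is a made-up example chosen only because it yields the hierarchy {C1, C2} ≫ {C3}. It is not claimed to be the matrix (i).

```python
from typing import Callable, Dict, List

ERC = Dict[str, str]  # constraint name -> "W", "L", or "e"


def unblocked(remaining: List[str], active: List[ERC]) -> List[str]:
    """Constraints that assign no L to any ERC not yet struck out."""
    return [c for c in remaining if all(erc[c] != "L" for erc in active)]


def rcd_single(constraints: List[str], ercs: List[ERC],
               choose: Callable[[List[str]], str]) -> List[str]:
    """One-at-a-time variant: rank a single L-free constraint per iteration."""
    remaining, active, ranking = list(constraints), list(ercs), []
    while remaining:
        free = unblocked(remaining, active)
        if not free:
            raise ValueError("no ranking is consistent with these ERCs")
        top = choose(free)  # arbitrary choice among the L-free constraints
        ranking.append(top)
        remaining.remove(top)
        # Strike out the ERCs for which the chosen constraint has a W.
        active = [erc for erc in active if erc[top] != "W"]
    return ranking


def rcd_strata(constraints: List[str], ercs: List[ERC]) -> List[List[str]]:
    """Stratum variant: place all L-free constraints in one stratum at once."""
    remaining, active, strata = list(constraints), list(ercs), []
    while remaining:
        free = unblocked(remaining, active)
        if not free:
            raise ValueError("no ranking is consistent with these ERCs")
        strata.append(free)
        remaining = [c for c in remaining if c not in free]
        active = [erc for erc in active if all(erc[c] != "W" for c in free)]
    return strata


# Hypothetical ERC matrix (an assumption, not the article's matrix (i)).
CONSTRAINTS = ["C1", "C2", "C3"]
ERCS = [{"C1": "W", "C2": "e", "C3": "L"}]

preference = ["C1", "C3", "C2"]  # one possible sequence of arbitrary choices
strata = rcd_strata(CONSTRAINTS, ERCS)
ranking = rcd_single(CONSTRAINTS, ERCS,
                     lambda free: min(free, key=preference.index))
print(strata)
print(ranking)
```

Under these assumptions, the stratum variant can only be refined into rankings that keep C2 above C3, whereas the one-at-a-time variant can also reach C1 ≫ C3 ≫ C2, because C3 becomes L-free as soon as C1 has been ranked.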

10 In other words, L is the range of the corresponding OT grammar OT, construed as a function from underlying forms into surface forms.

11 The consistency problem corresponds to Empirical Risk Minimization in Statistical Learning Theory, while Prince and Tesar’s (2004) problem (31) corresponds to a regularized version thereof, with regularization function μ.

12 Prince and Tesar (2004) actually define their R-measure as μ̂ in (i), rather than as in (30). In other words, they count the number of markedness constraints ranked above (rather than underneath) each faithfulness constraint. A ranking ≫ is thus expected to generate a small language provided it has a large R-measure μ̂(≫). They thus restate condition (31bii) in terms of μ̂ by requiring that there be no other consistent ranking with a larger R-measure μ̂, so that the OT subset problem becomes a maximization (rather than a minimization) problem.

(i) μ̂(≫) = |{(F, M) ∈ F × M | M ≫ F}|

But the two formulations are equivalent. In fact, the number of markedness constraints ranked underneath a certain faithfulness constraint coincides with the total number of markedness constraints minus the number of those that are ranked above that faithfulness constraint. Thus, the two R-measures μ and μ̂ are connected as in (ii) for any ranking ≫, where m is the total number of markedness constraints and f is the total number of faithfulness constraints. As mf is a constant, the identity (ii) says that minimizing μ or maximizing μ̂ yields the same set of rankings, so that the two formulations are equivalent. Departing from Prince and Tesar’s (2004) original formulation, I have chosen (30) over (i) with an eye toward the proof presented in appendix C in the online supplementary materials.

(ii) μ(≫) = mf − μ̂(≫)

Note, though, that these two R-measures (30) and (i) are not equivalent within the framework actually adopted by Prince and Tesar (2004). In fact, they work within the slightly extended OT framework of T&S, which relaxes the notion of ranking into that of a stratified hierarchy, which allows for multiple constraints to be assigned to the same stratum. The reason for this is that Prince and Tesar (2004) work with a version of RCD slightly different from the one described in section 2.6, which returns stratified hierarchies rather than rankings (see footnote 9 for some details). The R-measures defined in (30) and (i) for a ranking ≫ trivially extend to the case where ≫ is a stratified hierarchy. But the equivalence in (ii) does not carry over from rankings to stratified hierarchies, as discussed by Prince and Tesar (2004:252–253).
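The relation between the two R-measures on total rankings can be checked by brute force. The following Python sketch is only an illustration under stated assumptions: the function names `mu` and `mu_hat` and the toy constraint names are hypothetical, μ is taken to count the pairs (F, M) with F ranked above M, and μ̂ the pairs with M ranked above F, as described in this note.

```python
from itertools import permutations
from typing import Sequence


def mu(ranking: Sequence[str], faith: Sequence[str], mark: Sequence[str]) -> int:
    """Number of pairs (F, M) with the faithfulness constraint F above M."""
    pos = {c: i for i, c in enumerate(ranking)}  # smaller index = higher rank
    return sum(1 for f in faith for m in mark if pos[f] < pos[m])


def mu_hat(ranking: Sequence[str], faith: Sequence[str], mark: Sequence[str]) -> int:
    """Number of pairs (F, M) with the markedness constraint M above F."""
    pos = {c: i for i, c in enumerate(ranking)}
    return sum(1 for f in faith for m in mark if pos[m] < pos[f])


# Toy constraint set (hypothetical names, used only for the check).
FAITH = ["F1", "F2"]
MARK = ["M1", "M2", "M3"]
mf = len(MARK) * len(FAITH)

# On a total ranking every pair (F, M) is ordered one way or the other,
# so the two counts always add up to the constant mf.
identity_holds = all(
    mu(r, FAITH, MARK) == mf - mu_hat(r, FAITH, MARK)
    for r in permutations(FAITH + MARK)
)
print(identity_holds)
```

The check goes through because a total ranking orders every faithfulness/markedness pair. The sketch does not cover stratified hierarchies, for which the note reports that the equivalence fails.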

13 The maximum candidate set size for problem (33) is defined analogously to (10), as the cardinality max_x |Gen(x)| of the largest candidate set Gen(x) over all underlying forms x. Letting the size of an instance of problem (33) depend on |C|, on the number of underlying forms, and on this maximum ensures that the corresponding decision problem (provided in (86) in appendix D of the online supplementary materials) is in NP, namely, that it admits an efficient verification algorithm. See appendix A for details.

14 Letting the size of an instance of problem (34) depend not only on |C| and |D|, but also on |Gen(D)|, straightforwardly ensures that the corresponding decision problem (provided in (84) in appendix C of the online supplementary materials) is in NP, namely, that it admits an efficient verification algorithm. See appendix A for details.

15 Of course, the subset problem is trivial for data sets that have a unique winner-preferring constraint per underlying/winner/loser form triplet, as those data are OT-consistent with a unique ranking, and thus the subset problem reduces to the consistency problem in this case.

16 In fact, a ranking is consistent with the pair (/ba/, [ba]) provided F1 = Ident[stop-voicing] is ranked above M1 = *[stop-voicing], as in (38a); it is consistent with the pair (/za/, [sa]) provided F2 = Ident[fricative-voicing] is ranked below M2 = *[fricative-voicing], as in (38b); and it is therefore consistent with the pair (/abza/, [abza]) provided both M = Agree and F1 = Ident[stop-voicing] are ranked above M2 = *[fricative-voicing], as in (38c) and (38d).
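The ranking conditions listed in this note can be made concrete by enumerating all total rankings of the five constraints and keeping those that satisfy every condition. The Python sketch below is a brute-force illustration only, not a method from the article. The encoding of each condition as a (higher, lower) pair and the names `REQUIRED` and `is_consistent` are assumptions introduced here.

```python
from itertools import permutations
from typing import Sequence

CONSTRAINTS = [
    "Ident[stop-voicing]",
    "Ident[fricative-voicing]",
    "*[stop-voicing]",
    "*[fricative-voicing]",
    "Agree",
]

# Each pair (higher, lower) requires `higher` to be ranked above `lower`.
REQUIRED = [
    ("Ident[stop-voicing]", "*[stop-voicing]"),            # (/ba/, [ba])
    ("*[fricative-voicing]", "Ident[fricative-voicing]"),  # (/za/, [sa])
    ("Agree", "*[fricative-voicing]"),                     # (/abza/, [abza])
    ("Ident[stop-voicing]", "*[fricative-voicing]"),       # (/abza/, [abza])
]


def is_consistent(ranking: Sequence[str]) -> bool:
    """True if the ranking satisfies every required pairwise ordering."""
    pos = {c: i for i, c in enumerate(ranking)}  # smaller index = higher rank
    return all(pos[high] < pos[low] for high, low in REQUIRED)


consistent = [r for r in permutations(CONSTRAINTS) if is_consistent(r)]
print(len(consistent), "consistent rankings")
for r in consistent:
    print(" >> ".join(r))
```

Several rankings survive the filter, and they differ in where the constraints are placed relative to one another. That spread of consistent rankings is what leaves the subset problem with a nontrivial choice.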

17 I ignore the remaining data pair (/sa/, [sa]) and the ERC for the corresponding underlying/winner/loser form triplet (/sa/, [sa], [za]), as it is consistent with any ranking and thus does not contribute to learning.

18 This result holds for any constraint set, for both plausible and implausible ones. This result fits well with the complexity analysis developed in this article. As is clear from the proofs in appendices C and D in the online supplementary materials, what is driving the intractability result of theorems 3 and 4 is indeed the difficulty of learning the correct relative ranking of the faithfulness constraints. Thus, we expect tractability in those cases where the relative ranking of the faithfulness constraints does not matter.

## References

Angluin, Dana. 1980. Inductive inference of formal languages from positive data. Information and Control 45:117–135.

Barton, G. Edward, Robert Berwick, and Eric Sven Ristad. 1987. Computational complexity and natural language. Cambridge, MA: MIT Press.

Berwick, Robert. 1985. The acquisition of syntactic knowledge. Cambridge, MA: MIT Press.

Boersma, Paul. 1997. How we learn variation, optionality and probability. In Proceedings of the Institute of Phonetic Sciences (IFA) 21, ed. by Rob van Son, 43–58. Amsterdam: University of Amsterdam, Institute of Phonetic Sciences.

Boersma, Paul. 1998. Functional phonology. Doctoral dissertation, University of Amsterdam. The Hague.

Clark, Robin. 1992. The selection of syntactic knowledge. Language Acquisition 2:83–149.

Cohen, William, Robert E. Schapire, and Yoram Singer. 1999. Learning to order things. Journal of Artificial Intelligence Research 10:243–270.

Dell, François. 1981. On the learnability of optional phonological rules. Linguistic Inquiry 12:31–37.

Eisner, Jason. 1997. Efficient generation in primitive Optimality Theory. In Annual Meeting of the Association for Computational Linguistics (ACL) 35, ed. by Philip R. Cohen and Wolfgang Wahlster, 313–320. San Francisco, CA: Morgan Kaufmann/ACL.

Eisner, Jason. 2000. Easy and hard constraint ranking in Optimality Theory. In Finite-State Phonology: ACL Special Interest Group in Computational Phonology (SIGPhon) 5, ed. by Jason Eisner, Lauri Karttunen, and Alain Thériault, 22–33. San Francisco, CA: Morgan Kaufmann/ACL.

Fodor, Janet Dean, and William Gregory Sakas. 2005. The Subset Principle in syntax: Costs of compliance. Journal of Linguistics 41:513–569.

Galil, Zvi, and Nimrod Megiddo. 1977. Cyclic ordering is NP-complete. Theoretical Computer Science 5:179–182.

Hale, Mark, and Charles Reiss. 2003. The Subset Principle in phonology: Why the tabula can’t be rasa. Journal of Linguistics 39:219–244.

Hayes, Bruce. 2004. Phonological acquisition in Optimality Theory: The early stages. In Constraints in phonological acquisition, ed. by René Kager, Joe Pater, and Wim Zonneveld, 158–203. Cambridge: Cambridge University Press.

Heinz, Jeffrey, Gregory M. Kobele, and Jason Riggle. 2009. Evaluating the complexity of Optimality Theory. Linguistic Inquiry 40:277–288.

Heinz, Jeffrey, and Jason Riggle. 2011. Learnability. In Blackwell companion to phonology, ed. by Marc van Oostendorp, Colin J. Ewen, Elizabeth V. Hume, and Keren Rice, 54–78. Oxford: Wiley-Blackwell.

Idsardi, William. 2006. A simple proof that Optimality Theory is computationally intractable. Linguistic Inquiry 37:271–275.

Jarosz, Gaja. 2006. Rich lexicons and restrictive grammars: Maximum Likelihood learning in Optimality Theory. Doctoral dissertation, Johns Hopkins University, Baltimore, MD. Rutgers Optimality Archive, ROA 884. Available at http://roa.rutgers.edu.

Johnson, David S., and Franco P. Preparata. 1978. The densest hemisphere problem. Theoretical Computer Science 6:93–107.

Kager, René. 1999. Optimality Theory. Cambridge: Cambridge University Press.

Legendre, Géraldine, Yoshiro Miyata, and Paul Smolensky. 1998a. Harmonic Grammar: A formal multilevel connectionist theory of linguistic well-formedness: An application. In Proceedings of the 12th annual conference of the Cognitive Science Society, ed. by Morton Ann Gernsbacher and Sharon J. Derry, 884–891. Mahwah, NJ: Lawrence Erlbaum.

Legendre, Géraldine, Yoshiro Miyata, and Paul Smolensky. 1998b. Harmonic Grammar: A formal multilevel connectionist theory of linguistic well-formedness: Theoretical foundations. In Proceedings of the 12th annual conference of the Cognitive Science Society, ed. by Morton Ann Gernsbacher and Sharon J. Derry, 388–395. Mahwah, NJ: Lawrence Erlbaum.

Lombardi, Linda. 1999. Positional faithfulness and voicing assimilation in Optimality Theory. Natural Language and Linguistic Theory 17:267–302.

Magri, Giorgio. 2010. Complexity of the acquisition of phonotactics in Optimality Theory. In ACL Special Interest Group in Computational Morphology and Phonology (SIGMorPhon) 11, ed. by Jeffrey Heinz, Lynne Cahill, and Richard Wicentowski, 19–27. Stroudsburg, PA: Association for Computational Linguistics.

Magri, Giorgio. 2011. An online model of the acquisition of phonotactics within Optimality Theory. In Expanding the space of cognitive science: Proceedings of the 33rd annual conference of the Cognitive Science Society, ed. by Laura Carlson, Christoph Hölscher, and Thomas F. Shipley, 2012–2017. Austin, TX: Cognitive Science Society.

Magri, Giorgio. 2012a. Convergence of error-driven ranking algorithms. Phonology 29:213–269.

Magri, Giorgio. 2012b. The error-driven ranking model of the child acquisition of phonotactics. Ms., Université Paris 8.

Manzini, M. Rita, and Ken Wexler. 1987. Parameters, binding theory, and learnability. Linguistic Inquiry 18:413–444.

Prince, Alan. 2002. Entailed ranking arguments. Ms., Rutgers University, New Brunswick, NJ. Rutgers Optimality Archive, ROA 500. Available at http://roa.rutgers.edu.

Prince, Alan, and Paul Smolensky. 2004. Optimality Theory: Constraint interaction in generative grammar. Oxford: Blackwell. Rutgers Optimality Archive, ROA 537. Available at http://roa.rutgers.edu. Initially published in 1993 as Technical Report CU-CS-696-93, Department of Computer Science, University of Colorado at Boulder; and as Technical Report TR-2, Rutgers Center for Cognitive Science, Rutgers University, New Brunswick, NJ.

Prince, Alan, and Bruce Tesar. 2004. Learning phonotactic distributions. In Constraints in phonological acquisition, ed. by René Kager, Joe Pater, and Wim Zonneveld, 245–291. Cambridge: Cambridge University Press.

Safir, Ken. 1987. In Parameter setting, ed. by Thomas Roeper and Edwin Williams, 77–89. Dordrecht: Reidel.

Tesar, Bruce. 1995. Computational Optimality Theory. Doctoral dissertation, University of Colorado, Boulder. Rutgers Optimality Archive, ROA 90. Available at http://roa.rutgers.edu.

Tesar, Bruce. 2008. Output-driven maps. Ms., Rutgers University, New Brunswick, NJ. Rutgers Optimality Archive, ROA 956. Available at http://roa.rutgers.edu.

Tesar, Bruce, and Paul Smolensky. 1998. Learnability in Optimality Theory. Linguistic Inquiry 29:229–268.

Tesar, Bruce, and Paul Smolensky. 2000. Learnability in Optimality Theory. Cambridge, MA: MIT Press.

Wareham, Harold Todd. 1998. Systematic parameterized complexity analysis in computational phonology. Doctoral dissertation, University of Victoria, British Columbia.

Wexler, Ken, and Rita Manzini. 1987. Parameters and learnability in binding theory. In Parameter setting, ed. by Thomas Roeper and Edwin Williams, 41–76. Dordrecht: Reidel.