The lexicon of a natural language does not contain all of the phonological structures that are grammatical. This presents a fundamental challenge to the learner, who must distinguish linguistically significant restrictions from accidental gaps (Fischer-Jørgensen 1952, Halle 1962, Chomsky and Halle 1965, Pierrehumbert 1994, Frisch and Zawaydeh 2001, Iverson and Salmons 2005, Gorman 2013, Hayes and White 2013). The severity of the challenge depends on the size of the lexicon (Pierrehumbert 2001), the number of sounds and their frequency distribution (Sigurd 1968, Tambovtsev and Martindale 2007), and the complexity of the generalizations that learners must entertain (Pierrehumbert 1994, Hayes and Wilson 2008, Kager and Pater 2012, Jardine and Heinz 2016).
In this squib, we consider the problem that accidental gaps pose for learning phonotactic grammars stated on a single, surface level of representation. While the monostratal approach to phonology has considerable theoretical and computational appeal (Ellison 1993, Bird and Ellison 1994, Scobbie, Coleman, and Bird 1996, Burzio 2002), little previous research has investigated how purely surface-based phonotactic grammars can be learned from natural lexicons (but cf. Hayes and Wilson 2008, Hayes and White 2013). The empirical basis of our study is the sound pattern of South Bolivian Quechua, with particular focus on the allophonic distribution of high and mid vowels. We show that, in characterizing the vowel distribution, a surface-based analysis must resort to generalizations of greater complexity than are needed in traditional accounts that derive outputs from underlying forms. This exacerbates the learning problem, because complex constraints are more likely to be surface-true by chance (i.e., the structures they prohibit are more likely to be accidentally absent from the lexicon). A comprehensive quantitative analysis of the Quechua lexicon and phonotactic system establishes that many accidental gaps of the relevant complexity level do indeed exist.
We propose that, to overcome this problem, surface-based phonotactic models should have two related properties: they should use distinctive features to state constraints at multiple levels of granularity, and they should select constraints of appropriate granularity by statistical comparison of observed and expected frequency distributions. The central idea is that systematic gaps typically belong to statistically robust feature-based classes, whereas accidental gaps are more likely to be featurally isolated and to contain independently rare sounds. A maximum-entropy learning model that incorporates these two properties is shown to be effective at distinguishing systematic and accidental gaps in a whole-language phonotactic analysis of Quechua, outperforming minimally different models that lack features or perform nonstatistical induction.
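The comparison of observed and expected frequencies can be illustrated with a minimal sketch. It is not the maximum-entropy model itself: it assumes a simple independence baseline in which the expected count of a two-segment sequence is the number of adjacent pairs in the lexicon multiplied by the unigram probabilities of the two classes involved. The function name `observed_and_expected`, the toy lexicon, and the segment classes are all hypothetical and are not drawn from the Quechua data.

```python
from collections import Counter


def observed_and_expected(lexicon, first_class, second_class):
    """Return (observed, expected) counts for adjacent pairs whose first
    segment is in first_class and whose second is in second_class.

    Assumption: expected counts come from an independence baseline over
    unigram frequencies, not from a fitted maximum-entropy grammar.
    """
    unigrams = Counter(seg for word in lexicon for seg in word)
    bigrams = Counter(pair for word in lexicon for pair in zip(word, word[1:]))
    n_unigrams = sum(unigrams.values())
    n_bigrams = sum(bigrams.values())

    # Probability that a randomly chosen segment falls in each class.
    p_first = sum(unigrams[s] for s in first_class) / n_unigrams
    p_second = sum(unigrams[s] for s in second_class) / n_unigrams

    observed = sum(
        count for (a, b), count in bigrams.items()
        if a in first_class and b in second_class
    )
    expected = n_bigrams * p_first * p_second
    return observed, expected


# Invented toy lexicon (one character per segment); purely illustrative.
toy_lexicon = ["pata", "tika", "kapi", "pika", "taki", "qapa"]
vowels = {"a", "i"}

# A gap stated over a broad, frequent class: vowel followed by vowel.
vv_observed, vv_expected = observed_and_expected(toy_lexicon, vowels, vowels)

# A gap stated over a single pair containing a rare segment: q followed by i.
qi_observed, qi_expected = observed_and_expected(toy_lexicon, {"q"}, {"i"})

print(vv_observed, vv_expected)
print(qi_observed, qi_expected)
```

In this toy lexicon both sequences are unattested, but the vowel-vowel gap has an expected count of 4.5 under the baseline, whereas the q-i gap has an expected count of only 0.125. The first absence is therefore statistically robust, while the second is about what chance alone would produce for a rare segment in a lexicon of this size.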