Abstract
A central goal of linguistic theory is to find a precise characterization of the notion “possible human language”, in the form of a computational device that is capable of describing all and only the languages that can be acquired by a typically developing human child. The success of recent large language models (LLMs) in NLP applications arguably raises the possibility that LLMs might be computational devices that meet this goal. This would only be the case if, in addition to succeeding in learning human languages, LLMs struggle to learn “impossible” human languages. Kallini et al. (2024) conducted experiments aiming to test this by training GPT-2 on a variety of synthetic languages, and found that it learns some more successfully than others. They present these asymmetries as support for the idea that LLMs’ inductive biases align with what is regarded as “possible” for human languages, but the most significant comparison has a confound that makes this conclusion unwarranted.
A central goal of linguistic theory, since at least Chomsky (1965, p. 25), has been to find a precise characterization of the notion “possible human language”. Researchers have pursued this goal by attempting to identify a kind of computational device that is capable of describing all and only the possible human languages, i.e., those languages that can be acquired by a typically developing human child. To the extent that a particular kind of computational device meets this goal, it constitutes a plausible hypothesis about the mental machinery that underlies the human capacity for language.
The success of recent large language models (LLMs) in NLP applications raises the possibility that LLMs might be devices that meet this goal. They have been found to be remarkably successful at tasks that, let us grant—controversially, but innocuously for present purposes—require learning certain human languages in a relevant sense. The other side of the coin, however, is whether LLMs are similarly successful at learning languages that humans cannot, i.e., “humanly impossible languages”. If they are, this would tell against the hypothesis that human linguistic capacities take a form that resembles an LLM.
Kallini et al. (2024) cite a number of claims to the effect that LLMs will successfully learn such impossible languages, and set out to test this. They develop a set of synthetic languages that are unlike what has been observed in any human language, and find that “GPT-2 struggles to learn impossible languages when compared to English as a control, challenging the core claim” (p. 14691). The most interesting impossible languages, and the ones that Kallini et al. address most extensively, are those that involve count-based rules. Sentences of the language called WordHop, for example, are like sentences of English except that inflectional affixes on verbs are replaced with distinguished marker tokens (🅂 for singular, 🄿 for plural) which appear to the right of the (uninflected) verb, separated by exactly four words; see Table 1. For a minimal comparison with WordHop, Kallini et al. also construct a minor variant of English called NoHop, which uses the same distinguished markers but places them immediately adjacent to the verb.
Table 1: Illustration of how sentences of WordHop and NoHop are derived from English sentences.
|  | Singular agreement example | Plural agreement example |
|---|---|---|
| English | He cleans his very messy bookshelf . | They clean his very messy bookshelf . |
| WordHop | He clean his very messy bookshelf 🅂 . | They clean his very messy bookshelf 🄿 . |
| NoHop | He clean 🅂 his very messy bookshelf . | They clean 🄿 his very messy bookshelf . |
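To make the construction concrete, here is a minimal sketch (in Python) of how a WordHop or NoHop string can be derived from an English sentence. It is only an approximation of Kallini et al.’s actual pipeline: the tokenization, the identification of the inflected verb, and the handling of sentences too short for the four-word offset (which they excluded) are all simplified, and the function names are my own.

```python
# Illustrative sketch only; Kallini et al.'s implementation details differ.
SG, PL = "🅂", "🄿"  # the singular and plural marker tokens

def derive(tokens, verb_index, number, offset):
    """Strip the verb's inflection and insert the number marker
    `offset` words to the verb's right (0 = NoHop, 4 = WordHop)."""
    verb = tokens[verb_index]
    # Crude de-inflection, adequate for these examples only.
    stem = verb[:-1] if number == "sg" and verb.endswith("s") else verb
    out = tokens[:verb_index] + [stem] + tokens[verb_index + 1:]
    out.insert(verb_index + 1 + offset, SG if number == "sg" else PL)
    return " ".join(out)

english = ["He", "cleans", "his", "very", "messy", "bookshelf", "."]
print(derive(english, 1, "sg", offset=4))  # He clean his very messy bookshelf 🅂 .
print(derive(english, 1, "sg", offset=0))  # He clean 🅂 his very messy bookshelf .
```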
It is widely agreed that the count-based placement of the 🅂 and 🄿 markers in WordHop is indeed outside the bounds of “possible human languages” (whereas NoHop, being essentially analogous to English, is not), and Kallini et al. show that GPT-2 is less successful at learning WordHop than NoHop. This finding is presented as the main challenge to the claims that GPT-2 models are insufficiently human-like.
The comparison between WordHop and NoHop, however, does not actually test the critical point. The problem, to a first approximation, is a confound between whether a rule is count-based and whether that rule creates non-adjacent dependencies: The comparison is between adjacency and count-based non-adjacency. The crucial observation that linguists have repeatedly remarked on regarding count-based non-adjacent dependencies is their absence relative to constituency-based non-adjacent dependencies, not relative to adjacent dependencies. The corresponding claim about the human language faculty is that it can naturally accommodate or express constituency-based non-adjacent dependencies to a degree that does not hold for count-based non-adjacent dependencies. It would be interesting to know whether LLMs show this same asymmetry, but a comparison between WordHop and NoHop sheds no light on this question.
In Section 1 I will rehearse some standard arguments illustrating the difference between count-based and constituency-based rules. With some specifics of the relevant phenomena in hand, Section 2 lays out more carefully why the comparison between WordHop and NoHop misses the mark. This logic will lead to some suggestions for more appropriate comparisons in Section 3.
1 Review of the Underlying Issues
The frequently used example of question-formation in English provides a relevant starting point.1 Consider the relationship that the sentences in (1a) and (2a) stand in to their corresponding yes-no questions. The question form of (1a) consists of the same words rearranged, as in (1b); we can describe this by saying that the word “will” has been displaced to the front of the sentence. One could imagine that this was an instance of a count-based rule that formed questions by displacing the third word of a sentence, but we can see that this is not the case because applying this rule to (2a) yields (2b). The actual rule under investigation somehow yields (2c), with the sixth word displaced.
(1) a. The dog will bark
    b. Will the dog bark?

(2) a. The dog in the corner will bark
    b. * In the dog the corner will bark?
    c. Will the dog in the corner bark?
Consider now (3a): the question-forming rule displaces neither the third word nor the sixth word (which would yield (3b) and (3c), respectively). What (1b) and (2c) have in common is that in both cases the displaced word is “will”, and this also holds for the desired form (3d), where the displaced “will” was the eighth word. But the rule under investigation somehow excludes moving the other “will” to produce (3e).
(3) a. The dog that will chase the cat will bark
b. * That the dog will chase the cat will bark?
c. * The the dog that will chase cat will bark?
d. Will the dog that will chase the cat bark?
e. * Will the dog that chase the cat will bark?
And it is not as simple as always moving the last/rightmost occurrence of “will” (or more generally, an auxiliary verb), as illustrated by the pattern in (4).
(4) a. The dog in the corner will chase the dog that will bark
b. Will the dog in the corner chase the dog that will bark?
c. * Will the dog in the corner will chase the dog that bark?
The operative rule cannot be formulated in count-based terms, i.e., no description of the form “the nth word of the sentence” or “the nth occurrence of ‘will’ from the end of the sentence” will consistently pick out the word that is to be displaced. The correct generalization can however be expressed in terms of hierarchical constituency: Given the structural analyses in Figure 1 for the declaratives in (3) and (4), the displaced word is the Aux that is the granddaughter of the root S node.
Figure 1: Hierarchical structural descriptions for the declaratives in (3) and (4).
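To sharpen the contrast, the following sketch implements both kinds of rule over toy representations: the count-based rule needs only the word string, while the constituency-based rule needs a tree (here encoded as nested tuples, a simplification of the structures in Figure 1). The encoding and helper names are mine, for illustration only.

```python
# Count-based rule: front the nth word. This derives (1b) from (1a) with
# n = 3, but derives the ungrammatical (2b) from (2a).
def front_nth_word(words, n):
    return [words[n - 1]] + words[:n - 1] + words[n:]

# Constituency-based rule: front the Aux immediately under the root S
# (so its word is the granddaughter of S). Trees are (label, children...)
# tuples; a leaf is a (label, word) pair.
def leaves(tree):
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [children[0]]
    return [word for child in children for word in leaves(child)]

def form_question(tree):
    label, *children = tree
    assert label == "S"
    for child in children:
        if child[0] == "Aux":  # the Aux daughter of the root S
            rest = [c for c in children if c is not child]
            return [child[1]] + leaves(("S", *rest))
    raise ValueError("no Aux immediately under the root S")

# (2a): "the dog in the corner will bark"
sent_2a = ("S",
           ("NP", ("Det", "the"), ("N", "dog"),
                  ("PP", ("P", "in"), ("NP", ("Det", "the"), ("N", "corner")))),
           ("Aux", "will"),
           ("VP", ("V", "bark")))

print(front_nth_word("the dog in the corner will bark".split(), 3))
# ['in', 'the', 'dog', 'the', 'corner', 'will', 'bark']   = (2b), ungrammatical
print(form_question(sent_2a))
# ['will', 'the', 'dog', 'in', 'the', 'corner', 'bark']   = (2c)
```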
This example from English is entirely representative: Patterns like this that conform to a constituency-based rule, but where no count-based characterization has been found, are ubiquitous in natural languages. And the reverse situation, where a pattern follows a count-based rule but has no constituency-based characterization, is unheard of. The conventional linguistic explanation for this striking asymmetry is that (languages with) count-based rules are “humanly impossible”—outside the capacity of the mental faculties that are recruited in naturalistic language development.2 Of course, given a simple enough artificial grammar-learning experiment, a human may show some success at learning and applying a count-based rule, perhaps by recruiting other mental faculties to the task; somewhat similarly, a proponent of the idea that LLMs embody a human-like ill-suitedness to count-based rules is not committed to the prediction that an LLM will always show zero evidence of having extracted any count-based rule from training data. Rather than any raw measure of successful learning of any single kind of rule, the critical issue is an asymmetry between count-based and constituency-based rules.
Testing for such an asymmetry obviously requires controlling for other factors. While the rule for the placement of the 🅂 and 🄿 markers in Kallini et al.’s WordHop is a canonical example of a count-based rule—the kind that turns out to be insufficient to describe the pattern in (1)–(4)—the rule for placing these markers in NoHop is not an appropriately representative constituency-based rule to compare it against. The NoHop rule is extremely simple: The marker is placed immediately after the verb. It’s true that the full-fledged English system of verbal inflections involves crucially constituency-based rules, which are in fact closely intertwined with the phenomena in (1)–(4) above, and one of the configurations that this system produces is the one illustrated in Table 1, with the inflected verb “cleans”. But the constituency-based parts of that system are not probed by a comparison between WordHop and NoHop, which differ only in whether the 🅂 and 🄿 markers are separated from the verb by four words or zero words.
2 Constituency and English Verbal Inflections
A mistaken impression that NoHop can serve as a representative of constituency-based rules might arise, in part, from the fact that the behavior of verbal inflections is intertwined with the question-forming rule that is used in the classical illustration of constituency-sensitivity rehearsed in Section 1.
This connection can be established by observing that these inflections (e.g., the suffixes in “cleans” and “cleaned”) do not co-occur with words like “will” that are displaced by the question-forming rule. A finite clause must include either one of these inflections or a word that behaves like “will” (e.g. “may”, “must”, “can”), but not both.
(5) a. * He clean
    b. He will clean
    c. He may clean

(6) a. He cleans
    b. * He will cleans
    c. * He may cleans

(7) a. He cleaned
    b. * He will cleaned
    c. * He may cleaned
So we have identified a three-way dependency between (i) the sentence-initial position occupied by “will” in the questions in Section 1, (ii) the position occupied by “will” in non-questions, in Section 1 and in (5)–(7), and (iii) the position occupied by the inflectional affixes in (5)–(7). This can be formalized in various ways (see Chomsky [1957] for the original analysis3); the diagram in (8) is precise enough for our purposes.

(8) [Diagram of the three interdependent positions (i), (ii), and (iii); not reproduced here.]
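As a minimal illustration of the complementary distribution in (5)–(7) (setting aside questions and position (i) for the moment), the toy checker below treats a clause as well formed just in case it contains exactly one finiteness exponent: a modal in position (ii) or an inflection on the verb in position (iii). The word lists are assumptions covering only these examples.

```python
MODALS = {"will", "may", "must", "can"}   # words occupying position (ii)
INFLECTED = {"cleans", "cleaned"}         # toy lexicon: verb + affix in position (iii)

def finite_ok(tokens):
    """Exactly one finiteness exponent per clause: a modal or an
    inflected verb, but not both and not neither."""
    return sum(t in MODALS or t in INFLECTED for t in tokens) == 1

for s in ["He clean", "He will clean", "He cleans", "He will cleans"]:
    print(f"{s!r}: {finite_ok(s.split())}")
# 'He clean': False        = (5a), ungrammatical
# 'He will clean': True    = (5b)
# 'He cleans': True        = (6a)
# 'He will cleans': False  = (6b), ungrammatical
```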
To complete the picture, notice that the question-forming rule does not make any distinction between the affixes that appear in position (iii) in declaratives and the words like “will” that appear in position (ii): the affixes are also displaced to position (i) in questions, where their pronunciation is supported by a form of the dummy verb “do”.
(9) a. Does he clean? (cf. (6a))
    b. Did he clean? (cf. (7a))
    c. Will he clean? (cf. (5b))
    d. May he clean? (cf. (5c))
No matter how these details are formalized, the crucial and uncontroversial point is that these three interdependent positions are identified in constituent-based terms, not count-based terms. We saw in Section 1 that the relationship between positions (i) and (ii) is not defined via a number of intervening words, but rather with reference to the hierarchical structure. Similarly, although the word that an affix in position (iii) attaches to has been adjacent to position (ii) in all the examples so far, this is not true in general: Additional words can intervene here too, as illustrated by (10). The presence of a direct object after the verb in these sentences also demonstrates that position (iii) cannot be defined linearly as “the end of the string”.
(10) a. He will without doubt clean his very messy bookshelf.
b. He without doubt cleans his very messy bookshelf.
Furthermore, although the discussion in Section 1 emphasized only the hierarchical determination of the auxiliary that should be displaced to the front of the sentence, this target position is in fact defined in hierarchical terms too: (11) shows examples of questions where “will” is in position (i) despite not being sentence-initial.4
(11) a. Which very messy bookshelf will he clean?
b. How will he clean his very messy bookshelf?
c. Though his bookshelf is very messy, will he clean it?
Another way in which English verbal inflections are intertwined with crucially hierarchical notions concerns number agreement with the subject; recall the two columns in Table 1. There is a single hierarchically defined position that the agreement-controlling noun “gift(s)” occupies in all of the examples in (12)–(13). The rule needs to pick out the second word (and the first of the two nouns) in (12), but the third word (and the second of the two nouns) in (13), so again no count-based formulation is possible.
(12) a. The gift from the man
     b. The gift from the men
     c. The gifts from the man
     d. The gifts from the men

(13) a. The man’s gift
     b. The men’s gift
     c. The man’s gifts
     d. The men’s gifts
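The point can be restated computationally: no fixed string index picks out the controller across (12)–(13), but a trivial head-finding rule stated over the subject’s tree does. The tree encodings below are toy assumptions sufficient for these examples only.

```python
def head_noun(np):
    """Return the head N of an NP: the N among its immediate children,
    skipping possessor NPs and PP modifiers. A toy head-finder that is
    adequate only for the examples in (12)-(13)."""
    label, *children = np
    for child in children:
        if child[0] == "N":
            return child[1]
    raise ValueError("no head noun")

np_12a = ("NP", ("Det", "the"), ("N", "gift"),
          ("PP", ("P", "from"), ("NP", ("Det", "the"), ("N", "man"))))
np_13a = ("NP", ("NP", ("Det", "the"), ("N", "man's")), ("N", "gift"))

print(head_noun(np_12a), head_noun(np_13a))  # gift gift
# By string position the controller is word 2 in (12a) but word 3 in (13a).
```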
Both of these phenomena have (with good reason) been prominent test cases in work investigating connectionist systems’ treatment of constituency-based generalizations. Studies using the question-forming rule as a probe into this issue include Frank and Mathis (2007), McCoy, Frank, and Linzen (2020), and Warstadt and Bowman (2020), and those using subject-verb agreement include Linzen, Dupoux, and Goldberg (2016), Kuncoro et al. (2018), and Lakretz et al. (2021). And as illustrated in this section, the constituency-sensitive rules underlying both of these phenomena bear on the distribution of English inflected verb forms (e.g., “cleans” and “cleaned”) that Kallini et al. manipulate in order to create WordHop and NoHop. But sentences with those inflected verb forms are a shared “starting point” for these two artificial languages, which differ only in whether the 🅂 and 🄿 markers occur in the hierarchically defined position (iii) or at a count-based offset from that position. The constituency-based patterns in which verbal inflections participate—the three-way dependency in (8), and the hierarchy-sensitive agreement in (12)–(13)—are irrelevant for any comparison between WordHop and NoHop. WordHop contains just as much constituency-based question-formation, and just as much hierarchy-sensitive agreement, as NoHop does. A comparison between the two just amounts to a comparison between the count-based displacement in WordHop, and the absence of any analogous displacement in NoHop.
3 Towards a Better Comparison: Counting vs. Constituency
The problem with the comparison between WordHop and NoHop is that the count-based rule in WordHop is not the counterpart of any constituency-based rule in NoHop. There are two ways we might seek to rectify this. The first is to keep WordHop as our representative count-based language, and introduce a constituency-based rule to be the necessary counterpart: Compare WordHop, where the 🅂 and 🄿 markers are placed at a count-based offset from position (iii), against a new synthetic language where these markers are placed at a constituency-based offset from position (iii). The second possibility is to keep NoHop as our representative constituency-based language, and replace one of the constituency-based rules governing the placement of the 🅂 and 🄿 markers with a count-based rule. I consider both routes here, but my aim is only to clarify the logic of what is needed, not to fully resolve all the issues that arise.
As a constituency-based counterpart to WordHop’s count-based rule, suppose we formulate a rule where markers are placed at the right edge of the sister constituent of position (iii)’s parent V node; this will be the right edge of the direct object, in many cases. (No such constituent is shown in (8), but notice the relevant NP constituents in Figure 1.) The resulting language would be the constituency-based side of the comparison illustrated in Table 2, where the count-based side is unchanged from WordHop.
Table 2: A comparison between WordHop and a language with a constituency-based rule.
| Count-based (= WordHop) | Constituency-based |
|---|---|
| He clean his very messy bookshelf 🅂 | He clean his very messy bookshelf 🅂 |
| He clean the bookshelf with glee 🅂 | He clean the bookshelf 🅂 with glee |
| He clean it with a big 🅂 red broom | He clean it 🅂 with a big red broom |
| He clean the bookshelf that is 🅂 messy | He clean the bookshelf that is messy 🅂 |
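A sketch of the two placement rules compared in Table 2. The count-based side needs only the token string; the constituency-based side needs to know the extent of the verb’s sister constituent, which is supplied here by hand (standing in for a parser or treebank annotation). The function names and interface are illustrative assumptions.

```python
SG = "🅂"

def wordhop_marker(tokens, verb_index, offset=4):
    """Count-based (WordHop): marker `offset` words to the verb's right."""
    out = list(tokens)
    out.insert(verb_index + 1 + offset, SG)
    return " ".join(out)

def sister_edge_marker(tokens, object_end):
    """Constituency-based: marker at the right edge of the verb's sister
    constituent (the direct object), whose last word is at `object_end`."""
    out = list(tokens)
    out.insert(object_end + 1, SG)
    return " ".join(out)

tokens = "He clean the bookshelf with glee".split()
print(wordhop_marker(tokens, verb_index=1))      # He clean the bookshelf with glee 🅂
print(sister_edge_marker(tokens, object_end=3))  # He clean the bookshelf 🅂 with glee
```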
One challenge here is that synthesizing examples of this constituency-based pattern requires determining what counts as a sister of the relevant V node, which will sometimes be controversial. For example, one would need to decide on an appropriate structure for verb-particle constructions such as “look up the number” (e.g., Johnson 1991, pp. 590–595), and whether the arguably subcategorized adverb in “behave well” is in the position of a typical object NP (e.g., McConnell-Ginet 1982, pp. 164–166).5
A more subtle concern is whether the constituency-based pattern in Table 2 is necessarily describable only in terms of a constituency-based offset from position (iii), or whether it has an alternative characterization in terms of a constituency-based offset from position (ii). If the marker position in this new language were definable in hierarchical terms relative to position (ii), then it would be no better than NoHop: the comparison in Table 2 would again pit the composition of a constituency-based offset and a count-based offset from position (ii) against a purely constituency-based offset from position (ii). The underlying question here is whether the composition of two constituency-based relations is always another valid constituency-based relation. The answer will depend on the details of one’s theory of linguistically possible dependencies, which remains an active research topic. (It may bear repeating here that the exclusion of count-based dependencies is not one of the points of disagreement.)
Consider now the other route, where we pit a count-based rule against one of the existing constituency-based rules underlying NoHop. Let’s suppose the relevant count-based rule placed the 🅂 and 🄿 markers at a four-word offset from position (ii) (i.e., the position of auxiliary verbs in declaratives, typically the right edge of the subject), as a counterpart to the hierarchically defined relationship between position (ii) and position (iii). This comparison is illustrated in Table 3. Synthesizing the count-based examples here only requires identifying position (ii), which is likely less controversial than the issues that arose for Table 2 regarding sister constituents of the verb.
Table 3: A comparison between NoHop and a language with a count-based rule.
| Count-based | Constituency-based (= NoHop) |
|---|---|
| He clean his messy bookshelf 🅂 | He clean 🅂 his messy bookshelf |
| He always clean his messy 🅂 bookshelf | He always clean 🅂 his messy bookshelf |
| He without doubt clean it 🅂 | He without doubt clean 🅂 it |
| He clean it with a 🅂 broom | He clean 🅂 it with a broom |
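Analogously, a sketch of the comparison in Table 3: the count-based rule places the marker at a four-word offset from position (ii) (taken here to be the right edge of the subject, supplied by hand), while NoHop places it immediately after the verb. Again the interface is an illustrative assumption.

```python
SG = "🅂"

def count_from_subject(tokens, subject_end, offset=4):
    """Count-based side of Table 3: marker `offset` words to the right of
    position (ii), the right edge of the subject."""
    out = list(tokens)
    out.insert(subject_end + 1 + offset, SG)
    return " ".join(out)

def nohop_marker(tokens, verb_index):
    """Constituency-based side (NoHop): marker immediately after the verb."""
    out = list(tokens)
    out.insert(verb_index + 1, SG)
    return " ".join(out)

tokens = "He without doubt clean it".split()
print(count_from_subject(tokens, subject_end=0))  # He without doubt clean it 🅂
print(nohop_marker(tokens, verb_index=3))         # He without doubt clean 🅂 it
```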
A questionable aspect of the comparison in Table 3 is that, in the constituency-based pattern, the word immediately preceding the marker is always of the same category (namely, a verb), whereas in the count-based pattern the words preceding the marker are heterogeneous in syntactic category. (This is also a characteristic of the comparison between WordHop and NoHop.) This could be thought to make the constituency-based pattern more “predictable” or “simple” in a sense that we would like to control for. Notice that this consistency of an adjacent category is not a general property of constituency-based rules: In the constituency-based pattern in Table 2 the marker follows “bookshelf”, “it”, and “messy”, which belong to distinct syntactic categories. Rather it is a consequence of the fact that the rule relating position (ii) and position (iii) in English (“affix hopping”) is somewhat anomalous in ways that lead to divided opinions over whether it is best considered a morphological or syntactic rule (e.g., Halle and Marantz 1993, pp. 134–138; Embick and Noyer 2001, pp. 584–591).
As mentioned above, I make no attempt to resolve all these issues here; the main goal of presenting Table 2 and Table 3 is to lay out the logic of what would make an informative comparison between count-based and constituency-based rules, and in doing so clarify the earlier critiques of the comparison that Kallini et al. report.
4 Conclusion
In natural languages, words that are linked by some grammatical dependency do not always appear adjacent to each other. What linguists have taken to be striking is that the rules governing these non-adjacent configurations of co-dependent words are never describable in terms of (relative) numerical positions in the string; instead, the positions involved are characterized in constituency-based terms. This is hypothesized to be a consequence of an important difference in the status of count-based versus constituency-based rules in the human mind. Kallini et al. present their comparison between WordHop and NoHop as a test of whether GPT-2 shows an analogous asymmetry, but these two artificial languages do not differ in the appropriate way for this interpretation: The count-based rule in WordHop has no counterpart (constituency-based or otherwise) in NoHop, and so differences in learning success reflect the presence of this additional rule, not an asymmetry between two kinds of rules.
Of course, nothing I have said amounts to any claim about the underlying question of whether an LLM might exhibit a human-like asymmetry between count-based and constituency-based rules. The claim here is just that the experiments reported by Kallini et al. leave the issue untouched.
Notes
1. This argument has appeared in numerous places, virtually unchanged, going back to at least Chomsky (1971, pp. 26–29). Freidin (1991) gives a version that emphasizes the contrast with count-based rules. Other sources include Chomsky (1975, pp. 30–33), Chomsky (1980, pp. 39–40), and Chomsky (1988, pp. 41–45). For textbook expositions, see, e.g., Akmajian et al. (2001, pp. 156–168), Lasnik, Depiante, and Stepanov (2000, pp. 5–7), and Radford (1988, pp. 31–34). Many of these discuss this question-formation rule as part of a “poverty of stimulus” argument, which need not concern us here: What’s relevant here is just the initial point that linguists can test and disprove hypothesized count-based rules, not the subsequent question of how or why language-learners converge on the non-count-based rules that they do.

2. The idea is not that a count-based language would “die out” because of a failure on the part of human learners to perpetuate it; rather, the idea is that no human’s linguistic development would ever give rise to such a language in the first place.

4. Relevant examples here are restricted by the fact that, in most varieties of English, subject-auxiliary inversion only occurs in matrix clauses. In some varieties spoken in Ireland, for example, the same operation applies in embedded clauses, yielding examples like “I wonder will he clean it?” (McCloskey 1992, 2006; Henry 1995).

5. Under any reasonable assumptions there will be many English examples where no such sister constituent exists, and these would need to be excluded—just as Kallini et al. excluded sentences where an inflected verb was too close to the right edge for their WordHop rule to apply.