Abstract
Formal constraints on crossing dependencies have played a large role in research on the formal complexity of natural language grammars and parsing. Here we ask whether the apparent evidence for constraints on crossing dependencies in treebanks might arise because of independent constraints on trees, such as low arity and dependency length minimization. We address this question using two sets of experiments. In Experiment 1, we compare the distribution of formal properties of crossing dependencies, such as gap degree, between real trees and baseline trees matched for rate of crossing dependencies and various other properties. In Experiment 2, we model whether two dependencies cross, given certain psycholinguistic properties of the dependencies. We find surprisingly weak evidence for constraints originating from the mild context-sensitivity literature (gap degree and well-nestedness) beyond what can be explained by constraints on rate of crossing dependencies, topological properties of the trees, and dependency length. However, measures that have emerged from the parsing literature (e.g., edge degree, end-point crossings, and heads’ depth difference) differ strongly between real and random trees. Modeling results show that cognitive metrics relating to information locality and working-memory limitations affect whether two dependencies cross or not, but they do not fully explain the distribution of crossing dependencies in natural languages. Together these results suggest that crossing constraints are better characterized by processing pressures than by mildly context-sensitive constraints.
1 Introduction
The syntactic structure of natural language sentences can be captured to a large extent using dependency trees: directed trees drawn over words indicating what words are dependent on what other words (Tesnière 1959; Hays 1964; Mel’čuk 1988; Hudson 1990; Nivre 2015). An example is shown in Figure 1. A number of key formal questions in linguistics boil down to questions about the structure of these dependency trees. In particular, recent work has concluded that the characterization of natural language in formal language theory depends on the constraints that can be placed on crossing dependencies in dependency trees (Kuhlmann 2007).
Here we use recently available massively crosslinguistic dependency treebanks (Nivre et al. 2015; Gerdes et al. 2018, 2019) to take up the question of what distinguishes natural language dependency trees, in terms of crossing dependencies, within the space of all possible tree structures. We investigate two kinds of proposed constraints on dependency trees. First, we investigate a set of graph-theoretic constraints originating in the formal language theory and dependency parsing literatures, which bear on the location of natural language within the Chomsky hierarchy. Comparing natural language dependency trees to random trees of various kinds, we find little evidence that linguistic trees are constrained by the kinds of formal properties studied in the formal language theory literature, but we do find strong evidence for under-studied constraints originating from the parsing literature. Second, we explore a set of performance-based and psycholinguistically motivated soft constraints, motivated in terms of empirically measured human online processing difficulty, finding evidence that these constrain crossing dependencies.
1.1 Background
The attempt to characterize the complexity of natural language in terms of formal language theory has been an extraordinarily productive enterprise joining linguistics, computer science, and mathematics (Chomsky 1956; Chomsky and Schützenberger 1963; Hopcroft and Ullman 1979). In recent decades, a consensus has emerged that the syntactic structure of natural languages is well characterized in terms of the mildly context-sensitive hierarchy of languages (Weir 1988; Joshi, Shanker, and Weir 1991; Michaelis 1998; Kuhlmann 2013), a complexity class lying between context-free and context-sensitive and characterized by formal restrictions on various kinds of discontinuity in constituents. In dependency frameworks, these discontinuous constituents correspond to crossing dependencies (see Figure 1 for a simple example). Therefore, formal restrictions on discontinuous constituents correspond to formal restrictions on crossing dependencies in the ordered dependency tree (Kuhlmann 2013).
A number of formal restrictions on crossing dependencies have been proposed in the last 20 years, going beyond the simple observation that crossing dependencies are rare (Havelka 2007; Ferrer-i-Cancho, Gómez-Rodríguez, and Esteban 2018). We call these formal constraints on crossing dependencies crossing constraints. For example, Kuhlmann (2013) has proposed that dependency trees have limited gap degree and are usually well-nested (see Figure 2b). Pitler, Kannan, and Marcus (2013) propose that crossing dependency configurations have a property called 1-end-point-crossing. Other formal restrictions such as edge degree, multiplanarity, and heads’ depth difference have also been proposed (Yli-Jyrä 2003; Kuhlmann and Nivre 2006; Nivre 2007; Yadav, Vaidya, and Husain 2017). Among these crossing constraints, the constraint on gap degree is especially interesting, because gap degree defines the position of a formal language in the mildly context-sensitive hierarchy: A mildly context-sensitive language is defined by a finite upper bound on gap degree, with gap degree zero corresponding to a context-free grammar (Kuhlmann 2013; Marcus 1965).
The empirical arguments that crossing dependencies are constrained by factors such as gap degree have typically come from demonstrations that crossing dependencies in a large number of observed trees in dependency corpora rarely or never violate the constraints (e.g., Kuhlmann and Nivre 2006; Havelka 2007). For example, Kuhlmann and Nivre (2006) show that only 0.17% of analyses in the Danish Dependency Treebank have gap degree 2 and that 99% of all non-projective structures are well-nested. These findings do not rule out the possibility that crossing constraints might manifest as epiphenomena of other, more general properties of dependency trees.
To appreciate this possibility, consider gap degree as an example. Gap degree is defined as the number of discontinuities in the projection of a node (see Figure 2 for an example), so it is upper-bounded by the number of discontinuities in the sentence. Given that crossing dependencies / discontinuous constituents are rare (Ferrer-i-Cancho, Gómez-Rodríguez, and Esteban 2018), we might expect to measure a low gap degree even if there is no true causally active constraint against gap degree.
This observation opens the possibility that crossing constraints such as gap degree, edge degree, and so on, could be epiphenomena of generic graph-theoretic properties of dependency trees, such as their height, arity, and so forth, and that together these generic factors drive the observable formal properties of crossing dependencies in natural language. If true, this would mean that apparent bounds on gap degree are accidental, and that formalisms such as mildly context-sensitive grammars fit linguistic data only because the structures that would violate them are rare by chance. It is this possibility that we explore in depth in Section 2.
Formal constraints such as mild context-sensitivity are usually associated with a competence-based approach to defining the generative capacity of language, where it is posited that there are (possibly innate) formal constraints on possible mental grammars. Such an approach can be evaluated against a performance-based approach. For example, a well-known cross-linguistic phenomenon termed syntactic islands is typically explained via constraints on structural configurations (Chomsky 1981). Broadly construed, “islands” prohibit formation of certain crossing dependencies and hence are important in circumscribing the formal domain of natural language. Notwithstanding the competence-based explanation, it has been argued that island constraints arise not due to structural ill-formedness, but rather due to constraints on human online language processing (e.g., Hofmeister and Sag 2010).
In a similar vein, the distribution of crossing dependencies in natural languages could also be understood in terms of processing-related constraints. It is a well-known fact that, cross-linguistically, simple linguistic codes are more frequent, while complex linguistic codes are rare (e.g., Zipf 1949; Mahowald et al. 2013; Piantadosi 2014; Ferrer-i-Cancho et al. 2013; Ferrer-i-Cancho, Bentz, and Seguin 2020). Such a pattern has been argued to highlight the communicative efficiency of natural language (Jaeger and Tily 2011; Gibson et al. 2019). On this account, the restrictions on crossing dependencies in natural language could arise because such syntactic configurations are difficult (but not impossible) to produce and comprehend (Bresnan et al. 1982; Ades and Steedman 1982; Bach, Brown, and Marslen-Wilson 1986; Joshi 1990; Ferrer-i-Cancho 2006, 2014; Gómez-Rodríguez and Ferrer-i-Cancho 2017; Gómez-Rodríguez, Christiansen, and Ferrer-i-Cancho 2019). For example, given the incremental nature of language production (e.g., Ferreira and Henderson 1998), it could be assumed that production of crossing dependencies, which necessarily involves a discontinuity in a phrasal boundary, would incur an increased processing cost compared with non-crossing dependencies. Although there has been some experimental work investigating the processing cost incurred during comprehension of sentences with crossing dependencies (e.g., Levy et al. 2012; Staub et al. 2018), a corpus-based empirical study investigating the influence of psycholinguistic factors beyond dependency distance on crossing dependencies is lacking. If certain psycholinguistic factors (e.g., working-memory constraints) can explain the occurrence of crossing dependencies, that could suggest a functional motivation for why such dependencies are rare in natural language.
In this article, we conduct a cross-linguistic corpus investigation into crossing dependencies both from the perspective of formal crossing constraints as well as from the perspective of processing constraints. In the first set of experiments we investigate if well-known crossing constraints (e.g., gap-degree, edge-degree) can account for crossing dependencies attested in various dependency treebanks. In particular, Section 2 compares the formal properties of crossing dependencies in real dependency trees with various random baselines matched in the number of crossing dependencies and other global graph-theoretic factors. In the second set of experiments, we conduct preliminary analyses to investigate whether certain psycholinguistic factors can account for crossing dependencies attested in various dependency treebanks. In particular, Section 3 models the tendency of two dependency arcs to cross given certain psycholinguistic metrics that are computed using the local configuration of the two arcs.
2 Global Graph-Theoretic Factors
In this section we investigate evidence for global graph-theoretic constraints on crossing dependencies originating from the literatures on formal language theory and dependency parsing. Our goal is to determine which, if any, of these constraints really distinguish natural language dependency trees from the space of all possible directed trees. To do so, we study how often these constraints are violated in (1) real dependency trees and (2) a collection of random baseline trees controlling various properties of trees.
2.1 Background
Chomsky (1956, 1957) first posed the question of how to characterize grammars of natural languages as computational objects, launching a research program that has attempted to describe sets of grammatical sentences using tools from mathematical logic, graph theory, and automaton theory. Since the late 1980s, a consensus has emerged that natural language is well characterized as falling within the mildly context-sensitive class of languages, a formal language class that is larger than the context-free languages, yet without taking advantage of the full expressive power of context-sensitive languages (Weir 1988; Joshi, Shanker, and Weir 1991).
These mildly context-sensitive languages are defined by constraints which turn out to be equivalent to constraints on crossing dependencies. Therefore, by answering the question of what constrains crossing dependencies, we can make progress toward understanding human languages at a computational level.
Crossing dependencies are related to formal language theory because they correspond to displacement phenomena in languages—structures that cannot be captured by a context-free grammar. Across grammatical formalisms, displacement phenomena are modeled using a distinct kind of structure from non-crossing dependencies. Displacement phenomena (encompassing both extraposition and wh-dependencies) have been modeled in various ways in different syntactic frameworks:
In the Minimalist tradition, non-crossing dependencies correspond to structures that can be built by the computational operation merge (or “external merge”), while crossing dependencies arise from the action of a distinct structure-building operation move (or “internal merge”) (Chomsky 1995; Stabler 1997; Michaelis 1998). A grammar with only merge would generate context-free languages and projective dependency trees.
In phrase structure-based frameworks such as Lexical Functional Grammar (Bresnan 1982), Head-driven Phrase Structure Grammar (Pollard and Sag 1994), and Combinatory Categorial Grammar (Steedman and Baldridge 2011), displacement phenomena are handled using phrase structure rules defined in a way that allows information to percolate through a tree in a non-local manner, a mechanism called “slash-passing.”
Some theories of dependency grammar invoke the idea that each word in a sentence has both a syntactic “head” and a syntactic “governor,” which may coincide. Arcs drawn from governors to dependents may cross, but arcs drawn from heads to dependents never cross (Groß and Osborne 2009). Crossing dependencies correspond to cases where the syntactic governor is distinct from the syntactic head. What we are calling “heads” in this article would be “governors” in such theories.
These various formalisms allow for crossing dependencies by different mechanisms, yet with constraints which turn out to be similar or equivalent across formalisms. In particular, most mildly context-sensitive formalisms end up instantiating bounds on a quantity called gap degree. This quantity goes by different names depending on the grammar formalism. It is equivalent (up to additive constants) to block degree or fan-out in linear context-free rewriting systems (Kuhlmann 2007, 2013), the number of components in multiple context-free grammars (Seki et al. 1991), the maximal rank of a coupled context-free grammar (Hotz and Pitsch 1996), the number of licensee features in Minimalist Grammars (Michaelis 1998; Boston, Hale, and Kuhlmann 2010), and others. It was first introduced in a dependency framework by Holan et al. (1998), and shown to relate to mild context-sensitivity by Kuhlmann (2007). Some mildly context-sensitive formalisms also induce a constraint called well-nestedness, which can also be reduced to constraints on crossing dependencies (Bodirsky, Kuhlmann, and Möhl 2005).
Crossing constraints have also been of interest for those studying the development of efficient dependency parsing algorithms. Such algorithms are generally only available for trees with constrained crossings. For example, if we assume that all trees are projective, then we can perform exact parsing in time cubic in the sentence length (O(n³)) by reducing the dependency grammar to a lexicalized context-free grammar (Eisner and Satta 1999). If we assume all trees are well-nested and gap degree is bounded, then we can generally parse in polynomial time. Without the constraint of well-nestedness, parsing becomes NP-hard (Satta 1992; Gómez-Rodríguez, Carroll, and Weir 2011).
The parsing literature has also been the source of a number of new formal constraints on crossing dependencies, beyond those introduced in the formal syntax literature. For example, Pitler, Kannan, and Marcus (2013) propose a constraint called 1-end-point-crossing. If we assume that all dependency trees are 1-end-point-crossing, then we can parse in quartic time (O(n⁴)) (see also Gómez-Rodríguez, Shi, and Lee 2018).
2.2 Constraints Considered
Our goal is to determine if there is really evidence for formal graph-theoretic crossing constraints on crossing dependencies in dependency treebanks beyond what can be explained in terms of more generic properties of dependency trees. Below, we list and define the formal crossing constraints that we test. In our terminology we strive to follow Kuhlmann and Nivre (2006).
Gap degree.
The projection of a node X is the ordered list of all the nodes transitively dominated by X plus X itself. For example, Figure 2 highlights the projection of the node Xh. A projection is discontinuous if it forms a discontinuous substring of the sentence. For example, the projection of Xh has two discontinuities, one between Xd and Xh, and another between Xg and Xd. The gap degree of a tree is the largest number of discontinuities in the projection of any node in the tree.
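To make the definition concrete, here is a minimal sketch (an illustration, not the authors' code) that computes gap degree for a dependency tree encoded as a dictionary mapping each 1-based word position to its head position, with 0 marking the root:

```python
def projection(tree, node):
    """All nodes transitively dominated by `node`, plus `node` itself.
    `tree` maps each 1-based word position to its head position (0 = root)."""
    children = {}
    for n, h in tree.items():
        children.setdefault(h, []).append(n)
    stack, proj = [node], {node}
    while stack:
        for child in children.get(stack.pop(), []):
            proj.add(child)
            stack.append(child)
    return sorted(proj)

def num_gaps(positions):
    """Number of discontinuities in a sorted list of word positions."""
    return sum(1 for a, b in zip(positions, positions[1:]) if b - a > 1)

def gap_degree(tree):
    """Largest number of discontinuities in the projection of any node."""
    return max(num_gaps(projection(tree, n)) for n in tree)

# Example: word 3 heads word 1 across word 2, so the projection of node 3
# is [1, 3], which has one discontinuity: gap_degree == 1.
assert gap_degree({1: 3, 2: 4, 3: 4, 4: 0}) == 1
```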
Well-nestedness.
The subtree rooted at a node X is the set of all nodes transitively dominated by X plus X itself. For example, in the dependency trees (a) and (b) in Figure 3, {Xa,Xb,Xe} is the subtree rooted at node Xe, and {Xc,Xd} is the subtree rooted at node Xd. Two subtrees with nodes {P,Q} and {R,S} interleave if the nodes are in the linear order P < R < Q < S. A dependency tree is ill-nested if and only if two of its disjoint subtrees interleave. For example, in Figure 3a, {Xa,Xb,Xe} and {Xc,Xd} are two disjoint subtrees, but they do not interleave because the nodes are in the order Xa < Xb < Xc < Xd < Xe. Therefore, tree (a) is well-nested. In Figure 3b, the disjoint subtrees {Xa,Xb,Xe} and {Xc,Xd} interleave, as the order of the nodes is Xa < Xc < Xb < Xd < Xe. The dashed red arc creates the ill-nestedness. Ill-nestedness implies gap degree ≥ 1.
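Continuing the sketch above (reusing projection()), ill-nestedness can be detected by scanning all pairs of disjoint subtrees for the interleaving pattern P < R < Q < S:

```python
from itertools import combinations

def interleaves(s1, s2):
    """True iff two disjoint position sets alternate as 1-2-1-2 (or 2-1-2-1),
    i.e., there exist p < r < q < s with p, q in one set and r, s in the other."""
    labeled = sorted([(p, 1) for p in s1] + [(p, 2) for p in s2])
    runs = []
    for _, lab in labeled:
        if not runs or runs[-1] != lab:
            runs.append(lab)  # collapse consecutive positions from the same set
    return len(runs) >= 4

def is_well_nested(tree):
    """A tree is ill-nested iff two of its disjoint subtrees interleave."""
    subtrees = [set(projection(tree, n)) for n in tree]
    return not any(not (a & b) and interleaves(a, b)
                   for a, b in combinations(subtrees, 2))
```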
Edge degree.
Let e be the span of a dependency arc from head Xh to dependent Xd: the set of nodes that occur between Xh and Xd in the sentence, which are Xi, Xa, and Xb in Figure 4. The edge degree of a dependency arc is the number of nodes in the span e that are neither transitively dominated by some node in the span e nor transitively dominated by the head Xh (Kuhlmann and Nivre 2006). For example, the arc between Xh and Xd in Figures 4a and 4b has an edge degree of 2, because the nodes Xi and Xb are not dominated by any node in the span e, nor are they dominated by the head Xh. The edge degree of a dependency tree is the highest edge degree among the arcs of the tree.
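A corresponding sketch of edge degree, following the definition above and again reusing projection():

```python
def strict_descendants(tree, node):
    """Nodes strictly dominated by `node`: its projection minus itself."""
    return set(projection(tree, node)) - {node}

def edge_degree(tree):
    """Max over arcs (h, d) of the number of nodes strictly between h and d
    that are dominated neither by h nor by any other node in the span."""
    degree = 0
    for d, h in tree.items():
        if h == 0:
            continue
        span = set(range(min(h, d) + 1, max(h, d)))
        covered = strict_descendants(tree, h)
        for m in span:
            covered |= strict_descendants(tree, m)
        degree = max(degree, len(span - covered))
    return degree
```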
There are cognitive reasons to suspect that edge degree might be limited in natural language. From an online processing perspective, higher edge degree in a subtree results in a need to maintain an unresolved crossing dependency across a longer span of words, which may result in online processing difficulty due to higher working memory load (Gibson 1998).
End-point crossings.
The end-point crossings of a dependency arc is the number of distinct heads of all the edges that cross the arc. More formally, given an arc with span e, the end-point crossings of the arc is defined as the number of distinct heads of the nodes in e that are not part of the projection of Xh nor of the projection of any element of e. The end-point crossings of a tree is the maximum end-point crossings of any arc in the tree. For example, in Figure 4a, the number of heads modified by Xi and Xb is 1 (corresponding to Xj); therefore, the end-point crossings value is 1. In Figure 4b, the number of heads modified by Xi and Xb is 2 (corresponding to Xj and Xr, respectively); therefore, the end-point crossings value is 2.
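The first characterization above (the number of distinct heads of the arcs that cross a given arc) translates directly into code:

```python
def end_point_crossings(tree):
    """Max over arcs of the number of distinct heads of the arcs crossing it.
    Two arcs cross iff their endpoints strictly interleave in linear order."""
    arcs = [(min(d, h), max(d, h), h) for d, h in tree.items() if h != 0]
    best = 0
    for a, b, _ in arcs:
        heads = {h for c, e, h in arcs if a < c < b < e or c < a < e < b}
        best = max(best, len(heads))
    return best
```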
It has been argued that natural language dependency trees tend to have no more than one end-point crossing, which is called the 1-end-point-crossing constraint. Pitler, Kannan, and Marcus (2013) argue that this constraint is related to the Phase Impenetrability Condition from Minimalist syntax (Chomsky 2007). From a processing-based perspective, higher end-point crossings in a subtree should lead to multiple heads/dependents being maintained/stored at the same time in the parse stack. This should lead to increased storage cost (Gibson 1998). In addition, a longer span of the crossing dependency could lead to similarity-based interference (Lewis and Vasishth 2005) at the head.
Heads’ depth difference.
The heads' depth difference (HDD) of two crossing dependency arcs is the difference between the depths (distances from the root) of the two arcs' heads; the HDD of a tree is the maximum such difference over all pairs of crossing arcs (Yadav, Vaidya, and Husain 2017). In terms of formal syntax, HDD can correspond to the hierarchical depth between a filler and a gap in a long distance dependency (e.g., wh movement). Based on the theoretical syntax literature, HDD should be unbounded, at least for leftward wh-dependencies (Sag, Wasow, and Bender 1999). However, increasing HDD seems to correlate with increased online processing difficulty for humans (Phillips, Kazanina, and Abada 2005). More generally, HDD has been proposed (see Yadav, Vaidya, and Husain 2017) to formalize the experimental findings that increased embedding depth leads to processing difficulty (e.g., Yngve 1960; Gibson and Thomas 1999). Therefore, it is possible that HDD is restricted in dependency trees due to cognitive constraints.
2.2.1 Example Tree.
In order to demonstrate our crossing constraints, Figure 6 shows an example of a complex dependency tree in Hungarian, drawn from the Szeged Dependency Treebank. The tree has 3 crossing dependencies, with gap degree 2: The projection of kiejteni has two discontinuities. The tree has edge degree 2, end-point crossings 2, and HDD 2. It has sentence length 10, arity 2, and tree height 6.
2.3 Methods
To test if crossing constraints (e.g., gap degree, edge degree) can account for crossing dependencies attested in natural language treebanks, we compare the distribution of crossing constraints in natural languages with the random baselines matched in number of crossing dependencies and other tree properties. For example, to test if gap degree is a constraint—over and above the constraint on number of crossings and dependency lengths—we compare the distribution of gap degree in natural languages with the random baseline matched in number of crossings and dependency lengths. We describe the random baselines and statistical method used for comparison next (cf. Yadav, Husain, and Futrell 2019).
2.3.1 Random Baselines.
We use four random baselines to assess whether crossing constraints hold independently of the rate of crossing dependencies and of other tree properties, such as tree height or dependency lengths, in natural languages. The baselines are random trees, random linear arrangements (RLAs), dependency length (DL)-controlled random trees, and DL-controlled RLAs. Each baseline controls a particular set of tree properties, as shown in Table 1.
Table 1: Tree properties controlled by each random baseline.

| Baseline | Number of crossings | Tree topology | Dependency length |
|---|---|---|---|
| Random trees | ✓ | | |
| RLAs | ✓ | ✓ | |
| DL-controlled random trees | ✓ | | ✓ |
| DL-controlled RLAs | ✓ | ✓ | ✓ |
In order to generate a random tree corresponding to a real language tree of sentence length n, we first sample from a uniform distribution over tree structures with n nodes using Prüfer codes (Prüfer 1918), and then use rejection sampling to obtain a tree that matches the real tree in the number of crossing dependencies. The resulting distribution is uniform over all tree structures with the specified length and number of crossing edges. A DL-controlled random tree must match a real tree both in terms of its number of crossings and in terms of the distribution of dependency lengths within the tree (for details, see Yadav, Husain, and Futrell 2021). This procedure samples from the uniform distribution on trees with specified length, number of crossings, and distribution of dependency lengths.
Random linear arrangements (RLAs) are generated by permuting the linear order of nodes in a real language tree. A reordered tree that matches the real tree in number of crossings is accepted as a valid RLA sample. To generate DL-controlled RLAs, we sample from RLAs that match the real trees in the distribution of dependency lengths.
These baseline trees are all generated by a rejection sampling procedure that rejects the vast majority of samples. As such, it is only possible to generate baselines for sentences of length up to 11 using this method, given currently available computing resources.
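A minimal sketch of the rejection-sampling procedure for the plain random-tree baseline (an illustration of the general idea, not the authors' code; it samples labeled trees via Prüfer sequences, treats node labels as word positions, and abstracts away from the choice of root and arc direction):

```python
import random
from itertools import combinations

def prufer_to_edges(seq):
    """Decode a Prüfer sequence into the edge list of a labeled tree on
    n = len(seq) + 2 nodes; uniform sequences give uniform labeled trees."""
    n = len(seq) + 2
    degree = [1] * n
    for v in seq:
        degree[v] += 1
    edges = []
    for v in seq:
        leaf = min(u for u in range(n) if degree[u] == 1)
        edges.append((v, leaf))
        degree[leaf] -= 1
        degree[v] -= 1
    u, w = (x for x in range(n) if degree[x] == 1)
    edges.append((u, w))
    return edges

def count_crossings(edges):
    """Arcs cross iff their endpoints strictly interleave in linear order."""
    total = 0
    for (a, b), (c, d) in combinations(edges, 2):
        a, b = min(a, b), max(a, b)
        c, d = min(c, d), max(c, d)
        if a < c < b < d or c < a < d < b:
            total += 1
    return total

def sample_matched_tree(n, target_crossings, max_tries=1_000_000):
    """Rejection sampling: uniform over n-node trees whose number of
    crossing dependencies matches that of the real tree."""
    for _ in range(max_tries):
        seq = [random.randrange(n) for _ in range(n - 2)]
        edges = prufer_to_edges(seq)
        if count_crossings(edges) == target_crossings:
            return edges
    return None  # acceptance becomes very rare for long sentences
```

The returned None in the last function reflects exactly the practical limitation noted above: for longer sentences, the acceptance rate of the rejection sampler collapses.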
All baselines studied in this article control for the number of crossings per sentence. For baselines that control for dependency lengths but not the number of crossings, see Yadav, Husain, and Futrell (2021).
2.3.2 Data.
For natural language data, we use treebanks from Surface-syntactic Universal Dependencies (SUD) v2.4 (Gerdes et al. 2018, 2019). We test on treebanks from 56 languages, excluding treebanks with fewer than 500 sentences and treebanks of ancient languages.
The data contained 9 head-final and 47 head-initial languages, as classified by Yadav et al. (2020) on the basis of verb–object relations. The head-final languages were: Afrikaans, Dutch, German, Hindi, Japanese, Korean, Persian, Tamil, and Urdu. The head-initial languages were: Amharic, Arabic, Bulgarian, Bambara, Catalan, Czech, Danish, Greek, English, Spanish, Estonian, Basque, Finnish, Faroese, French, Irish, Galician, Hebrew, Croatian, Upper Sorbian, Hungarian, Armenian, Indonesian, Italian, Kazakh, Northern Kurdish, Lithuanian, Latvian, Maltese, Erzya, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Northern Sami, Serbian, Swedish, Thai, Turkish, Uyghur, Ukrainian, Vietnamese, Wolof, and Chinese.
The SUD treebanks have been converted from the Universal Dependencies (UD) treebanks (Nivre 2015) to reflect syntactic dependencies rather than the more semantic dependencies favored by UD. We also did an additional analysis on the corresponding UD treebanks. We found that the rate of violating crossing constraints is lower in UD trees but the overall pattern of results is the same as for SUD treebanks with two exceptions: (1) the constraint on end-point crossings receives weaker evidence, and (2) the constraint on well-nestedness receives stronger evidence. Regarding point (1), this is likely due to UD’s flat structure, which means that many nodes share a head. See supplementary materials S1 for a comparison of results obtained from SUD vs. UD data.
2.3.3 Statistical Method.
To test whether real language trees differ from random baseline trees in the distribution of crossing constraints, we fit mixed-effect Poisson regression models. Poisson regression is appropriate for modeling nonnegative integer-valued data such as the formal crossing properties. The dependent variables in the models are the rate of violations of crossing constraints (gap degree, edge degree, etc.).
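The regression model referred to below as Equation (2) does not appear in this section; a plausible reconstruction, consistent with the $R_{ij}$ term and the coefficients $\beta_2$ and $\beta_3$ discussed below, is a mixed-effects Poisson model of the form

$$\log \mathbb{E}[Y_{ij}] = \beta_0 + \beta_1 X_{ij} + \beta_2 R_{ij} + \beta_3 \, (X_{ij} \times R_{ij}) + u_i$$

where $Y_{ij}$ is the value of the crossing property (e.g., gap degree) for tree $j$ in language $i$, $X_{ij}$ is the controlled covariate (sentence length, maximum arity, or tree height, per the rows of Table 2), $R_{ij}$ indicates whether the tree is real or random, and $u_i$ is a by-language random intercept.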
To evaluate evidence for the hypothesis that gap degree is lower in real trees than in baseline trees, we compare the model in Equation (2) with a null model that lacks the Rij term, using a likelihood ratio test. We report log-likelihood ratio values, interpreting them as the strength of evidence for a difference between real and random trees.
The log-likelihood ratios can be interpreted as logarithmic Bayes factors comparing two hypotheses with equal prior probability: H0, that there is no distinction between real and random trees; and H1, that there is a distinction as given by the regression coefficients β2 and β3. A higher log-likelihood ratio indicates stronger evidence for H1.
2.4 Results
The distributions of the formal measures in real trees and random baselines are shown in Figure 7. Table 2 summarizes results in terms of log-likelihood ratios. We find that there is uniformly strong evidence that edge degree, end-point crossings, and HDD are different between real and random trees. This means that the distribution of these formal properties cannot be explained solely in terms of generic constraints on number of crossings, tree topology, and dependency length.
Table 2: Log-likelihood ratios quantifying evidence for each crossing constraint, by baseline and controlled covariate.

| Baseline / covariate | Gap degree | Well-nestedness | Edge degree | End-point crossings | HDD |
|---|---|---|---|---|---|
| Random trees | | | | | |
| Sentence length | 4 | 24 | 740 | 240 | 663 |
| Maximum arity | 251 | 45 | 1,142 | 465 | 1,021 |
| Tree height | 66 | 28 | 296 | 37 | 139 |
| Random linear arrangements | | | | | |
| Sentence length | 1 | 8 | 446 | 161 | 590 |
| Maximum arity | 1 | 3 | 390 | 133 | 513 |
| Tree height | 3 | 5 | 431 | 167 | 642 |
| DL-controlled random trees | | | | | |
| Sentence length | 0.16 | 16 | 17 | 13 | 327 |
| Maximum arity | 43 | 21 | 83 | 75 | 498 |
| Tree height | 252 | 58 | 140 | 148 | 49 |
| DL-controlled random linear arrangements | | | | | |
| Sentence length | 0.01 | 2 | 11 | 7 | 181 |
| Maximum arity | 0.04 | 1 | 7 | 4 | 152 |
| Tree height | 0.02 | 2 | 9 | 5 | 164 |
However, for gap degree and well-nestedness, the picture is different. When comparing real trees against DL-controlled random linear arrangements, we do not find substantial evidence for differences in well-nestedness or gap degree. When comparing against random linear arrangements, we do not find substantial evidence for differences in gap degree. These log-likelihood ratios are dramatically smaller than those for, say, HDD. If there is a distinction in gap degree between real and random trees, the evidence for it is very weak compared with the other constraints.
2.5 Discussion
We tested whether there is evidence for graph-theoretic constraints on crossing dependencies beyond what can be explained in terms of other, generic properties of natural language trees.
We find two key results:
There is decisive evidence that edge degree, end-point crossings, and HDD are different between real and random trees, suggesting that any constraints on these quantities cannot be explained merely in terms of the number of crossings, tree topology, and dependency length distribution in natural language trees.
We find insubstantial evidence for constraints on gap degree and well-nestedness after controlling for the number of crossings, tree topology, and dependency length.
Taken together, these results suggest that, despite the massive literature on gap degree and well-nestedness and their connections to formal language theory, these constraints are not the formal properties that most strongly characterize crossing dependencies. In comparison, edge degree, end-point crossings, and HDD—measures that have emerged from the parsing literature, not the formal language theory literature—emerge as strongly characteristic of crossing dependencies. We additionally note an important caveat: Our results hold only for short sentences and might be different for longer ones (cf. Ferrer-i-Cancho et al. 2021). The patterns for longer sentences can only be confirmed by generating baselines for longer sentences, which we leave to future work.
Among the crossing constraints investigated in the current work, the evidence for HDD as a crossing constraint is strongest. Recall that the HDD constraint is motivated by the findings in the psycholinguistic literature that increased embedding leads to processing difficulty. This suggests that constraints on crossing dependencies could be driven by processing considerations. We turn next to a preliminary investigation where this possibility is explored further.
3 Local Psycholinguistic Factors
In the previous section we investigated the role of formal graph-theoretic factors in determining the distribution of crossing dependencies in natural language. As stated earlier, these factors can be construed as capturing the competence-based constraints on grammar. However, natural language grammar can, in principle, also be influenced by processing-based constraints. One such early proposal can be found in Joshi (1985), where efficient (asymptotic) parsing complexity is a key design requirement for natural languages. More recent proposals include the role of efficient parsing in shaping word order in natural language (see, e.g., Hawkins 2004, although his notion of “efficiency” is quite different from Joshi’s). On this account, rarity of crossing dependencies across languages could be assumed to reflect processing difficulty in handling such configurations (also see Ferrer-i-Cancho 2014).
Given the incremental nature of language production (e.g., Ferreira and Henderson 1998), it is reasonable to assume that production of crossing dependencies, which necessarily involves a discontinuity in a phrasal boundary, would incur an increased processing cost compared with non-crossing dependencies. Indeed, recent work investigating filler–gap dependencies (Momma 2021) suggests that such crossing dependencies might require additional cognitive resources in planning. Given the evidence for the tight link between production and comprehension difficulty (MacDonald 2013; Scontras, Badecker, and Fedorenko 2017), it is therefore not far-fetched to assume that comprehenders should also find crossing dependencies difficult to process (Levy et al. 2012; Yadav, Vaidya, and Husain 2017; Husain and Vasishth 2015; Staub et al. 2018). Indeed, it is well known that, given an unbounded dependency, the comprehension system tries to resolve it as soon as possible, a strategy known as the active-filler strategy (Frazier 1987). The experiments discussed in this section, therefore, explore whether processing factors such as working-memory constraints, predictability, and so on, could play a role in determining the occurrence of a crossing dependency in the input.
3.1 Motivation
The key motivation of this preliminary study is to investigate if processing related factors modulate the occurrence of a crossing dependency. We investigate two factors—namely, working-memory constraints and prediction processes.
The influence of working-memory limitations on sentence comprehension as well as production is well attested. Cross-linguistically, the linear distance between syntactically related words (i.e., dependency length) has been found to be minimized, in the phenomenon of dependency length minimization (for reviews, see Liu, Xu, and Liang 2017; Temperley and Gildea 2018). Dependency length minimization has recently been argued to be a manifestation of information locality (Futrell 2019; Hahn, Degen, and Futrell 2021), the key idea being that words with high pointwise mutual information (PMI) tend to be close to each other. The "closeness" between a pair of words is usually operationalized as the linear distance between them, but more generally "closeness" can mean the simplicity of the structure intervening between the pair (see Yadav, Mittal, and Husain 2020). This generalized version of information locality can be termed information-simplicity: Words with high PMI tend to have simpler intervening structure between them.
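As a concrete illustration (not the authors' exact estimator), PMI between head and dependent word forms can be estimated from treebank counts:

```python
from collections import Counter
from math import log

def pmi_table(pairs):
    """PMI(h, d) = log p(h, d) - log p(h) - log p(d), estimated from a
    list of (head word, dependent word) pairs extracted from a treebank."""
    joint = Counter(pairs)
    heads = Counter(h for h, _ in pairs)
    deps = Counter(d for _, d in pairs)
    n = len(pairs)
    return {(h, d): log(c * n / (heads[h] * deps[d]))
            for (h, d), c in joint.items()}
```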
Together, dependency length minimization and information-simplicity have clear implications for crossing dependencies. First, the information-simplicity hypothesis predicts that a head–dependent pair with high PMI is less likely to be involved in a crossing dependency. Second, dependency length minimization implies that long phrases could be extraposed in order to avoid increased dependency distance in situ. We call this heavy-phrase extraposition. The heavy-phrase extraposition hypothesis predicts that dependencies with a heavy dependent, in terms of the length of the phrase, are more likely to be involved in a crossing dependency. Additionally, we expect a positive correlation between dependency length and crossing tendency: Dependencies that are shorter are less likely to be involved in a crossing configuration (Ferrer-i-Cancho 2014; Ferrer-i-Cancho and Gómez-Rodríguez 2016). We call this the localized-simplicity hypothesis: Localized words have simpler intervening structures between them.
Apart from the role of working memory, another factor that has garnered much attention in the processing literature is prediction. Sentence processing is known to involve a robust top–down component that involves preactivation of upcoming linguistic material. Such predictions are known to facilitate comprehension (Levy 2008; Smith and Levy 2013) and to attenuate the cost of memory constraints (Husain, Vasishth, and Srinivasan 2014). On this account, when the presence of a dependent is highly expected, this high expectation could offset any cost incurred due to a crossing dependency (Levy et al. 2012). We therefore test the expectation hypothesis: A crossing configuration is more likely in situations where the upcoming dependent/head is highly expected. Expectation of a dependent given a head is operationalized as the log probability that a head has at least one outgoing dependency with that relation type.
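A sketch of this operationalization (the token format here is a hypothetical simplification of a CoNLL-style treebank):

```python
from collections import defaultdict
from math import log

def expectation_scores(sentences):
    """EXP(rel | head POS): log of the proportion of head tokens of a given
    POS that have at least one outgoing dependent with the given relation.
    Each token is a (pos, head_index, deprel) triple; head indices are
    1-based, with head_index 0 marking the root."""
    pos_heads = defaultdict(int)   # POS -> number of tokens heading anything
    pos_rel = defaultdict(int)     # (POS, rel) -> heads with >= 1 such dependent
    for sent in sentences:
        outgoing = defaultdict(set)  # head index -> relations of its dependents
        for pos, head, rel in sent:
            if head > 0:
                outgoing[head].add(rel)
        for head_idx, rels in outgoing.items():
            head_pos = sent[head_idx - 1][0]
            pos_heads[head_pos] += 1
            for rel in rels:
                pos_rel[(head_pos, rel)] += 1
    return {(pos, rel): log(count / pos_heads[pos])
            for (pos, rel), count in pos_rel.items()}
```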
Given the gap between the point at which a prediction is made and the point at which the predicted linguistic material is received in the input, linguistic predictions have to be maintained in memory (Gibson 1998). This maintenance cost has been shown to correspond to measurable processing difficulty (Husain, Vasishth, and Srinivasan 2015; Ristic et al. 2021); that is, prediction maintenance over a longer period can be costly. It is therefore expected that longer maintenance of dependencies involved in crossing should be avoided. In this work we operationalize maintenance as (a) the number of words between the heads of the crossing dependencies, and (b) the number of heads between the heads of the crossing dependencies. We expect that dependencies that form a crossing configuration tend to have a shorter distance between their heads, both in terms of number of words and number of heads. We call this the head-head locality hypothesis: Two dependencies whose heads are far away from each other are less likely to form a crossing configuration.
3.2 Methods
In order to assess the role of various processing factors mentioned above, we fit logistic regression models to predict whether two dependencies cross.
Given dependency i and dependency j in a sentence (see Figure 8), this model uses the following predictors to predict whether dependency i is crossed by dependency j. We call i the host dependency and j the crosser dependency. The full model is sketched after the predictor list.
Head-dependent pointwise mutual information of dependency i and dependency j, i.e., PMIi and PMIj
Expectation of seeing a dependency relation given head type for dependency i and dependency j, i.e., EXPi and EXPj
Distance between head and dependent of dependency i and dependency j, i.e., DDi and DDj
Weight of dependent of dependency i, Wi, i.e., the number of nodes transitively dominated by the dependent of i plus one
Linear distance between head of dependency i and head of dependency j, LHD
Hierarchical distance (number of heads) between head of dependency i and head of dependency j, HHD
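Collecting these predictors, the model plausibly has the following form (a reconstruction consistent with the coefficient indices β1–β9 referenced in the predictions below; the exact specification is not shown in this section):

$$\operatorname{logit} P(\text{cross}_{ij}) = \beta_0 + \beta_1 \mathrm{PMI}_i + \beta_2 \mathrm{EXP}_i + \beta_3 \mathrm{DD}_i + \beta_4 W_i + \beta_5 \mathrm{PMI}_j + \beta_6 \mathrm{EXP}_j + \beta_7 \mathrm{DD}_j + \beta_8 \mathrm{LHD} + \beta_9 \mathrm{HHD}$$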
Given the discussion in Section 3.1, we predict the following:
The information-simplicity hypothesis predicts that a dependency with high pointwise mutual information is less likely to get involved in a crossing construction. Thus, it predicts a negative estimate for the parameters β1 and β5.
The heavy-phrase extraposition hypothesis predicts that if a dependency has a heavy dependent, it is more likely to get crossed. Thus, a positive estimate is predicted for β4.
The localized-simplicity hypothesis predicts a positive correlation between dependency length and crossing tendency, that is, a positive estimate for β3 and β7.
The expectation hypothesis predicts that a dependency with high expectation is more likely to involve a crossing construction, namely, a positive estimate for β2 and β6.
Finally, the head-head locality hypothesis predicts that two dependencies with longer distance between their heads (i.e., higher LHD or HHD) are less likely to cross each other, that is, β8 and β9 should show a negative estimate.
3.2.1 Data.
We choose 12 languages from the Surface-syntactic Universal Dependencies (SUD v2.4) treebanks, of which 7 are head-initial and 5 are head-final. The criteria for language selection are corpus size (>5,000 trees) and language typology. Hindi, Dutch, Japanese, German, and Korean are the head-final languages; the remaining languages, English, Arabic, French, Spanish, Italian, Polish, and Romanian, are head-initial.
3.3 Results
Table 3 summarizes the results from the logistic regression model. All the predictors (pointwise mutual information, expectation, dependency length, dependent weight, and the distance between the heads of the host and crosser dependencies) have significant effects on crossings, largely in the directions predicted above, with the notable exception of expectation, discussed below.
Table 3: Logistic regression estimates for the effect of each predictor on crossing tendency, by language.

| Language | PMI (host) | EXP (host) | DD (host) | Dep. weight (host) | PMI (crosser) | EXP (crosser) | DD (crosser) | LHD | HHD |
|---|---|---|---|---|---|---|---|---|---|
| English | −0.26 | −0.11 | 0.09 | 0.014 | −0.27 | −0.09 | 0.09 | −0.007 | −0.147 |
| Hindi | −0.03 | −0.31 | 0.07 | 0.015 | 0.15 | −0.25 | 0.08 | 0.029 | −0.192 |
| Dutch | −0.08 | −0.09 | 0.10 | 0.020 | 0.04 | −0.12 | 0.11 | −0.091 | 0.085 |
| Arabic | −0.23 | 0.02 | 0.06 | 0.045 | 0.17 | −0.23 | 0.08 | −0.027 | −0.103 |
| Japanese | −0.40 | 0.02 | 0.05 | 0.031 | −0.39 | −0.08 | 0.04 | −0.023 | −0.151 |
| French | −0.06 | −0.30 | 0.08 | 0.018 | −0.05 | −0.34 | 0.08 | 0.068 | −0.278 |
| German | −0.20 | −0.17 | 0.11 | 0.031 | −0.03 | −0.16 | 0.11 | −0.069 | 0.061 |
| Korean | −0.05 | −0.07 | 0.12 | 0.082 | −0.48 | 0.01 | 0.09 | −0.04 | −0.45 |
| Spanish | −0.11 | −0.11 | 0.08 | 0.012 | −0.19 | −0.01 | 0.09 | 0.01 | −0.21 |
| Italian | 0.05 | −0.22 | 0.09 | 0.021 | 0.08 | −0.23 | 0.10 | −0.02 | −0.09 |
| Polish | −0.09 | −0.11 | 0.11 | 0.010 | −0.05 | −0.09 | 0.12 | 0.002 | 0.01 |
| Romanian | −0.08 | −0.09 | 0.12 | 0.028 | −0.05 | −0.09 | 0.14 | −0.02 | −0.06 |
We find an effect of pointwise mutual information for most of the languages such that two dependencies tend to cross each other if their head-dependent mutual information is low. In other words, a dependency that has relatively low mutual information between its head and dependent is more likely to cross—and get crossed by—another dependency. The result supports the information-simplicity hypothesis.
Additionally, all 12 languages show an effect of dependent weight such that a dependency with a heavier dependent phrase is more likely to get crossed by another dependency. This result supports the heavy-phrase extraposition hypothesis.
The results also support the localized-simplicity hypothesis, that is, dependency distance is positively correlated with crossing tendency—a shorter dependency is less likely to be involved in a crossing construction.
With regard to the role of prediction, the results show that a dependency relation that has relatively low expectation given a head type is more likely to cross—and get crossed by—another dependency. This result does not support the expectation hypothesis. Finally, we find that the linear and hierarchical distance between the heads of two dependencies negatively influence whether they cross. Crossing tendency reduces as the number of heads/words between the heads of the two dependencies increases. This result holds for most of the languages (except Hindi, Spanish, Dutch, and Polish). The LHD/HHD results support the head-head locality hypothesis motivated by the prediction-maintenance account.
3.4 Extraposition in Noun Phrases
We also test a prediction by Levy et al. (2012) that a crossing dependency is easier to comprehend if the expectation of a dependency relation given a nominal head is higher. We test whether expectation has a positive effect on the crossing tendency of noun-headed constructions (see Figure 9). In addition to the expectation effect, we also test the effect of the weight of the noun's dependent—a noun-headed dependency is more likely to get crossed if its dependent is heavier, because the heavy dependent may move around to minimize dependency length. To test these hypotheses, we fit a logistic regression similar to the model above, with two predictors, expectation of the dependency relation and dependent weight, and their interaction.
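Under the same hypothetical specification sketched earlier, this model has the form (the interaction term is included because Table 4 reports it):

$$\operatorname{logit} P(\text{crossed}) = \beta_0 + \beta_1 W + \beta_2 \mathrm{EXP} + \beta_3 \, (W \times \mathrm{EXP})$$

where W is the weight of the noun's dependent and EXP is the expectation of the dependency relation given the nominal head.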
Table 4 shows the estimated effects of expectation and dependent weight on the crossing tendency of noun-headed constructions. The effect of expectation has a negative estimate in all languages except Japanese (where it is null) and Korean (where it is positive), suggesting that if the expectation of seeing a particular dependency relation in a noun-headed dependency is higher, the dependency is less likely to get crossed by another dependency.
Table 4: Effects of dependent weight, expectation, and their interaction on the crossing tendency of noun-headed constructions (* = significant, n.s. = not significant).

| Language | D. weight Est. | SE | z | Expectation Est. | SE | z | Interaction Est. | SE | z |
|---|---|---|---|---|---|---|---|---|---|
| English | 0.096 | 0.013 | 7.36 * | −0.431 | 0.018 | −22.81 * | 0.005 | 0.003 | 2.04 * |
| Hindi | 0.103 | 0.005 | 20.05 * | −0.422 | 0.011 | −39.09 * | 0.013 | 0.001 | 9.61 * |
| Dutch | 0.122 | 0.006 | 18.89 * | −0.444 | 0.014 | −31.41 * | 0.005 | 0.001 | 3.77 * |
| Arabic | 0.064 | 0.006 | 10.78 * | −0.250 | 0.025 | −9.89 * | 0.004 | 0.001 | 2.85 * |
| Japanese | 0.046 | 0.002 | 22.57 * | −0.006 | 0.011 | −0.61 n.s. | −0.001 | 0.001 | −1.22 n.s. |
| French | 0.110 | 0.004 | 22.61 * | −0.486 | 0.011 | −41.77 * | 0.008 | 0.001 | 8.96 * |
| German | 0.198 | 0.005 | 36.29 * | −0.499 | 0.016 | −29.86 * | 0.016 | 0.001 | 13.93 * |
| Korean | 0.009 | 0.012 | 0.82 n.s. | 0.054 | 0.017 | 3.11 * | −0.029 | 0.003 | −9.70 * |
| Spanish | 0.102 | 0.005 | 19.39 * | −0.067 | 0.026 | −2.51 * | 0.005 | 0.002 | 2.59 * |
| Italian | 0.092 | 0.006 | 14.34 * | −0.400 | 0.016 | −24.17 * | 0.003 | 0.001 | 2.60 * |
| Polish | −0.016 | 0.011 | −1.39 n.s. | −0.051 | 0.016 | −3.10 * | −0.008 | 0.002 | −4.67 * |
| Romanian | 0.143 | 0.011 | 12.07 * | −0.212 | 0.034 | −6.17 * | 0.018 | 0.005 | 3.76 * |
The weight of the dependent in a noun-headed construction positively affects its crossing tendency (in all languages except Korean and Polish, where the effect is not significant)—a noun-headed construction has a higher tendency to get crossed by another dependency if the weight of its dependent is higher. This result supports the heavy-phrase extraposition hypothesis: A heavy dependent is more likely to get extraposed—hence causing a crossing dependency—in order to minimize dependency distance.
We also find a significant interaction between dependent weight and expectation such that the effect of dependent weight increases as expectation increases. This result suggests that high expectation facilitates crossing dependencies with heavy dependents: A heavy dependent is more likely to get extraposed when expectation is high. This pattern lends some support to the hypothesis proposed by Levy et al. (2012) that increased expectation facilitates the processing of right-extraposed structures.
3.5 Comparison with Baseline Trees
One concern regarding the significant effects of formal measures like dependency distance, dependent weight, and inter-head distance could be that they do not capture any linguistic or cognitive phenomenon; rather, their effects could be due to independent factors such as tree topology. For example, it is possible that an arc in any random directed acyclic graph is more likely to get crossed by another arc if the linear distance between its nodes is larger (Ferrer-i-Cancho and Gómez-Rodríguez 2016).
To check whether genuine properties of natural language underlie the effects of dependency distance, dependent weight, and so forth, on crossing dependencies, we fit a model to predict crossing tendency in real vs. random trees. The model includes interaction terms to test whether the expected effect of dependency distance, dependent weight, and so on, is larger in real trees than in the random trees generated as part of Section 2. If the effects of these predictors arise from real psycholinguistic phenomena, we expect positive estimates for the interaction terms.
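A plausible form of this model, matching the layout of Table 5 (main effects estimated in random trees, plus real-tree interactions; again a reconstruction, not the exact specification), is:

$$\operatorname{logit} P(\text{cross}) = \beta_0 + \gamma_0 \, \mathrm{Real} + \sum_k (\beta_k + \gamma_k \, \mathrm{Real}) \, X_k$$

where $X_k$ ranges over DD (host), DD (crosser), LHD, HHD, and host dependent weight; $\beta_k$ gives the effect in random trees and $\gamma_k$ the interaction reported in the lower half of Table 5.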
Table 5 shows the results of the model. We find a significant interaction effect for all the predictors.
The effect of dependency distance on crossing tendency is larger in real language trees, such that longer dependencies are more likely to cross or get crossed by another dependency in real trees compared to random trees.
The effect of dependent weight is driven by real trees, such that a dependency with a heavy dependent is more likely to get crossed in real trees, but not in random trees (where the main effect is in the opposite direction).
Table 5: Effects of predictors on crossing tendency in random trees, and interaction effects showing how those effects differ in real trees (* = significant).

| Predictor | Estimate | Std. Error | z value | Sig. |
|---|---|---|---|---|
| Effect of predictors in random trees | | | | |
| DD (host dependency) | 0.27 | 0.003 | 89.91 | * |
| DD (crosser dependency) | 0.21 | 0.003 | 69.45 | * |
| LHD | −0.06 | 0.006 | −9.82 | * |
| HHD | 0.05 | 0.014 | 3.53 | * |
| Host dependent weight | −0.28 | 0.005 | −51.03 | * |
| Interaction effects: effect of predictors in real trees compared with random trees | | | | |
| DD: Real (host dependency) | 0.15 | 0.015 | 10.17 | * |
| DD: Real (crosser dependency) | 0.18 | 0.015 | 11.77 | * |
| LHD: Real | 0.35 | 0.025 | 13.84 | * |
| HHD: Real | −0.30 | 0.052 | −5.78 | * |
| Host dependent weight: Real | 0.31 | 0.022 | 13.75 | * |
3.6 Discussion
The results from the logistic regression models provide compelling evidence for the role of working-memory constraints and information locality in the occurrence of crossing dependencies. To summarize, we find support for:
Information simplicity: The words with high mutual information are less likely to form crossing configurations.
Localized simplicity: The words that are close to each other are less likely to involve a crossing dependency.
Heavy-phrase extraposition: A dependency with a heavy dependent is more likely to get crossed by another dependency.
Head-head locality: Two dependencies with their heads being far away from each other are less likely to form a crossing configuration.
However, the results were only partly consistent with the expectation hypothesis: High expectation for the presence of a dependent correlates with a higher rate of crossing only when that dependent is heavy, as indicated by the positive interaction of weight and expectation in Table 4 for most languages together with a negative main effect of expectation. Overall, we found evidence for working-memory-based and information-locality-based accounts, but limited evidence for expectation-based accounts.
An interesting result that should be interpreted in light of previous findings is the effect of dependency length on the occurrence of crossing dependencies. Recent work has shown that a constraint on dependency length alone cannot fully explain the low rate of crossing dependencies in natural languages (Yadav, Husain, and Futrell 2021). In contrast, our results show an effect of dependency length on crossing tendency such that longer dependencies are more likely to cross—or get crossed by—other dependencies. Additionally, our results in Section 2 show that dependency lengths can explain the distribution of ill-nestedness in natural languages. Together, these findings suggest that although dependency length affects the occurrence of crossing dependencies, it cannot fully explain the observed quantitative distribution of crossings in natural languages.
4 General Discussion
The current work investigated crossing dependencies from two perspectives. These were (a) the role of certain global graph-theoretic factors in determining the distribution of crossing dependencies in natural language, and (b) the role of certain local psycholinguistic factors in determining if two dependencies cross. Results from Experiment 1 provide strong evidence that edge-degree, end-point crossings, and HDD determine the distribution of crossing dependencies in natural language, while they provide weak to insubstantial evidence that other factors such as gap-degree and well-nestedness constrain crossing dependencies. This suggests that the apparent bounds on gap-degree and well-nestedness arise as a consequence of factors such as number of crossings, tree topology, and dependency length.
In particular, the results from Experiment 1 provide the strongest evidence in favor of cognitively motivated crossing constraints (such as HDD), suggesting a parsing/processing-driven constraint on crossing dependencies. This proposal was further substantiated by the results from Experiment 2, which show that psycholinguistically motivated factors such as information locality, dependent weight, and inter-head distance can determine whether a pair of dependencies will cross, even when compared against random trees. These results provide evidence for a functional motivation for the distribution of crossing dependencies in natural language. The key contributions of this work are (a) to provide a method to quantify the evidence for a particular graph-theoretic constraint beyond what can be explained in terms of tree topology, number of crossings, and dependency distance, and (b) to highlight that crossings can be predicted by processing factors beyond dependency distance.
Together, the two findings suggest that graph-theoretic constraints on crossing dependencies could be driven by processing considerations (see also Ferrer-i-Cancho and Gómez-Rodríguez 2016). The idea that formal properties of grammar could be determined by processing considerations is not new. As stated earlier, Joshi (1985) proposed the MCS hypothesis, which required grammars to be efficiently parsable, in the sense of worst-case asymptotic complexity of exact parsing. Similarly, the performance-grammar correspondence hypothesis of Hawkins (2004) proposes that processing strategies become grammaticalized for reasons of efficiency. We note that, on these accounts, efficiency in grammar is understood as efficiency in the online comprehension of the utterances licensed by the grammar.
There is evidence that humans find crossing dependencies difficult during comprehension. One piece of evidence comes from the processing of filler–gap dependencies. A filler–gap dependency typically involves a crossing dependency, and the human processing system is known to resolve it under a principle of immediacy: the parser tries to resolve the dependency sooner rather than later (De Vincenzi 1991). There is also evidence that, where possible, such dependencies are avoided (Staub et al. 2018). More recently, Husain and Yadav (2020) show that in Hindi, crossing dependencies are avoided during the comprehension of participle clauses and that such dependencies lead to processing difficulty. Processing a crossing dependency necessarily involves maintaining unresolved structure, and a strategy of avoiding such complex configurations is consistent with the parser's bias toward building simple structures (Frazier 1985). In the domain of computational parsers, limits on quantities such as edge degree and end-point crossings have previously been shown to yield efficiency in terms of both asymptotic complexity and practical accuracy (Pitler, Kannan, and Marcus 2013), but there has not yet been any systematic investigation of whether these factors affect human online processing. If constraints on crossing dependencies are indeed motivated by processing efficiency for humans, then formal factors such as end-point crossings might also lead to observable processing difficulty for humans.
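To make one of these parsing-motivated restrictions concrete: a tree is 1-endpoint-crossing if, for every arc, all arcs that cross it share a common vertex (Pitler, Kannan, and Marcus 2013). The sketch below checks this property over a set of arcs, reusing the crosses predicate from the earlier sketch; it is an illustrative check of the property, not the parsing algorithm of Pitler et al.

```python
# Illustrative check of the 1-endpoint-crossing property: for each arc,
# every arc crossing it must share a single common vertex.
# Reuses crosses() from the earlier sketch.

def is_one_endpoint_crossing(arcs):
    for e in arcs:
        crossers = [f for f in arcs if crosses(e, f)]
        if crossers:
            shared = set(crossers[0])  # candidate shared endpoints
            for f in crossers[1:]:
                shared &= set(f)
            if not shared:  # no single vertex common to all crossers
                return False
    return True
```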
The research discussed above suggests that the distribution of crossing dependencies may arise from comprehension difficulty, but processing efficiency is not limited to comprehension: production and learning may also play a role. For example, it is known that difficulty during comprehension can be a consequence of pressures during production (MacDonald 2013); on this view, ease of production determines the distribution of linguistic patterns in a language community, and this distribution in turn makes certain structures either easy or difficult to comprehend. Learning may also play a role: given that crossing dependencies are difficult to produce and comprehend, it is possible that such dependencies are also difficult to learn and that their rarity is independently driven by learning biases (Chang 2009). Teasing apart the roles, and relative contributions, of these various processing factors will require a dedicated research effort. We hope that our work is a step in that direction.
If crossing dependencies are difficult to process, then why do they exist at all? To answer this question, we need to appreciate that processing cost due to crossings is just one of many sources of processing difficulty. Decades of research on comprehension and production have shown that processing complexity can be understood as a trade-off among various countervailing factors related to sentence encoding and decoding (e.g., Trueswell, Tanenhaus, and Garnsey 1994; Kaiser and Trueswell 2004; Altmann and Kamide 1999; Frazier 1979; Gibson 1998; Lewis and Vasishth 2005; Levy 2008). For example, syntactic configurations involving clausal embedding are known to be quite complex, and in such configurations creating a crossing dependency (for example, by right extraposition) can in fact yield a relatively less complex structure (Yngve 1960). This, of course, implies that in certain contexts other syntactic configurations can be costlier. Similarly, requirements at other levels of linguistic encoding (discourse, pragmatics, etc.) may force word orderings that create crossing dependencies. Finally, during production, accessibility-related pressures (e.g., Ferreira and Dell 2000; Branigan, Pickering, and Tanaka 2008) can lead to the creation of crossing dependencies. In short, crossing dependencies may be formed (a) to create less costly structures overall, (b) to satisfy extra-sentential requirements, and (c) under production pressures such as accessibility.
Could crossing dependencies also arise for non-functional reasons? It is possible that a language-wide grammatical constraint (e.g., on the position of heads) leads to the existence of certain crossing dependencies. For example, in English, the presence of crossing wh dependencies could be a consequence of the language's fixed word order. Compare this to wh dependencies in a free-word-order language like Hindi, where the wh phrase can appear in situ, creating a non-crossing dependency. In such cases, avoiding a crossing dependency in a language like English is simply not possible, because of language-wide grammatical constraints.
We hope this work stimulates future research on the functional bases of restrictions on crossing dependencies, using both experimental and corpus methods.
Acknowledgments
We thank the three anonymous reviewers for helpful suggestions. This work was supported by an NVIDIA GPU grant to RF.
Notes
We define the arity of a tree as the maximum out-degree over its nodes.
Tree height is the maximum distance from the root to a leaf node (both quantities are operationalized in the sketch following these notes).
The tendency to minimize the number of heads in such configurations could be related to a recent finding that syntactic heads are avoided in the intervening region of a dependency (Yadav, Mittal, and Husain 2020).
One could argue that the effect of dependent weight in noun-headed constructions is due only to right-extraposed relative clauses, and hence that the result should not be generalized to the heavy-phrase extraposition hypothesis. However, the results for noun-headed constructions remain the same when we remove relative clause cases, implying that the weight-expectation effect in noun-headed constructions is not driven by relative clause constructions alone.
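As a concrete illustration of the arity and tree-height definitions above, the following minimal sketch computes both quantities for a tree represented as a mapping from each node to the list of its dependents; the representation and function names are ours for exposition.

```python
# Illustrative sketch of the arity and tree-height definitions above,
# assuming a tree given as a dict mapping each node to its dependents.

def arity(children, node):
    """Maximum out-degree over all nodes of the subtree rooted at `node`."""
    kids = children.get(node, [])
    return max([len(kids)] + [arity(children, k) for k in kids])

def height(children, node):
    """Maximum distance from `node` down to a leaf."""
    kids = children.get(node, [])
    return 1 + max(height(children, k) for k in kids) if kids else 0

# Example: root 0 heads words 1 and 2; word 2 heads word 3.
tree = {0: [1, 2], 2: [3]}
assert arity(tree, 0) == 2 and height(tree, 0) == 2
```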