Abstract
We use large-scale corpora in six different gendered languages, along with tools from NLP and information theory, to test whether there is a relationship between the grammatical genders of inanimate nouns and the adjectives used to describe those nouns. For all six languages, we find that there is a statistically significant relationship. We also find that there are statistically significant relationships between the grammatical genders of inanimate nouns and the verbs that take those nouns as direct objects, as indirect objects, and as subjects. We defer deeper investigation of these relationships to future work.
1 Introduction
In many languages, nouns possess grammatical genders. When a noun refers to an animate object, its grammatical gender typically reflects the biological sex or gender identity of that object (Zubin and Köpcke, 1986; Corbett, 1991; Kramer, 2014). For example, in German, the word for a boss is grammatically feminine when it refers to a woman, but grammatically masculine when it refers to a man—Chefin and Chef, respectively. But inanimate nouns (i.e., nouns that refer to inanimate objects) also possess grammatical genders. Any German speaker will tell you that the word for a bridge, Brücke, is grammatically feminine, even though bridges have neither biological sexes nor gender identities. Historically, the grammatical genders of inanimate nouns have been considered more idiosyncratic and less meaningful than the grammatical genders of animate nouns (Brugmann, 1889; Bloomfield, 1933; Fox, 1990; Aikhenvald, 2000). However, some cognitive scientists have reopened this discussion by using laboratory experiments to test whether speakers of gendered languages reveal gender stereotypes (Sera et al., 1994)—for example, and most famously, when choosing adjectives to describe inanimate nouns (Boroditsky et al., 2003).
Although laboratory experiments are highly informative, they typically involve small sample sizes. In this paper, we therefore use large-scale corpora and tools from NLP and information theory to test whether there is a relationship between the grammatical genders of inanimate nouns and the adjectives used to describe those nouns. Specifically, we calculate the mutual information (MI), a measure of the mutual statistical dependence between two random variables, between the grammatical genders of inanimate nouns and the adjectives that describe them (i.e., share a dependency arc labeled amod). We do so using large-scale corpora in six different gendered languages: German, Italian, Polish, Portuguese, Russian, and Spanish. For all six languages, we find that the MI is statistically significant, meaning that there is a relationship.
We also test whether there are relationships between the grammatical genders of inanimate nouns and the verbs that take those nouns as direct objects, as indirect objects, and as subjects. For all six languages, we find that there are statistically significant relationships for the verbs that take those nouns as direct objects and as subjects. For five of the six languages, we also find that there is a statistically significant relationship for the verbs that take those nouns as indirect objects, but because of the small number of noun–verb pairs involved, we caution against reading too much into this finding.
To contextualize our findings, we test whether there are statistically significant relationships between the grammatical genders of inanimate nouns and the cases and numbers of these nouns. A priori, we do not expect to find statistically significant relationships, so these tests can be viewed as a baseline of sorts. As expected, for each of the six languages, there are no statistically significant relationships.
To provide further context, we also repeat all tests for animate nouns—a “skyline” of sorts—finding that for all six languages there is a statistically significant relationship between the grammatical genders of animate nouns and the adjectives used to describe those nouns. We also find that there are statistically significant relationships between the grammatical genders of animate nouns and the verbs that take those nouns as direct objects, as indirect objects, and as subjects. All of these relationships have effect sizes (operationalized as normalized MI values) that are larger than the effect sizes for inanimate nouns.
We emphasize that the practical significance and implications of our findings require deeper investigation. Most importantly, we do not investigate the characteristics of the relationships that we find. This means that we do not know whether these relationships are characterized by gender stereotypes, as argued by some cognitive scientists. We also do not engage with the ways that historical and sociopolitical factors affect the grammatical genders possessed by either animate or inanimate nouns (Fodor, 1959; Ibrahim, 2014).
2 Background
2.1 Grammatical Gender
Languages lie along a continuum with respect to whether nouns possess grammatical genders. Languages with no grammatical genders, like Turkish, lie on one end of this continuum, while languages with tens of gender-like classes, like Swahili (Corbett, 1991), lie on the other. In this paper, we focus on six different gendered languages for which large-scale corpora are readily available: German, Italian, Polish, Portuguese, Russian, and Spanish, all languages of Indo-European descent. Three of these languages (Italian, Portuguese, and Spanish) have two grammatical genders (masculine and feminine), while the other three (German, Polish, and Russian) have three grammatical genders (masculine, feminine, and neuter).
All six languages exhibit gender agreement, meaning that words are marked with morphological suffixes that reflect the grammatical genders of their surrounding nouns (Corbett, 2006). For example, consider the following translations of the sentence, “The delicate fork is on the cold ground.”
- (1) Die zierliche Gabel steht auf dem kalten Boden.
  the.f.sg.nom delicate.f.sg.nom fork.f.sg.nom stands on the.m.sg.dat cold.m.sg.dat ground.m.sg.dat
  “The delicate fork is on the cold ground.”
- (2) El tenedor delicado está en el suelo frío.
  the.m.sg fork.m.sg delicate.m.sg is on the.m.sg ground.m.sg cold.m.sg
  “The delicate fork is on the cold ground.”
Because the German word for a fork, Gabel, is grammatically feminine, the German translation uses the feminine determiner, die. Had Gabel been masculine, the German translation would have used the masculine determiner, der. Similarly, because the Spanish word for a fork, tenedor, is grammatically masculine, the Spanish translation uses the masculine determiner, el, instead of the feminine determiner, la. As we explain in Section 3, we lemmatize each corpus to ensure that our tests do not simply reflect the presence of gender agreement.
2.2 Grammatical Gender & Meaning
Although some scholars have described the grammatical genders possessed by inanimate nouns as “creative” and meaningful (Grimm, 1890; Wheeler, 1899), many scholars have considered them to be idiosyncratic (Brugmann, 1889; Bloomfield, 1933) or arbitrary (Maratsos, 1979, p. 317). In an overview of this work, Dye et al. (2017) wrote, “As often as not, the languages of the world assign [inanimate] objects into seemingly arbitrary [classes]... William of Ockham considered gender to be a meaningless, unnecessary aspect of language.” Bloomfield (1933) shared this viewpoint, stating that “[t]here seems to be no practical criterion by which the gender of a noun in German, French, or Latin [can] be determined.” Indeed, adult language learners often have particular difficulty mastering the grammatical genders of inanimate nouns (Franceschina, 2005, Ch. 4; DeKeyser, 2005; Montrul et al., 2008), which suggests that their meanings are not straightforward.
Even if the grammatical genders possessed by inanimate nouns are meaningless, ample evidence suggests that gender-related information may affect cognitive processes (Sera et al., 1994; Cubelli et al., 2005, 2011; Kurinski and Sera, 2011; Boutonnet et al., 2012; Saalbach et al., 2012). Typologists and formal linguists have argued that grammatical genders are an important feature for morphosyntactic processes (Corbett, 1991, 2006; Harbour et al., 2008; Harbour, 2011; Kramer, 2014, 2015), while some cognitive scientists have shown that grammatical genders can be a perceptual cue—for example, human brain responses exhibit sensitivity to gender mismatches in several different languages (Osterhout and Mobley, 1995; Hagoort and Brown, 1999; Vigliocco et al., 2002; Wicha et al., 2003, 2004; Barber et al., 2004; Barber and Carreiras, 2005; Bañón et al., 2012; Caffarra et al., 2015), and the grammatical genders of determiners and adjectives can prime nouns (Bates et al., 1996; Akhutina et al., 1999; Friederici and Jacobsen, 1999). However, the precise nature of the relationship between grammatical gender and meaning remains an open research question.
In particular, the grammatical genders possessed by inanimate nouns might affect the ways that speakers of gendered languages conceptualize the objects referred to by those nouns (Jakobson, 1959; Clarke et al., 1981; Ervin-Tripp, 1962; Konishi, 1993; Sera et al., 1994, 2002; Vigliocco et al., 2005; Bassetti, 2007)—although we note that this viewpoint is somewhat contentious (Hofstätter, 1963; Bender et al., 2011; McWhorter, 2014). Neo-Whorfian cognitive scientists hold a particularly strong variant of this viewpoint, arguing that the grammatical genders possessed by inanimate nouns prompt speakers of gendered languages to rely on gender stereotypes when choosing adjectives to describe those nouns (Boroditsky and Schmidt, 2000; Boroditsky et al., 2002; Phillips and Boroditsky, 2003; Boroditsky, 2003; Boroditsky et al., 2003; Semenuks et al., 2017). Most famously, Boroditsky et al. (2003) claim to have conducted a laboratory experiment showing that speakers of German choose stereotypically feminine adjectives to describe, for example, bridges, while speakers of Spanish choose stereotypically masculine adjectives, reflecting the fact that in German, the word for a bridge, Brücke, is grammatically feminine, while in Spanish, the word for a bridge, puente, is grammatically masculine. Boroditsky et al. (2003) took these findings to be a relatively strong confirmation of the existence of a stereotype effect—that is, that speakers of gendered languages reveal gender stereotypes when choosing adjectives to describe inanimate nouns. That said, the experiment has not gone unchallenged. Indeed, Mickan et al. (2014) reported two unsuccessful replication attempts.
2.3 Laboratory Experiments vs. Corpora
Traditionally, studies of grammatical gender and meaning have relied on laboratory experiments. This is for two reasons: 1) laboratory experiments can be tightly controlled, and 2) they enable scholars to measure speakers’ immediate, real-time speech production. However, they also typically involve small sample sizes and, in many cases, somewhat artificial settings. In contrast, large-scale corpora of written text enable scholars to measure even relatively weak correlations via writers’ text production in natural, albeit less tightly controlled, settings. They also facilitate the discovery of correlations that hold across languages with disparate histories, cultural contexts, and even gender systems. As a result, large-scale corpora have proven useful for studying a wide variety of language-related phenomena (e.g., Featherston and Sternefeld, 2007; Kennedy, 2014; Blasi et al., 2019).
In this paper, we assume that a writer’s choice of words in written text is as informative as a speaker’s choice of words in a laboratory experiment, despite the obvious differences between these settings. Consequently, we use large-scale corpora and tools from NLP and information theory, enabling us to test for the presence of even relatively weak relationships involving the grammatical genders of inanimate nouns across multiple different gendered languages. We therefore argue that our findings complement, rather than supersede, laboratory experiments.
2.4 Related Work
Our paper is not the first to use large-scale corpora and tools from NLP to investigate gender and language. Many scholars have studied the ways that societal norms and stereotypes, including gender norms and stereotypes, can be reflected in representations of distributional semantics derived from large-scale corpora, such as word embeddings (Bolukbasi et al., 2016; Caliskan et al., 2017; Garg et al., 2018; Zhao et al., 2018). More recently, Williams et al. (2019) found that the grammatical genders of inanimate nouns in eighteen different languages were correlated with their lexical semantics. Dye et al. (2017) used tools from information theory to reject the idea that the grammatical genders of nouns separate those nouns into coherent categories, arguing instead that grammatical genders are only meaningful in that they systematically facilitate communication efficiency by reducing nominal entropy. Also relevant to our paper is the work of Kann (2019), who proposed a computational approach to testing whether there is a relationship between the grammatical genders of inanimate nouns and the words that co-occur with those nouns, operationalized via word embeddings. However, in contrast to our findings, they found no evidence for the presence of such a relationship. Finally, many scholars have proposed a variety of computational techniques for mitigating gender norms and stereotypes in a wide range of language-based applications (Dev and Phillips, 2019; Dinan et al., 2019; Ethayarajh et al., 2019; Hall Maudslay et al., 2019; Stanovsky et al., 2019; Tan and Celis, 2019; Zhou et al., 2019; Zmigrod et al., 2019).
3 Data Preparation
We use the May 2018 dump of Wikipedia to create a corpus for each of the six different gendered languages (i.e., German, Italian, Polish, Portuguese, Russian, and Spanish). Although Wikipedia is not the most representative data source, this choice yields language-specific corpora that are roughly parallel—that is, they refer to the same objects, but are not direct translations of each other (which could lead to artificial word choices). We use UDPipe to tokenize each corpus (Straka et al., 2016).
We dependency parse the corpus for each language using a language-specific dependency parser (Andor et al., 2016; Alberti et al., 2017), trained using Universal Dependencies treebanks (Nivre et al., 2017). An example dependency tree is shown in Figure 1. We then extract all noun–adjective pairs (dependency arcs labeled amod) and noun–verb pairs from each of the six corpora; for verbs, we extract three types of pairs, reflecting the fact that nouns can be direct objects (dependency arcs labeled dobj), indirect objects (dependency arcs labeled iobj), or subjects (dependency arcs labeled nsubj) of verbs. We discard all pairs that contain a noun that is not present in WordNet (Princeton University, 2010). We label the remaining nouns as “animate” or “inanimate” according to WordNet.
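The extraction of noun–adjective pairs described above can be sketched as follows. This is a minimal illustration rather than the pipeline used for the paper: it assumes the parser output is available in the CoNLL-U format of Universal Dependencies (the text does not say which format was used), and the function names `parse_feats` and `gender_adjective_pairs` are hypothetical. It handles only amod arcs and omits the WordNet filtering and animacy labeling.

```python
from collections import Counter

def parse_feats(feats):
    """Turn a FEATS string such as 'Case=Nom|Gender=Fem' into a dict."""
    if feats == "_":
        return {}
    return dict(item.split("=", 1) for item in feats.split("|"))

def gender_adjective_pairs(conllu_text):
    """Count (noun gender, adjective lemma) pairs linked by an amod arc."""
    counts = Counter()
    for block in conllu_text.strip().split("\n\n"):  # one block per sentence
        tokens = {}
        for line in block.splitlines():
            if not line or line.startswith("#"):  # skip comment lines
                continue
            cols = line.split("\t")
            # Keep plain tokens only (skip multiword ranges and empty nodes).
            if len(cols) != 10 or not cols[0].isdigit():
                continue
            tokens[cols[0]] = cols
        for cols in tokens.values():
            head = tokens.get(cols[6])  # column 7 holds the head's token ID
            if cols[7] == "amod" and head is not None and head[3] == "NOUN":
                gender = parse_feats(head[5]).get("Gender")
                if gender is not None:
                    # Adjective is reduced to its lemma; noun to its gender.
                    counts[(gender, cols[2])] += 1
    return counts

# Hand-written toy sentence ("Die zierliche Gabel"), not real parser output.
SAMPLE = "\n".join("\t".join(row) for row in [
    ["1", "Die", "der", "DET", "_", "Case=Nom|Gender=Fem|Number=Sing", "3", "det", "_", "_"],
    ["2", "zierliche", "zierlich", "ADJ", "_", "Case=Nom|Gender=Fem|Number=Sing", "3", "amod", "_", "_"],
    ["3", "Gabel", "Gabel", "NOUN", "_", "Case=Nom|Gender=Fem|Number=Sing", "0", "root", "_", "_"],
])
pairs = gender_adjective_pairs(SAMPLE)
```

Noun–verb pairs could be collected in the same way by matching the dobj, iobj, and nsubj labels, with the noun as the dependent and the verb as the head.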
Next, we lemmatize all words (i.e., nouns, adjectives, and verbs). Each word is factored into a set of lexical features consisting of a lemma, or canonical morphological form, and a bundle of three morphological features corresponding to the grammatical gender, number, and case of that word. For example, the German word for a fork, Gabel, is grammatically feminine, singular, and genitive. For nouns, we discard the lemmas themselves and retain only the morphological features; for adjectives and verbs, we retain the lemmas and discard the morphological features.
For adjectives and verbs, lemmatizing is especially important because it ensures that our tests do not simply reflect the presence of gender agreement, as we describe in Section 2.1. However, this means that if the lemmatizer fails, then our tests may simply reflect gender agreement despite our best efforts. To guard against this, we use a state-of-the-art lemmatizer (Müller et al., 2015), trained for each language using Universal Dependencies treebanks (Nivre et al., 2017). We expect that when the lemmatizer fails, the resulting lemmata will be low frequency. We try to exclude lemmatization failures from our calculations by discarding low-frequency lemmata. For each language, we rank the adjective lemmata by their token counts and retain only the highest-ranked lemmata (in rank order) that account for 90% of the adjective tokens; we then discard all noun–adjective pairs that do not contain one of these lemmata. We repeat the same process for verbs.
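The 90% frequency cutoff described above can be sketched as follows. The function names are hypothetical, and one detail is an assumption: the text does not say whether the lemma that crosses the 90% threshold is itself retained, so this sketch keeps the smallest rank-ordered prefix whose token counts reach at least 90% of the total.

```python
from collections import Counter

def top_lemmata_by_coverage(token_counts, coverage=0.90):
    """Return the highest-ranked lemmata that together cover `coverage` of tokens."""
    total = sum(token_counts.values())
    kept, running = set(), 0
    # Rank by descending token count; ties are broken alphabetically.
    ranked = sorted(token_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for lemma, count in ranked:
        if running >= coverage * total:
            break
        kept.add(lemma)
        running += count
    return kept

def filter_pairs(pair_counts, kept):
    """Discard (noun, lemma) pairs whose lemma is not among the retained lemmata."""
    return Counter({(noun, lemma): c
                    for (noun, lemma), c in pair_counts.items()
                    if lemma in kept})

# Toy counts; the two rare entries stand in for lemmatization failures.
adjective_counts = Counter({"groß": 50, "klein": 30, "alt": 15, "zierlihc": 3, "kaltn": 2})
kept = top_lemmata_by_coverage(adjective_counts)
```

The same two functions could be applied to verb lemmata, and to nouns in the subsequent filtering step.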
Finally, to ensure that our tests reflect the most salient relationships, we also discard low-frequency inanimate nouns and, separately, low-frequency animate nouns using the same process. We provide counts of the remaining noun–adjective and noun–verb pairs in Table 3 (for inanimate nouns) and Table 4 (for animate nouns).
4 Methodology
For each language ℓ ∈{de,it,pl,pt,ru,es}, we define to be the set of adjective lemmata represented in the noun–adjective pairs retained for that language, as described above. We similarly define to be the set of verb lemmata represented in the noun–verb pairs retained for that language, as described above. We then define , , and to be the sets of verbs that take the nouns as direct objects, as indirect objects, and as subjects, respectively. We also define to be the set of grammatical genders for that language (e.g., ), to be the set of cases (e.g., ), and to be the set of numbers (e.g., ). Finally, we define fourteen random variables: and are -valued random variables, and are -valued random variables, and are -valued random variables, and are -valued random variables, and are -valued random variables, and are -valued random variables, and and are -valued random variables. The subscripts “i” and “a” denote inanimate and animate nouns, respectively.
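For reference, the standard definition of the mutual information between two discrete random variables is given below. The symbols are placeholders assumed here for illustration and need not match the notation used elsewhere in the paper: $G$ is a gender-valued random variable with values in $\mathcal{G}$, and $A$ is an adjective-valued random variable with values in $\mathcal{A}$.

```latex
\mathrm{MI}(G; A) \;=\; \sum_{g \in \mathcal{G}} \sum_{a \in \mathcal{A}}
  P(g, a) \, \log \frac{P(g, a)}{P(g)\, P(a)}
```

This quantity is non-negative and equals zero exactly when $G$ and $A$ are independent, which is why a value significantly above what chance would produce indicates a relationship.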
To test for statistical significance, we perform a permutation test. Specifically, we permute the grammatical genders of the inanimate nouns 10,000 times and, for each permutation, recalculate the MI between grammatical gender and adjective using the permuted genders. We obtain a p-value by calculating the proportion of permutations that have a higher MI than the MI obtained using the non-permuted genders; if the p-value is less than 0.05, then we treat the relationship between the grammatical genders of inanimate nouns and the adjectives used to describe those nouns as statistically significant.
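The permutation test can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: MI is estimated from raw co-occurrence counts (a plug-in estimate, in nats), genders are shuffled across noun types (the text does not say whether types or tokens are permuted), and all function names are hypothetical.

```python
import math
import random
from collections import Counter

def mutual_information(joint_counts):
    """Plug-in MI (in nats) from a Counter mapping (x, y) to co-occurrence counts."""
    total = sum(joint_counts.values())
    px, py = Counter(), Counter()
    for (x, y), c in joint_counts.items():
        px[x] += c
        py[y] += c
    mi = 0.0
    for (x, y), c in joint_counts.items():
        if c > 0:
            mi += (c / total) * math.log(c * total / (px[x] * py[y]))
    return mi

def gender_word_counts(noun_word_counts, gender_of):
    """Aggregate (noun, word) counts into (gender, word) counts."""
    joint = Counter()
    for (noun, word), c in noun_word_counts.items():
        joint[(gender_of[noun], word)] += c
    return joint

def permutation_test(noun_word_counts, gender_of, n_permutations=10_000, seed=0):
    """Return the observed MI and the share of permutations with a higher MI."""
    rng = random.Random(seed)
    nouns = sorted(gender_of)
    genders = [gender_of[n] for n in nouns]
    observed = mutual_information(gender_word_counts(noun_word_counts, gender_of))
    higher = 0
    for _ in range(n_permutations):
        shuffled = genders[:]
        rng.shuffle(shuffled)  # reassign genders to nouns at random
        permuted = dict(zip(nouns, shuffled))
        if mutual_information(gender_word_counts(noun_word_counts, permuted)) > observed:
            higher += 1
    return observed, higher / n_permutations

# Toy data: four "Fem" nouns always take word "x", four "Masc" nouns take "y".
toy_counts, toy_gender = Counter(), {}
for noun in "abcd":
    toy_counts[(noun, "x")] = 5
    toy_gender[noun] = "Fem"
for noun in "efgh":
    toy_counts[(noun, "y")] = 5
    toy_gender[noun] = "Masc"
observed_mi, p_value = permutation_test(toy_counts, toy_gender, n_permutations=1000)
```

Fixing the seed makes the sketch reproducible; the same function could be reused for verbs, cases, and numbers by changing what the second element of each pair represents.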
To test whether there are relationships between the grammatical genders of inanimate nouns and the verbs that take those nouns as direct objects, as indirect objects, and as subjects, we calculate the MI between the grammatical genders of those nouns and each of these three sets of verbs. Again, all probabilities are calculated with respect to inanimate nouns only, and we perform permutation tests to test for statistical significance. We also calculate six NMI variants for each of the three pairs of random variables, using normalizers that are analogous to those in Eq. (2) through Eq. (7).
As a baseline, we test whether there are relationships between the grammatical genders of inanimate nouns and the cases and numbers of those nouns; that is, we calculate the MI between gender and case and the MI between gender and number using probabilities that are calculated with respect to inanimate nouns only. Again, we perform permutation tests (but we do not expect that there will be statistically significant relationships), and we calculate six NMI variants for each pair of random variables using normalizers that are analogous to those in Eq. (2) through Eq. (7).
Finally, we calculate the same six MI values (for adjectives; for verbs that take the nouns as direct objects, as indirect objects, and as subjects; for cases; and for numbers) using probabilities calculated with respect to animate nouns only. The first four of these are intended to serve as a “skyline,” while the last two are intended to serve as a sanity check (i.e., we expect them to be close to zero, as with inanimate nouns). Again, we perform permutation tests to test for statistical significance, and we calculate six NMI variants for each pair of random variables.
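Normalizing MI can be sketched as follows. The six normalizers of Eq. (2) through Eq. (7) are not restated here, so the four entropy-based normalizers below (the minimum, maximum, arithmetic mean, and geometric mean of the two marginal entropies) are assumptions chosen for illustration; they are common choices but may not coincide with the ones used in the paper. All names are hypothetical.

```python
import math
from collections import Counter

def marginals(joint_counts):
    """Marginal counts for each side of a Counter keyed by (x, y) pairs."""
    px, py = Counter(), Counter()
    for (x, y), c in joint_counts.items():
        px[x] += c
        py[y] += c
    return px, py

def entropy(counts):
    """Plug-in entropy (in nats) of a distribution given as counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values() if c > 0)

def mutual_information(joint_counts):
    """Plug-in MI (in nats) from joint co-occurrence counts."""
    total = sum(joint_counts.values())
    px, py = marginals(joint_counts)
    return sum((c / total) * math.log(c * total / (px[x] * py[y]))
               for (x, y), c in joint_counts.items() if c > 0)

# Illustrative normalizers only; each maps the two marginal entropies to a scale.
NORMALIZERS = {
    "min": min,
    "max": max,
    "arithmetic": lambda hx, hy: (hx + hy) / 2,
    "geometric": lambda hx, hy: math.sqrt(hx * hy),
}

def normalized_mi(joint_counts, normalizer="min"):
    """MI divided by the chosen entropy-based normalizer."""
    px, py = marginals(joint_counts)
    denom = NORMALIZERS[normalizer](entropy(px), entropy(py))
    if denom == 0:
        raise ValueError("normalizer is zero; NMI is undefined")
    return mutual_information(joint_counts) / denom

# Toy joint counts: gender fully determines the word in the first case,
# and is independent of it in the second.
perfect = Counter({("Fem", "x"): 20, ("Masc", "y"): 20})
independent = Counter({("Fem", "x"): 10, ("Fem", "y"): 10,
                       ("Masc", "x"): 10, ("Masc", "y"): 10})
```

Dividing by an entropy-based quantity puts values from variables with very different numbers of outcomes (for example, thousands of adjective lemmata versus two or three genders) on a more comparable scale, which is the motivation for reporting several variants.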
5 Results
In the first row of Table 1, we provide the MI between the grammatical genders of inanimate nouns and the adjectives used to describe those nouns for each language ℓ ∈ {de, it, pl, pt, ru, es}. For all six languages, this MI is statistically significant (i.e., p < 0.05), meaning that there is a relationship between the grammatical genders of inanimate nouns and the adjectives used to describe those nouns. Rows 2–4 of Table 1 contain the MI between the grammatical genders of inanimate nouns and the verbs that take those nouns as direct objects, as indirect objects, and as subjects, respectively, for each language. For all six languages, the MI values for direct objects and for subjects are statistically significant (i.e., p < 0.05). For five of the six languages, the MI for indirect objects is statistically significant, but because of the small number of noun–verb pairs involved, we caution against reading too much into this finding. We note that direct objects are closest to verbs in analyses of constituent structures, followed by subjects and then indirect objects (Chomsky, 1957; Adger, 2003). Finally, the last two rows of Table 1 contain the MI between grammatical gender and case and the MI between grammatical gender and number, respectively, for each language. We do not find any statistically significant relationships for either case or number.
Table 1: MI values for inanimate nouns.

MI of gender with | de | it | pl | pt | ru | es |
---|---|---|---|---|---|---|
Adjectives | 0.0310 | 0.0500 | 0.0225 | 0.0400 | 0.0520 | 0.0664 |
Verbs (direct objects) | 0.0290 | 0.0232 | 0.0109 | 0.0129 | 0.0440 | 0.0090 |
Verbs (indirect objects) | 0.0743 | 0.6973 | 0.0514 | 0.0230 | 0.0640 | 0.0184 |
Verbs (subjects) | 0.0276 | 0.0274 | 0.0226 | 0.0090 | 0.0270 | 0.0090 |
Case | < 0.001 | N/A | < 0.001 | N/A | < 0.001 | N/A |
Number | < 0.001 | < 0.001 | < 0.001 | < 0.001 | < 0.001 | < 0.001 |
To facilitate comparisons, each subplot in Figure 2 contains six variants of , , and , calculated using normalizers that are analogous to those in Eq. (2) through Eq. (7), for a single language ℓ ∈{de,it,pl,pt,ru,es}. (We omit from each plot because of the small number of noun–verb pairs involved.) For ℓ ∈{it,pl,pt,es}, is larger than and , regardless of the normalizer. For ℓ ∈{it,pl}, is larger than ; is larger than ; and and are roughly comparable—again, all regardless of the normalizer. Meanwhile, is larger than and for the normalizer in Eq. (2), while , , and are all roughly comparable for the other five normalizers. Finally, and are roughly comparable and larger than , regardless of the normalizer.
In other words, the relationship between the grammatical genders of inanimate nouns and the adjectives used to describe those nouns is generally stronger than, but sometimes roughly comparable to, the relationships between the grammatical genders of inanimate nouns and the verbs that take those nouns as direct objects and as subjects. However, the relative strengths of the relationships between the grammatical genders of inanimate nouns and the verbs that take those nouns as direct objects and as subjects vary depending on the language.
In Table 2, we provide the same six MI values, calculated for animate nouns, for each language ℓ ∈ {de, it, pl, pt, ru, es}. As with inanimate nouns, we find that there is a statistically significant relationship between the grammatical genders of animate nouns and the adjectives used to describe those nouns. We also find that there are statistically significant relationships between the grammatical genders of animate nouns and the verbs that take those nouns as direct objects, as indirect objects, and as subjects. Again, the relationship for the verbs that take those nouns as indirect objects involves a small number of noun–verb pairs. As expected, we do not find any statistically significant relationships for either case or number.
Table 2: MI values for animate nouns.

MI of gender with | de | it | pl | pt | ru | es |
---|---|---|---|---|---|---|
Adjectives | 0.0928 | 0.1316 | 0.0621 | 0.0933 | 0.0845 | 0.1111 |
Verbs (direct objects) | 0.0410 | 0.0543 | 0.0273 | 0.0320 | 0.0664 | 0.0091 |
Verbs (indirect objects) | 0.0737 | 0.0543 | 0.0439 | 0.0687 | 0.0600 | 0.0358 |
Verbs (subjects) | 0.0343 | 0.0543 | 0.0258 | 0.0252 | 0.0303 | 0.0192 |
Case | < 0.001 | N/A | < 0.001 | N/A | < 0.001 | N/A |
Number | < 0.001 | < 0.001 | < 0.001 | < 0.001 | < 0.001 | < 0.001 |
Figure 3 is analogous to Figure 2, in that each subplot contains six variants of , , and , calculated using normalizers that are analogous to those in Eq. (2) through Eq. (7), for a single language ℓ ∈{de,it,pl,pt,ru,es}. (As with inanimate nouns, we omit from each plot because of the small number of noun–verb pairs involved.) For ℓ ∈{de,it,pl,pt,es}, is larger than and , regardless of the normalizer. For ℓ ∈{it,pl}, is larger than ; for ℓ ∈{de,pt}, is larger than ; and and are roughly comparable, again all regardless of the normalizer. Meanwhile, is larger than which is larger than for the normalizers in Eq. (2) and Eq. (3), while and are roughly comparable and larger than for the other four normalizers.
Finally, each subplot in Figure 4 contains and , calculated using a single normalizer, for each language ℓ ∈{de,it,pl,pt,ru,es}. Each subplot in Figure 5 analogously contains and , while each subplot in Figure 6 contains and . The NMI values for animate nouns are generally larger than the NMI values for inanimate nouns. The only exception is Polish, where is larger than , regardless of the normalizer.
6 Discussion
We find evidence for the presence of a statistically significant relationship between the grammatical genders of inanimate nouns and the adjectives used to describe those nouns for six different gendered languages (specifically, German, Italian, Polish, Portuguese, Russian, and Spanish). We also find evidence for the presence of statistically significant relationships between the grammatical genders of inanimate nouns and the verbs that take those nouns as direct objects, as indirect objects, and as subjects. However, we caution against reading too much into the relationship for the verbs that take those nouns as indirect objects because of the small number of noun–verb pairs involved. The effect sizes (operationalized as NMI values) for all of these relationships are smaller than the effect sizes for animate nouns. As expected, we do not find any statistically significant relationships for either case or number.
We emphasize that our findings complement, rather than supersede, laboratory experiments, such as that of Boroditsky et al. (2003). We use large-scale corpora and tools from NLP and information theory to test for the presence of even relatively weak relationships across multiple different gendered languages; indeed, the relationships that we find have effect sizes (operationalized as NMI values) that are small. In contrast, laboratory experiments typically focus on much stronger relationships by tightly controlling experimental conditions and measuring speakers’ immediate, real-time speech production. Moreover, although we find statistically significant relationships, we do not investigate the characteristics of these relationships. This means that we do not know whether they are characterized by gender stereotypes, as argued by some cognitive scientists, including Boroditsky et al. (2003). We also do not know whether the relationships that we find are causal in nature. Because MI is symmetric, our findings say nothing about whether the grammatical genders of inanimate nouns cause writers to choose particular adjectives or verbs. We defer deeper investigation of this to future work.
We note that each of our tests can be viewed as a comparison of the similarity of two clusterings of a set of items: specifically, a “clustering” of nouns into grammatical genders and a “clustering” of the same nouns into, for example, adjective lemmata. Although (normalized) MI is a standard measure for comparing clusterings, it is not without limitations (see, e.g., Newman et al. (2020) for an overview). For future work, we therefore recommend replicating our tests using other information-theoretic measures for comparing clusterings.
Acknowledgments
We thank Lera Boroditsky, Hagen Blix, Eleanor Chodroff, Andrei Cimpian, Zach Davis, Jason Eisner, Richard Futrell, Todd Gureckis, Katharina Kann, Peter Klecha, Zhiwei Li, Ethan Ludwin-Peery, Alec Marantz, Arya McCarthy, John McWhorter, Sabrina J. Mielke, Elizabeth Salesky, Arturs Semenuks, and Colin Wilson for discussions at various points related to the ideas in this paper. Katharina Kann approves this acknowledgment.
Appendix A: Counts
Counts of the noun–adjective and noun–verb pairs for all six gendered languages are in Table 3 (for inanimate nouns) and Table 4 (for animate nouns).

Table 3: Counts for inanimate nouns.

Count | de | it | pl | pt | ru | es |
---|---|---|---|---|---|---|
# noun–adj. tokens | 6443907 | 6246856 | 11631913 | 640558 | 32900200 | 3605439 |
# noun–adj. types | 770952 | 666656 | 640107 | 638774 | 1633963 | 368795 |
# noun types | 10712 | 6410 | 5533 | 5672 | 9327 | 6157 |
# adj. types | 4129 | 3607 | 4080 | 3431 | 11028 | 1907 |
# noun–verb (subj.) tokens | 3191030 | 1432354 | 2179396 | 1871941 | 6007063 | 1534211 |
# noun–verb (subj.) types | 445536 | 292949 | 297996 | 337262 | 864480 | 376888 |
# noun (subj.) types | 10741 | 6318 | 5522 | 5780 | 9129 | 7470 |
# verb types | 707 | 702 | 874 | 758 | 1803 | 875 |
# noun–verb (dobj.) tokens | 3440922 | 2855037 | 3964828 | 4850012 | 6738606 | 2859135 |
# noun–verb (dobj.) types | 427441 | 393246 | 236849 | 541347 | 713703 | 576835 |
# noun (dobj.) types | 10504 | 6407 | 4359 | 5896 | 8998 | 11567 |
# verb types | 805 | 806 | 708 | 738 | 1539 | 9746 |
# noun–verb (iobj.) tokens | 163935 | 71 | 54138 | 95009 | 1570273 | 56038 |
# noun–verb (iobj.) types | 50133 | 53 | 18214 | 39738 | 300703 | 24830 |
# noun (iobj.) types | 5520 | 59 | 2258 | 3757 | 8150 | 3574 |
# verb types | 386 | 68 | 417 | 357 | 1816 | 464 |
# noun–case tokens | 14681293 | N/A | 15300621 | N/A | 51641929 | N/A |
# noun–case types | 2252632 | N/A | 1465314 | N/A | 5028075 | N/A |
# noun types | 11989 | N/A | 5839 | N/A | 9692 | N/A |
# case types | 4 | 0 | 7 | 0 | 6 | 0 |
# noun–number tokens | 14681293 | 11588448 | 15300621 | 14631732 | 51641929 | 5672790 |
# noun–number types | 2252632 | 1748927 | 1465314 | 2042626 | 5028075 | 1034307 |
# noun types | 11989 | 7014 | 5839 | 6256 | 9692 | 1593 |
# number types | 2 | 2 | 2 | 2 | 2 | 2 |

Table 4: Counts for animate nouns.

Count | de | it | pl | pt | ru | es |
---|---|---|---|---|---|---|
# noun–adj. tokens | 662760 | 818300 | 1137209 | 712101 | 3225932 | 387025 |
# noun–adj. types | 99332 | 92424 | 97847 | 90865 | 264117 | 50173 |
# noun types | 1998 | 1078 | 954 | 1006 | 2098 | 1320 |
# adj. types | 3587 | 3507 | 3836 | 3176 | 9833 | 1828 |
# noun–verb (subj.) tokens | 637801 | 399747 | 526894 | 456349 | 1516740 | 310569 |
# noun–verb (subj.) types | 113308 | 77551 | 89819 | 89959 | 253150 | 93586 |
# noun (subj.) types | 2056 | 1066 | 969 | 1013 | 2020 | 1477 |
# verb types | 707 | 702 | 874 | 758 | 1799 | 874 |
# noun–verb (dobj.) tokens | 321400 | 388187 | 456824 | 527259 | 494534 | 850234 |
# noun–verb (dobj.) types | 60760 | 55574 | 76348 | 92220 | 118818 | 85235 |
# noun (dobj.) types | 1901 | 1025 | 867 | 1028 | 1912 | 1023 |
# verb types | 804 | 805 | 724 | 737 | 1535 | 745 |
# noun–verb (iobj.) tokens | 51359 | 7 | 43187 | 23139 | 518540 | 23955 |
# noun–verb (iobj.) types | 17804 | 6 | 8440 | 110185 | 11353 | 9586 |
# noun (iobj.) types | 1149 | 6 | 628 | 773 | 1858 | 947 |
# verb types | 378 | 6 | 411 | 340 | 1769 | 456 |
# noun–case tokens | 1926614 | N/A | 1907688 | N/A | 6357089 | N/A |
# noun–case types | 390672 | N/A | 299511 | N/A | 987420 | N/A |
# noun types | 2292 | N/A | 1024 | N/A | 2194 | N/A |
# case types | 4 | 0 | 7 | 0 | 6 | 0 |
# noun–number tokens | 1926614 | 1801285 | 1907688 | 1931315 | 6357089 | 786177 |
# noun–number types | 390672 | 306968 | 299511 | 356352 | 987420 | 200785 |
# noun types | 2292 | 1135 | 1024 | 1072 | 2194 | 1593 |
# number types | 2 | 2 | 2 | 2 | 2 | 2 |
References
Author notes
Equal contribution in this scientific whirlwind.