We have accepted the Editor’s invitation to comment on Alessandro Strumia’s paper in the current issue of Quantitative Science Studies. Strumia is a controversial figure. His biologistic accounts of the persistent gender gap in science have been subject to heated debate—both in print and on social media. Some researchers argue that Strumia’s viewpoints should be ignored. We disagree.
Despite overwhelming evidence of gender-related disadvantages, discrimination, and harassment (e.g., Brower & James, 2020; Budden, Tregenza, et al., 2008; Carli, Alawa, et al., 2016; Edmunds, Ovseiko, et al., 2016; El-Alayli, Hansen-Brown, & Ceynar, 2018; Guarino & Borden, 2017; Ilies, Hauserman, et al., 2003; Jagsi, Griffith, et al., 2016; Kabat-Farr & Cortina, 2014; Knobloch-Westerwick, Glynn, & Huge, 2013; Krawczyk & Smyk, 2016; Lerchenmueller & Sorenson, 2018; MacNell, Driscoll, & Hunt, 2015; National Academies of Sciences, Engineering, and Medicine, 2018; Reuben, Sapienza, & Zingales, 2014; Rivera, 2017; Rivera & Tilcsik, 2019; Sheltzer & Smith, 2014; Smyth & Nosek, 2015), Darwinist beliefs that science’s gender gap is best explained by a natural selection of the best and the brightest still echo in the corridors of many research institutions.
We find it crucial to expose the questionable evidence used to promote such beliefs. Strumia’s paper offers a case in point.
We structure our critique of Strumia’s paper in four parts. First, we document practices of selective citing and reporting in the study’s framing and conclusions. Second, we expose the questionable bibliometric assumptions guiding the empirical analysis. Third, we highlight data limitations and methodological flaws in Strumia’s analysis, and, fourth, we take issue with the bold and far-fetched interpretations presented in the study’s conclusion.
1. SELECTIVE CITING AND REPORTING
Misrepresenting previous research by leaving out relevant evidence that contradicts one’s personal views (“cherry picking”), or by highlighting only those results that fit one’s own argument, is at best a questionable research practice. In his paper, Strumia does both. Table 1 lists examples of what we believe are cases of selective citing and biased reporting. The left column displays the references in question, the middle column summarizes Strumia’s account of these references, and the right column specifies what we see as problematic about Strumia’s representation of the literature. Obviously, we may interpret the studies in question somewhat differently from Strumia, but in this case, the account of the literature seems surprisingly skewed in the direction of Strumia’s underlying agenda. The list of omitted references that could have added nuance to Strumia’s review is too comprehensive to be covered in this comment.
Cited reference in question | Strumia’s interpretations | Problems with Strumia’s interpretations |
---|---|---|
Caplar et al. (2017) | “For example, Caplar, Tacchella, and Birrer (2017) claim (consistent with my later findings) that papers in astronomy written by F authors are less cited than papers written by M authors, even after trying to correct for some social factors.” (p. 233). | This is an example of imprecise reporting: In five astronomy journals, papers first-authored by men, on average, were cited approximately 6% more than papers first-authored by women. |
Milkman et al. (2015) | “[L]ooking at gender in isolation (rather than at “women and minorities”), female students received slightly more responses from public schools (the majority of the sample) with respect to men in the same racial group.” (p. 226). | This is an example of selective reporting. Milkman et al. (2015) report that “faculty were significantly more responsive to White males than to all other categories of students, collectively, particularly in higher-paying disciplines and private institutions.” Private universities accounted for 37% of the sample. |
Witteman et al. (2019) | “found that female grant applications in Canada are less successful when evaluations involve career-level elements” (p. 226). | This is an example of selective reporting. Witteman and colleagues (2019) also found that the sex differences in success rates (in grant obtainment) were marginal when reviewers were asked to rate the proposals independent of track record. |
Xie and Shauman (1998), Levin and Stephan (1998), Abramo et al. (2009), Larivière et al. (2013), Way et al. (2016), Holman et al. (2018) | “Bibliometric attempts to recognize higher merit […] found that male faculty members write more papers.” (p. 226). | This is an example of imprecise reporting. Xie and Shauman (1998) observe a 20% gap in research productivity in the late 1980s and early 1990s. However, they also find that “most of the observed sex differences in research productivity can be attributed to sex differences in personal characteristics, structural positions, and marital status.” |
 | | Levin and Stephan (1998) investigate gender differences in publication rates in four disciplines (Physics, Earth science, Biochemistry, and Physiology) and conclude that “in every instance, except the earth sciences, women published less than men, although the difference is statistically significant only for biochemists employed in academe and physiologists employed at medical schools” (p. 1056). The study did not adjust for scientific rank. |
 | | In Abramo and colleagues’ (2009) study of Italian researchers, female professors and associate professors in the physical sciences had higher publication rates than their male counterparts, while male assistant professors had higher publication rates than female counterparts (see Tables 7–9 in Abramo et al., 2009). |
 | | Larivière et al. (2013) do not compare the average publication rates of women and men. |
 | | Way et al. (2016) study publication productivity in computer science from 1970 to 2010 and find that “Productivity scores do not differ between men and women. This is true even when we consider only men and women who moved up the ranks and, separately, men and women who moved down (p > 0.05, Mann–Whitney)” (see Table 2 in Way et al., 2016). However, they find that in the cohort hired after 2002 men have higher average publication rates than women. |
 | | Holman and colleagues’ (2018) data set does not allow them to directly compare the publication rates of women and men. |
Aycock et al. (2019) | “Various studies focused on discrimination as a possible source of gender differences. Small samples of female physics students were interviewed by Barthelemy, McCormick, and Henderson (2016) and Aycock, Hazari et al. (2019).” (p. 225). | This is an example of biased reporting: Aycock et al. (2019) report results from a survey of 455 undergraduate women in physics. Seventy-five percent of these had experienced at least one type of sexual harassment in a context associated with physics. |
Thelwall, Bailey et al. (2018) | “Large gender differences along the people/things dimension are observed in occupational choices and in academic fields: Such differences are reproduced within sub-fields (Thelwall et al., 2018). In particular, female participation is lower in sub-fields closer to physics, even within fields with their own cultures, such as ‘physical and theoretical chemistry’ within chemistry (Thelwall et al., 2018). This suggests that the people/things dimension plays a more relevant role than the different cultures of different fields.” (p. 248). | The analysis by Thelwall and colleagues (2018) does not offer any substantial evidence that interest plays a greater role than culture. |
Gibney (2017), Guarino and Borden (2017) | “Furthermore, psychology finds that females value careers with positive societal benefits more than do males: (…). Indeed Gibney (2017) finds that women in UK academia report dedicating 10% less time than men to research and 4% more time to teaching and outreach, and Guarino and Borden (2017) finds that women in U.S. non-STEM fields do more academic service than men.” (p. 248). | Here, Strumia links women’s extra burdens with respect to teaching obligations and academic service to an argument about a female propensity to value careers with positive societal benefits. However, none of these factors are highlighted or examined as potential confounders in his own gender comparisons of publication and citation rates. |
Handley et al. (2015) | “Furthermore, fields that study bias might have their own biases: Stewart-Williams, Thomas et al. (2019) and Winegard, Clark et al. (2018) found that scientific results exhibiting male-favoring differences are perceived as less credible and more offensive. Handley, Brown et al. (2015) found that men (especially among STEM faculty) evaluate gender bias research less favorably than women.” (p. 247). | This is an example of biased reporting. Handley et al. (2015) also found that men evaluated an abstract showing gender bias in research evaluations less favorably than a moderated version of the same abstract indicating no gender bias. This latter result (left out of Strumia’s paper) counters his argument on this matter. |
Ceci et al. (2014), Su et al. (2009), Lippa (2010), Hyde (2014), Su et al. (2015), Thelwall (2018b), Stoet et al. (2018) | “An important clue is that a similar gender difference already appears in surveys of occupational plans and first choices of high-school students (Ceci, Ginther et al., 2014; Xie & Shauman, 2003). This is possibly mainly due to gender differences in interests (Ceci et al., 2014; Hyde, 2014; Lippa, 2010; Stoet & Geary, 2018; Su & Rounds, 2015; Su, Rounds, & Armstrong, 2009; Thelwall, Bailey et al., 2018).” (p. 226). | This is an example of selective citing. Here, Strumia leaves out a vast literature on how prevalent gendered assumptions at play in cultural socialization and upbringing operate to divert men towards and women away from STEM careers. See, for example, Zwick and Renn (2000), Eccles and Jacobs (1990), Jacobs and Eccles (1992), and Jones and Wheatley (1990). |
Su et al. (2009), Diekman et al. (2010), Lippa (2010), Su et al. (2015), Thelwall (2018) | “This suggests extending my considerations from possible sociological issues to possible biological issues. It is interesting to point out that the gender differences in representation and productivity observed in bibliometric data can be explained at face value (one does not need to assume that confounders make things different from what they seem), relying on the combination of two effects documented in the scientific literature: differences in interests (Diekman, Johnson, & Clark, 2010; Lippa, 2010; Su, Rounds, & Armstrong, 2009; Su & Rounds, 2015; Thelwall, Bailey et al., 2018)” … (p. 247–248). | This is an erroneous interpretation of the literature. With the exception of Lippa (2010), none of the studies listed here directly relate their findings to biological sex differences. Indeed, Su and Rounds (2015) argue that “while the literature has consistently shown the influence of social contexts (e.g., parents, schools) on students' interest development, particularly the development of differential interests for boys and girls (…), little is known about the link between biological factors (e.g., brain structure, hormones) and interest development.” |
2. MISGUIDED ASSUMPTIONS
Strumia’s questionable citing practices serve as an illustrative example of what sociologists and scientometricians refer to as “referencing as persuasion” (Gilbert, 1977; Latour, 1987).¹ Paradoxically, Strumia’s own empirical analysis builds on a completely different and more normative conception of what a citation is. In his paper, he claims that citation indicators represent a reliable proxy for scientific merit (i.e., “referencing as rewards”; Kaplan, 1965; Merton, 1968). By doing so, Strumia disregards the vast literature demonstrating the drawbacks of using citations as quality indicators (for a recent review, see Aksnes, Langfeldt, & Wouters, 2019). There are very good reasons why Martin and Irvine (1983) chose to equate citations with impact, not merit or quality. Citations are noisy, social measures, and their distributions are skewed, not least due to cumulative effects (Merton, 1968). Many references are perfunctory (Moravcsik & Murugesan, 1975), and citing practices often have a social and persuasive function (as illustrated in Strumia’s own paper). Citations are interesting as indices of symbolic capital in the science system (Bourdieu, 1988). In a tautological sense, they may be indicative of “high performance” to some, and they are certainly (mis)used in evaluative contexts, but it is a major delusion to treat citation indicators as a direct measure of merit. Strumia’s use of citations is, moreover, quite unusual: He constructs an unsubstantiated chain of reasoning from citations to merit, and from merit to nonsensical claims about biological sex differences in physicists’ cognitive capacities.
3. METHODOLOGICAL FLAWS
A foundation for Strumia’s analysis is his strong belief in the value of what he calls “large amounts of objective quantitative data about papers, authors, citations, and hires” (p. 225). We are not so impressed with the amounts of data or their quality, let alone what they may be a proxy for.
Bibliometric data are not objective per se, contrary to what Strumia implies. They are generally noisy (i.e., faulty, biased, and incomplete; Schneider, 2013), and noise is additive: Citation linkages introduce errors, and so do author and gender disambiguation. There is no reason to assume that such errors are random. Noisy data are rife in the social sciences, especially in areas where data are “big” and processed algorithmically. Results derived from such data should always be interpreted with caution, especially when the observed differences are small. Large samples may give “precise” estimates, but precise estimates can still be systematically biased when the analysis builds on noisy data. Having worked with author and gender disambiguation ourselves (Andersen, Schneider, et al., 2019; Nielsen, Andersen, et al., 2017), we would be more cautious than Strumia in declaring the supremacy of data quantity over data quality.
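The precision-versus-bias point lends itself to a quick demonstration. Below is a minimal simulation sketch of our own (assuming Python with NumPy; all numbers are hypothetical), in which one group’s output is systematically undercounted by 5%, as an imperfect disambiguation procedure might do. The apparent “gap” does not shrink as the sample grows; it only looks more precise.

```python
import numpy as np

rng = np.random.default_rng(42)

for n in [1_000, 100_000, 10_000_000]:
    # Both groups have identical "true" productivity (lognormal, as
    # bibliometric counts typically are), but group B is systematically
    # undercounted by 5% -- a hypothetical disambiguation error.
    group_a = rng.lognormal(mean=1.0, sigma=1.0, size=n)
    group_b = rng.lognormal(mean=1.0, sigma=1.0, size=n) * 0.95
    gap = group_a.mean() - group_b.mean()
    se = np.sqrt(group_a.var() / n + group_b.var() / n)
    print(f"n = {n:>10,}: gap = {gap:.3f} +/- {1.96 * se:.3f}")

# The confidence interval shrinks with n, but it shrinks around a nonzero
# value that reflects measurement error, not any real group difference.
```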
Like many other bibliometric studies, Strumia’s analysis is data driven, and nowhere do we get the impression that a preanalysis plan has been specified or followed. A careful preanalysis plan decreases “researcher degrees of freedom” (Simmons, Nelson, & Simonsohn, 2011) in planning, running, analyzing, and reporting opportunistically, and it provides some reassurance that the findings are not merely the outcome of extensive data mining. Ruling out data mining and data-dependent analysis is essential when studies pretend to be confirmatory and make causal-like statements. Strumia’s study is exploratory, and this has implications for what can be made of the results. The flexibility in sampling, processing, and analytical choices obviously implies that the results are conditional and that different choices could have produced different results. Without a preanalysis plan or a multiverse of potential alternative analyses (Steegen, Tuerlinckx, et al., 2016), selective reporting and confirmation bias seem likely. In such situations, statistical significance tests are uninterpretable (Gelman & Loken, 2013).
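To make the multiverse idea concrete, the sketch below (entirely hypothetical data and analytic choices of our own, in the spirit of Steegen et al., 2016; assumes pandas) enumerates a small grid of defensible specifications. Reporting the distribution of estimates across the whole grid, rather than one hand-picked specification, shows how much a headline result can depend on researcher degrees of freedom.

```python
import itertools
import numpy as np
import pandas as pd

# Entirely hypothetical author-level data; column names are illustrative.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "gender": rng.choice(["F", "M"], size=1000),
    "papers": rng.poisson(5, size=1000),
    "career_years": rng.integers(1, 30, size=1000),
})

# Each axis is one defensible analytic choice; the multiverse is the
# Cartesian product of all of them.
thresholds = [1, 3]                    # minimum papers for inclusion
normalizations = ["raw", "per_year"]   # productivity definition
trims = [None, 0.99]                   # winsorizing of extreme outliers

for min_p, norm, trim in itertools.product(thresholds, normalizations, trims):
    d = df[df["papers"] >= min_p]
    y = d["papers"] / d["career_years"] if norm == "per_year" else d["papers"]
    if trim is not None:
        y = y.clip(upper=y.quantile(trim))
    gap = y[d["gender"] == "M"].mean() - y[d["gender"] == "F"].mean()
    print(f"min_papers={min_p}, norm={norm}, trim={trim}: gap = {gap:.3f}")
```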
The hiring analysis presented in Strumia’s Section 3.2 is based on “big” longitudinal data of questionable quality. Strumia claims that the HepNames database used in this part of the analysis offers “precise career information” (p. 229). However, a quick-and-dirty lookup of five renowned Danish fundamental physicists returned no useful information about “first hirings” that could go into such an analysis.² Strumia seems aware that his data are flawed. He ends up with a sample of “about 10,000 first hires” and supplements these with a sample of “unbiased ‘pseudohires’” (p. 229). The first of these samples is clearly a convenience sample plagued by selection bias. The “pseudohires” are indeed pseudo; if they were not, we assume that Strumia would have relied on the “pseudohires” alone as a proxy.
Strumia has granted us access to the raw data used in the hiring analysis. Our inspection of this data set reveals that a large share of the listed authors do not have any publications or citations prior to being hired (see Figure 1). We estimate that first hires without any publications or citations account for up to 40% of the listed authors in the early period of the sample. In other words, the hiring analysis is based on questionable data. Strumia’s own hiring analysis also exhibits suspicious yearly fluctuations in the average number of fractionally counted papers and individual citations at the first hiring moment. Further, it does not provide any annual baseline of how many men and women were hired (see Figure 4 and Figures S2 and S3 in Strumia’s paper). The longitudinal perspective is also misleading given that a large share of the authors hired in the early period have no registered publications or citations in the database. Given the volatility and skewed nature of the data, we find it peculiar that Strumia reports only mean scores in Figures 4 and 8. Median scores and variances would have underscored the fragility of the data.
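Why medians matter here is easily illustrated. The sketch below (hypothetical numbers, not Strumia’s data) draws “citation” counts from the kind of heavily skewed distribution that bibliometric data typically follow; a handful of outliers moves the mean substantially while leaving the median untouched.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical, heavily skewed citation counts (lognormal), mimicking the
# long-tailed distributions typical of bibliometric data.
citations = rng.lognormal(mean=1.5, sigma=1.5, size=5_000).round()

print(f"mean   = {citations.mean():6.1f}")
print(f"median = {np.median(citations):6.1f}")

# A few highly cited papers dominate the total, so yearly averages over
# small cohorts can fluctuate wildly without any underlying change.
top50 = np.sort(citations)[-50:]
print(f"share of citations held by top 50 of 5,000 papers: "
      f"{top50.sum() / citations.sum():.1%}")
```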
Confounding is a major challenge in bibliometric research, and especially so in observational studies of hiring and selection. Strumia’s analysis is no exception. The analytical approach is overly simplistic and atheoretical, and Strumia does not offer any convincing solutions for how to deal with the many potential confounders that plague the analysis. Indeed, inelegant attempts are made to rule out the influence of selected confounders (including institutional prestige, continent, and scientific age), but all of these confounders are examined in isolation (see Figures S2–S4 in Strumia’s paper). From a social science perspective, this makes the hiring analysis unavailing.
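To see why examining confounders in isolation is insufficient, consider this simulated sketch of our own (assumes pandas and statsmodels; all coefficients are hypothetical). Two confounders are each associated with both gender and citations; adjusting for either one alone leaves residual confounding from the other, while joint adjustment recovers the true null effect.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 5_000

# Hypothetical data generating process: the true gender effect is zero,
# but prestige and scientific age are each associated with gender and
# with citations at first hire.
female = rng.integers(0, 2, size=n)
prestige = rng.normal(size=n) - 0.4 * female
sci_age = rng.normal(size=n) - 0.4 * female
citations = 5 + 2.0 * prestige + 1.5 * sci_age + rng.normal(size=n)
df = pd.DataFrame({"female": female, "prestige": prestige,
                   "sci_age": sci_age, "citations": citations})

# One-at-a-time adjustment (the approach criticized above) leaves
# residual confounding from whichever variable is omitted:
for cov in ["prestige", "sci_age"]:
    fit = smf.ols(f"citations ~ female + {cov}", data=df).fit()
    print(f"adjusting for {cov} only: female = {fit.params['female']:+.2f}")

# Joint adjustment for both confounders recovers the (null) effect:
fit = smf.ols("citations ~ female + prestige + sci_age", data=df).fit()
print(f"joint adjustment:          female = {fit.params['female']:+.2f}")
```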
4. FAR-FETCHED CONCLUSIONS
We do not reject all of Strumia’s empirical findings as such. The slight gender variations observed in the citation and publication distributions are compatible with the results of other bibliometric gender comparisons.³ Note here that in observational settings, such aggregate findings are extremely vulnerable to selection bias. What we do reject is Strumia’s far-fetched interpretations of these findings. Here, we present selected statements from the study’s conclusion and take issue with the most preposterous and nonsensical claims.
While many social phenomena could produce different averages, producing different variances would need something that specifically disadvantages research by top female authors. Just to take one example of a social nature, a gender gap in research productivity could arise if better female authors receive more honors and leadership positions that drive them away from research. However, data also show an excess of young authors among those who produced top-cited papers: The excess is observed among both M and F authors. This suggests extending my considerations from possible sociological issues to possible biological issues. [p. 247]
It is interesting to point out that the gender differences in representation and productivity observed in bibliometric data can be explained at face value (one does not need to assume that confounders make things different from what they seem), relying on the combination of two effects documented in the scientific literature: differences in interests and in variability. [p. 248]
In summary, what Strumia’s gender analysis contributes is (a) a strongly biased representation of the existing literature, (b) a superficial, exploratory citation and publication analysis based on misguided assumptions, (c) an overly simplistic hiring analysis plagued by confounding and noisy data, and (d) highly speculative conclusions based on twisted assumptions and with little or no empirical basis.
Notes
1. While Gilbert’s “persuasion” concerns the use of “acknowledged” references to boost one’s own work, Latour argues that citing authors often deliberately misrepresent and distort the works they allude to by twisting the meaning to suit their own ends. We believe that Strumia practices both forms of persuasion.
2. For example, Nils Overgaard Andersen, Jens Hjorth, Flemming Besenbacher, Lene Vestergaard Hau, and Sune Lehmann. We also looked up some physicists for whom information was present, albeit not in standardized form (e.g., Benny Lautrup and Andrew Jackson).
3. Further, Sabine Hossenfelder and colleagues (2018) seem to corroborate Strumia’s aggregate findings in a comparison of Inspire and arXiv data, albeit with smaller average gender differences and diverging results on the question of gender homophily in citing practices.