1. INTRODUCTION
Alessandro Strumia recently published a survey of gender differences in publications and citations in high-energy physics (HEP). In addition to providing full access to the data, code, and methodology, Strumia (2021) systematically describes and accounts for gender differences in HEP citation networks. His analysis points to both ongoing difficulties in attracting women to HEP and an encouraging, though slow, trend of improvement.
However, the time and effort that Strumia (2021) devoted to collating and quantifying the data are not matched by similar rigor in interpreting the results. To support his conclusions, he selectively cites available literature and fails to adequately adjust for a range of confounding factors. For example, his analyses do not consider how unobserved factors (such as a tendency to overcite well-known authors) drive a wedge between quality and citations and correlate with author gender. He also fails to take into account many structural and nonstructural factors that undoubtedly lead to gender differences in productivity, including, but not limited to, direct discrimination and the expectations that women form (and actions they take) in response to it.
We therefore believe that a number of Strumia’s conclusions are not supported by his analysis. Indeed, we reanalyze a subsample of solo-authored papers from his data, adjusting for year and journal of publication, authors’ research age and their lifetime “fame.” Our reanalysis suggests that female-authored papers are actually cited more than male-authored papers. This finding is inconsistent with the “greater male variability” hypothesis that Strumia (2021) proposes to explain many of his results.
In the conclusion to his paper, Strumia states that “… dealing with complex systems, any simple interpretation can easily be incomplete …”. We agree entirely. Strumia’s simple—and, more importantly, simplistic—analysis and interpretation are far from complete.
2. BIASED LITERATURE REVIEW
Strumia (2021) notes that there is a “vast literature” dealing with gender differences in STEM subjects. Scientific analyses of gender differences should represent this literature in an even-handed and unbiased manner; as Del Giudice, Puts et al. (2019) highlight, “An honest, sophisticated public debate on sex differences demands a broad perspective with an appreciation for nuance and full engagement with all sides of the question.” That appreciation for nuance and full engagement is not present in Strumia (2021). For example:
In the introduction to his paper, Strumia asserts that “No significant biases have been found in examined real grant evaluations [Ceci et al., 2014; Ley & Hamilton, 2008; Marsh, Jayasinghe, & Bond, 2011; Mutz, Bornmann, & Daniel, 2012] and referee reports of journals [Borsuk, Aarssen et al., 2009; Ceci et al., 2014; Edwards, Schroeder, & Dugdale, 2018].” Yet a large body of literature—which he fails to cite—reaches the opposite conclusion. (See, for example, Burns, Straus et al., 2019; Card, DellaVigna et al., 2020; Dworkin, Linn et al., 2020; Fox & Paine, 2019; Hengel, 2017; Helmer, Schottdorf et al., 2017; Royal Society of Chemistry, 2019; Steinberg, Skae, & Sampson, 2018; Witteman, Hendricks et al., 2019 and references therein.) Strumia (2021)’s lack of balance in citing relevant work is misleading and arguably disingenuous. An objective analysis of gender differences should aim to be neither.
Strumia (2021) also notes that theoretical modeling of citations is “affected by questionable systematic issues.” We assume the use of “questionable” here is meant to capture relationships that are difficult to identify and quantify (e.g., due to a lack of suitable controls). We agree entirely that all efforts to study citations must be mindful of these limitations. Again, however, Strumia selectively cites just one paper that “[tries] to correct for some social factors,” namely Caplar, Tacchella, and Birrer (2017). Other studies analyzing more restricted samples have come to the opposite conclusion (e.g., Card et al., 2020; Hengel & Moon, 2020).
3. CONFOUNDERS AND STATISTICAL REANALYSIS
In addition to selectively citing the literature, Strumia (2021) fails to consider the potential impact of a broad range of confounding factors on gender differences in citations and the publication process. To help highlight the problems this introduces, one of the authors (EH) reanalyzed a subset of Strumia’s data.
The reanalyzed data contain 5,599 solo-authored articles (5,386 authored by a man and 213 authored by a woman) published in five high-profile physics journals from 2010 to 2016 (inclusive). The selection criteria were designed to address the influence of key confounders in Strumia’s data. First, we restricted the sample to solo-authored articles to account for the fact that male physicists are more likely to be senior authors on papers involving much larger research teams [1]. Second, restricting the data to articles published in a small set of well-known journals made it easier to confirm the quality of gender assignment in Strumia’s data [2]. By including journal-year fixed effects, we also better account for differing citation patterns between fields [3]. Third, younger articles have had less time to accrue citations and older articles are disproportionately by male authors. For that reason, we also restrict our analysis to newer articles (i.e., those published between 2010 and 2016) as well as controlling for journal-year fixed effects.
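To illustrate what journal-year fixed effects do, the following minimal sketch compares papers only within the same journal-year cell. It is not the specification used in the reanalysis, which also includes author-level controls; the actual replication files are in the repository listed under Data Availability. The function names `within_gap` and `raw_gap`, the cell labels, and all numbers are hypothetical and were chosen only so that the arithmetic is easy to follow. With a single regressor, demeaning within each cell is equivalent to including one dummy variable per cell.

```python
from collections import defaultdict

def within_gap(rows):
    """Female-minus-male gap in outcome y, comparing only within cells.

    rows: iterable of (cell, female, y), where cell is a journal-year label,
    female is 1 or 0, and y is the (already transformed) citation outcome.
    """
    cells = defaultdict(list)
    for cell, female, y in rows:
        cells[cell].append((float(female), float(y)))

    num = den = 0.0
    for obs in cells.values():
        # Demean both variables within the cell (the "within" transformation).
        mean_x = sum(x for x, _ in obs) / len(obs)
        mean_y = sum(y for _, y in obs) / len(obs)
        for x, y in obs:
            num += (x - mean_x) * (y - mean_y)
            den += (x - mean_x) ** 2
    return num / den

def raw_gap(rows):
    """Female-minus-male difference in means, ignoring cells."""
    f = [y for _, female, y in rows if female]
    m = [y for _, female, y in rows if not female]
    return sum(f) / len(f) - sum(m) / len(m)

# Hypothetical toy data: women are over-represented in the low-citation cell.
toy = [
    ("A-2010", 0, 1.0), ("A-2010", 1, 1.5), ("A-2010", 1, 1.5),
    ("B-2010", 0, 3.0), ("B-2010", 0, 3.0), ("B-2010", 0, 3.0),
    ("B-2010", 1, 3.5),
]

pooled = raw_gap(toy)       # negative: driven by where authors publish
adjusted = within_gap(toy)  # positive: gap among papers in the same cell
```

In this toy example the pooled comparison suggests that women are cited less, whereas within each journal-year cell women are cited more. The sign flip comes entirely from how authors are distributed across cells, which is the kind of composition effect that fixed effects are meant to absorb.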
An additional difference between Strumia’s study and our own is that we analyze data at the article level instead of aggregating citations over authors. This allows us to better address the following issues discussed in Hengel and Moon (2020):
Male authors disproportionately cluster at the very top and very bottom of the citation distribution, but raw citation counts are truncated from below at zero and unbounded from above. This generates a nonlinear mapping from quality onto citations that depends on the former’s variance. When used as a proxy for quality, average citations for male-authored papers will, as a result, generally place too much weight on high-citation papers and not enough weight on low-citation papers compared to the average for female-authored papers.
To deal with this issue, we transform raw citation counts with the inverse hyperbolic sine function (asinh). We stress, however, that our results do not meaningfully change if we use raw citation counts as the dependent variable instead.
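Both points can be illustrated numerically. The sketch below assumes, purely for illustration, that citations equal a latent quality score censored at zero and that quality is normally distributed; neither assumption is taken from the reanalysis, and the function name `censored_mean` is hypothetical. Under these assumptions, the expected citation count rises with the spread of quality even when mean quality is held fixed. The second part shows two properties of asinh that make it convenient here: it is defined at zero, and it behaves like a logarithm for large counts.

```python
from math import asinh, log
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1

def censored_mean(mu, sigma):
    """E[max(0, q)] for q ~ Normal(mu, sigma^2).

    Illustrative model only: citations are quality censored at zero.
    """
    z = mu / sigma
    return mu * STD_NORMAL.cdf(z) + sigma * STD_NORMAL.pdf(z)

# Same mean quality (mu = 1), different spread.
narrow = censored_mean(1.0, 1.0)
wide = censored_mean(1.0, 2.0)   # larger, although mean quality is unchanged

# asinh handles zero-citation papers, unlike log.
asinh_at_zero = asinh(0)

# For large counts, asinh(x) is close to log(2 * x).
gap_at_100 = abs(asinh(100) - log(200))
```

Because asinh compresses the right tail, a handful of very highly cited papers carries less weight in an average of transformed counts than in an average of raw counts.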
Unobserved (or uncontrolled for) confounders (e.g., winning a prestigious award) boost citations conditional on quality and disproportionately correlate with articles and authors located in the distribution’s right tail. A related concern is that the citations that a paper accumulates are not fixed in time. As a result, they could be influenced by the future success or failure of a paper’s authors (i.e., even among nonsuperstar physicists, a stronger publishing record later on probably drives citations to earlier work, all else being equal). For evidence see, for example, Bjarnason and Sigfusdottir (2002). Both factors potentially correlate with gender: For example, women produce fewer papers than men and are proportionately less likely to win the Nobel Prize.
To deal with these issues, we control for the year in which an author was first published and her total number of lifetime publications [4]. Our results should therefore be interpreted as gender differences in citations between authors who began their careers around the same time and had accumulated similar lifetime “fame” at the time citations were collected.
This is not what we observe. Our evidence suggests that female-authored papers receive about 12 log points more citations than male-authored papers, conditional on covariates. This figure is weakly statistically significant.
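As a rough guide to the size of this estimate, a gap of 12 log points can be converted into a percentage difference. The sketch below is an interpretation aid, not part of the reanalysis. It assumes that a coefficient on an asinh-transformed outcome can be read as a log difference, which is an approximation that works best when citation counts are not close to zero. The function names and the 20-citation baseline are hypothetical.

```python
from math import asinh, exp, sinh

def log_points_to_percent(log_points):
    """Convert a gap in log points into an approximate percentage gap."""
    return (exp(log_points / 100.0) - 1.0) * 100.0

def shift_on_asinh_scale(citations, log_points):
    """Citation count implied by adding the gap on the asinh scale."""
    return sinh(asinh(citations) + log_points / 100.0)

percent_gap = log_points_to_percent(12)   # roughly 12.7 percent

# Hypothetical baseline: a paper with 20 citations.
shifted = shift_on_asinh_scale(20, 12)
```

For a baseline of 20 citations, the shift on the asinh scale corresponds to between two and three additional citations, which is in line with the percentage reading.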
Figure 1 shows the distribution of citations among male vs. female-authored papers in this sample for both transformed and nontransformed citations. Note that the distribution of citations to male-authored papers closely overlaps with the distribution of citations to female-authored papers across the entire range of the distribution.
Estimated gender differences in citations at the mean for each of the five journals are shown in Figure 2. They consistently suggest either no statistically significant gender gap in citations or a citation gap that favors women.
We are aware that automatic and binary classification here can be problematic (Keyes, 2018), especially for nonbinary and transgender physicists. We have followed Strumia (2021)’s use of a binary classification scheme, as we wish to highlight issues with the original analysis and we do not have access to demographic information provided by the authors within the present work. But we recognize the shortcomings of a reductive analysis to an apparent gender binary in such bibliometric analyses and discussions (Strauss, Borges et al., 2020; Rasmussen, Maier et al., 2019) and urge journals to collect demographic data regarding authorship more routinely and to promote the use of self-identity.
4. THE HIGHER MALE VARIABILITY HYPOTHESIS
Despite the admission in Strumia (2021) that the simple interpretation laid out therein “can easily be incomplete,” the data are nonetheless explained in the context of the highly contentious higher male variability (HMV) hypothesis and a biological basis of difference, together with gendered differences in interests. Once again, there is a lack of appropriate representation and citation of the relevant extensive literature base.
The HMV hypothesis is widely contested and debated; both Gray, Lyth et al. (2019) and Stevens and Haidt (2017) provide systematic and even-handed discussions. Note, in particular, the geographical variation highlighted by Gray et al. (2019) in relation to the HMV hypothesis, which counters the claims that any observed gender differences are biological in origin:
… we find that there is significant heterogeneity between countries, and that much of this can be quantified using variables applicable across these assessments (such as test, year, male-female effect size, mean country size, and Global Gender Gap Indicators).
Strumia’s undue and unwarranted confidence in the HMV interpretation—given his admission that the data are influenced by “questionable systematic issues”—is such that the abstract closes with the claim that the “quantitative shape” of the data can be “fitted by higher male variability.” As highlighted by Hyde (2014),
… even if there is slightly greater male variability for some cognitive measures, this finding is simply a description of the phenomenon. It does not address the causes of greater male variability, which could be due to biological factors, sociocultural factors, or both.
Strumia (2021) also does not provide any direct evidence for the causal link he suggests between the HMV hypothesis, biological determinism, and citation rates. Instead, there is nothing more than an inference based on drawing a comparison with what are termed “psychometric observations.” This is an entirely unjustified conflation of correlation and causation and has no place in a rigorous interpretation of the data. Only properly controlled studies allow for a robust distinction between correlation and causation—a fundamental tenet of all statistical analysis. Strumia (2021) admits that those controlled studies are simply not possible.
Similarly, the homogeneity of the people/things dimension (Spelke, 2005; Thelwall, Bailey et al., 2019) is very much overstated in Strumia (2021) in support of the argument that a lack of interest underpins the level of participation by women in HEP. Thelwall et al. (2019), whose work is cited by Strumia in the context of the people/things metric, devote much of the conclusion of their paper to highlighting the deficiencies in their approach:
Thus, the people/things dimensions can only provide a partial explanation for gender differences in topic choices across the full spectrum of academia because there are many important exceptions … Given that the current research has not attempted to assess any cause and effect relationships, deviations from the people/thing dimensions could also be due to other factors within academia that deflect people from pursuing their interests, such as editorial, departmental or funding policies.
Another confounding factor is vocal criticism of women within academia by individuals such as Strumia, who may well be contributing to the problem of women feeling unwelcome in physics. As Halpern, Benbow et al. (2007), whose paper is cited in Strumia (2021), put it,
We conclude that early experience, biological factors, educational policy, and cultural context affect the number of women and men who pursue advanced study in science and math and that these effects add and interact in complex ways. There are no single or simple answers to the complex questions about sex differences in science and mathematics.
5. BIBLIOMETRICS AS A PROXY FOR SCIENTIFIC QUALITY
There is yet a broader issue surrounding Strumia’s methodology: The analysis implicitly assumes not only a direct link between citation rates (normalized as per Eq. 1 of Strumia [2021]) and scientific quality, but, as noted above, it infers—with no direct quantitative evidence—a further link to the HMV hypothesis. Moreover, it is assumed that all authors contribute equally to a paper. Strumia recognizes, to some extent, that this assumption of equal contributions is problematic—“… there is no warranty that each author contributed to each paper”—but his justification is very far from compelling:
Despite this, the data show that the total fractionally counted bibliometric output of collaborations scales, on average, as their number of authors [Rossi, Strumia, & Torre (2019)], suggesting that large collaborations form when scientifically needed and that gift authorship does not play a large role.
In dismissing the role of sociological factors, Strumia (2021) also subjectively (and rather inconsistently) appeals to the reader’s perception of the prestige of top authors: “A physicist might read their names and conclude that no sociological confounder can wash away most of them.” MacRoberts and MacRoberts (2010) describe the process of biased citing, which involves a number of considerations including the “halo effect” (exemplified by the preceding quote from Strumia [2021]), in-house citations, obliteration, and the Matthew effect. Each of these effects, and others (Cowley, 2015), can lead to disproportionate citations of the primary literature. There is also very good evidence that seminal papers can often be cited but not actually read (Ball, 2002; Simkin & Roychowdhury, 2003).
More broadly, the use of citations as a measure of scientific quality is highly questionable and has been the subject of significant debate for decades (see Leydesdorff, Bornmann et al. [2016] for a review). We note that Strumia highlights that “traditional metrics (such as citation counts, h-index, paper counts) now fail to provide reasonable proxies for scientific merit in fundamental physics.” His solution, the introduction of what he terms an “individual citation” metric (Eq. 1 of Strumia, 2021), is not a compelling strategy to isolate scientific quality from citation numbers. As Aksnes, Langfeldt, and Wouters (2019) highlight,
Research quality is a multidimensional concept, where plausibility/soundness, originality, scientific value, and societal value commonly are perceived as key characteristics … citations reflect aspects related to scientific impact and relevance, although with important limitations … there is no evidence that citations reflect other key dimensions of research quality
6. CONCLUSIONS
The analysis presented in Strumia (2021) suffers from a number of deficiencies that severely undermine the inferences drawn and the conclusions reached therein. In particular, the contributions of a wide variety of potential confounding factors are not only unaddressed but are dismissed with no justification. This unwarranted dismissal, coupled with a fundamental confusion of correlation and causation when drawing inferences, is especially problematic in a study that claims to provide an unbiased assessment of the role of gender in citation patterns. Instead, the analysis throughout is very far from neutral or disinterested; the data are interpreted within an ideologically motivated context—as is clear from both the paper itself and previous work by the same author, namely Strumia (2018) and Strumia (2019).
In that broader context, Strumia’s analysis is substantively problematic. The views he espouses have had an impact not only across the physics and STEM communities but have been widely reported across the international media (including Conradi, 2019; Giuffrida & Busby, 2018; Nicholson, 2018; Young, 2019). Physics as a discipline is broadly acknowledged to struggle with both the recruitment and retention of women (Eddy & Brownell, 2016; Kalender et al., 2019; Porter & Ivie, 2019). The widespread dissemination of derogatory and unsubstantiated views about women’s ability in this sphere is not going to overcome this problem. It will be off-putting to talented women who might be considering a career in the sciences and will contribute to the hostile environment endured by many women (and other minorities) currently working in physics. There is an increasing body of research that suggests that an individual’s performance in an academic context can be harmed by an awareness that others’ perception of their work might be distorted by stereotypes (see, for example, Casad, Hale, & Wachs, 2017; Kalender et al., 2019; Shapiro & Williams, 2012). The associated cultural implications can result in minoritized individuals not contributing or disengaging from their academic communities.
Ultimately, the work presented in Strumia (2021) is not merely a flawed, biased, and ideologically motivated analysis. It is also likely to be actively harmful to the progress of women in physics, to the detriment not only of many individuals but of our entire community.
ACKNOWLEDGMENTS
We thank Beck Strauss for helpful discussions regarding issues associated with misclassification of gender.
FUNDING INFORMATION
TBB acknowledges funding of his Research Fellowship from the Royal Academy of Engineering. PM similarly acknowledges the Engineering and Physical Sciences Research Council (EPSRC) for the award of an established career fellowship (EP/T033568/1).
DATA AVAILABILITY
Data and replication files for all analyses presented in this paper are available on GitHub (https://github.com/erinhengel/strumia-qss).
Notes
[1] In contrast, Strumia (2021), Hengel (2017), and Hengel and Moon (2020) assume that each coauthor on a paper contributes equally to it. This relationship, however, is unlikely to be linear when there are a very large number of coauthors, as there often are in fields such as high-energy experimental physics.
[2] To verify gender, we manually searched (via Google) for all the previously identified women and the subset of men with no more than one citation. Gender was identified based on pronouns or inferred from photos. Authors classified as female for whom we found no information, or only information that was ambiguous about gender, are omitted from our analysis (13 observations). As noted in the main text, this procedure would rarely identify individuals as nonbinary. We found that 21–26% of people classified as women are more reasonably classified as men. Moreover, their solo-authored articles tended to receive a disproportionately low number of citations.
[3] Strumia (2021), in contrast, adjusts for field by normalizing a paper’s citation count by the length of its own reference list, which roughly correlates with field (Strumia, 2021; Eq. 1). (We obtain similar results using his normalized indicator of citations.)
[4] To account for age, Strumia (2021) weights each author by one-half times the inverse of the proportion of authors of the same gender who first published in the same year he or she did (Eq. 2). For example, if 300 authors, two of whom were female, first published in 1995, then each female author would be weighted by 75, whereas each male author would be weighted by roughly 0.5.
[5] Effectively, the distribution of citations cannot collapse to a degenerate distribution after conditioning on these variables. As evidence against this possibility, it does not appear that citations are homogeneous, conditional on, for example, journal.
[6] If male and female talent are normally distributed with identical means, then gender differences in variability are equivalent to gender differences in (conditional) averages. Presumably, all physicists are drawn from the top half of each distribution; thus, greater variability in men implies that average male talent is higher than average female talent, conditional on being a physicist.
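The weighting example and the normal-distribution argument in the notes above can be checked with a short sketch. The function names are hypothetical. For the second part, the common mean is set to zero without loss of generality, “top half” is read as talent above that common mean, and the two standard deviations (1.0 and 1.1) are arbitrary illustrative values rather than estimates.

```python
from math import pi, sqrt

def cohort_weight(n_same_gender, n_cohort):
    """One-half times the inverse of the same-gender share of a cohort."""
    return 0.5 * n_cohort / n_same_gender

# 300 first-time authors in a year, two of whom are female.
female_weight = cohort_weight(2, 300)     # 75.0
male_weight = cohort_weight(298, 300)     # roughly 0.5

# Each gender then carries the same total weight within the cohort.
total_female = 2 * female_weight
total_male = 298 * male_weight

def mean_of_top_half(sigma):
    """E[X | X > 0] for X ~ Normal(0, sigma^2): sigma * sqrt(2 / pi)."""
    return sigma * sqrt(2.0 / pi)

# Identical means, slightly different spread (illustrative values only).
low_var = mean_of_top_half(1.0)
high_var = mean_of_top_half(1.1)   # larger conditional mean
```

Because the conditional mean is proportional to the standard deviation, any difference in variability translates directly into a difference in conditional averages under these assumptions, which is why a gap in conditional citations can be read as a test of the variability hypothesis.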