Selective referencing and questionable evidence in Strumia's paper on "Gender issues in fundamental physics"

We have accepted the Editor's invitation to comment on Alessandro Strumia's paper in the current issue of Quantitative Science Studies. Strumia is a controversial figure. His biologistic accounts of the persistent gender gap in science have been subject to heated debate, both in print and on social media. Some researchers argue that Strumia's viewpoints should be ignored. We disagree.


SELECTIVE CITING AND REPORTING
Misrepresenting previous research by leaving out relevant evidence that contradicts one's personal views ("cherry picking"), or highlighting only those results that fit into one's own argument, is at best a questionable research practice. In his paper, Strumia does both. Table 1 lists examples of what we believe are cases of selective citing and biased reporting. For each entry, we first list the references in question, then summarize Strumia's account of these references (where available), and finally specify what we see as problematic about Strumia's representation of the literature. Obviously, we may interpret the studies in question somewhat differently from Strumia, but in this case the account of the literature seems surprisingly skewed in the direction of Strumia's underlying agenda. The list of omitted references that could have added nuance to Strumia's review is too comprehensive to be covered in this comment.

Table 1. Examples of selective citing and biased reporting in Strumia's paper.

Caplar, Tacchella, and Birrer (2017)
Strumia's account: "For example, Caplar, Tacchella, and Birrer (2017) claim (consistent with my later findings) that papers in astronomy written by F authors are less cited than papers written by M authors, even after trying to correct for some social factors." (p. 233)
Our assessment: This is an example of imprecise reporting. In five astronomy journals, papers first-authored by men were, on average, cited approximately 6% more than papers first-authored by women.

Milkman et al. (2015)
Strumia's account: "[L]ooking at gender in isolation (rather than at 'women and minorities'), female students received slightly more responses from public schools (the majority of the sample) with respect to men in the same racial group." (p. 226)
Our assessment: This is an example of selective reporting. Milkman et al. (2015) report that "faculty were significantly more responsive to White males than to all other categories of students, collectively, particularly in higher-paying disciplines and private institutions." Private universities accounted for 37% of the sample.

Witteman et al. (2019)
Strumia's account: Witteman et al. (2019) "found that female grant applications in Canada are less successful when evaluations involve career-level elements" (p. 226).
Our assessment: This is an example of selective reporting. Witteman and colleagues (2019) also found that the sex differences in success rates (in grant obtainment) were marginal when reviewers were asked to rate the proposals independent of track record.

Xie and Shauman (1998); Levin and Stephan (1998)
Our assessment: This is an example of imprecise reporting. Xie and Shauman (1998) observe a 20% gap in research productivity in the late 1980s and early 1990s. However, they also find that "most of the observed sex differences in research productivity can be attributed to sex differences in personal characteristics, structural positions, and marital status." Levin and Stephan (1998) investigate gender differences in publication rates in four disciplines (Physics, Earth science, Biochemistry, and Physiology) and conclude that "in every instance, except the earth sciences, women published less than men, although the difference is statistically significant only for biochemists employed in academe and physiologists employed at medical schools" (p. 1056). The study did not adjust for scientific rank.

Abramo et al. (2009); Way et al. (2016)
Our assessment: In Abramo and colleagues' (2009) study of Italian researchers, female professors and associate professors in the physical sciences had higher publication rates than their male counterparts, while male assistant professors had higher publication rates than female counterparts (see Tables 7-9 in Abramo et al., 2009). Way et al. (2016) study publication productivity in computer science from 1970 to 2010 and find that "Productivity scores do not differ between men and women. This is true even when we consider only men and women who moved up the ranks and, separately, men and women who moved down (p > 0.05, Mann-Whitney)" (see Table 2 in Way et al., 2016). However, they find that in the cohort hired after 2002, men have higher average publication rates than women.

Holman et al. (2018)
Our assessment: Holman and colleagues' (2018) data set does not allow them to directly compare the publication rates of women and men.

Aycock et al. (2019)
Our assessment: This is an example of biased reporting. Aycock et al. (2019) report results from a survey of 455 undergraduate women in physics. Seventy-five percent of these had experienced at least one type of sexual harassment in a context associated with physics.

Thelwall et al. (2018)
Strumia's account: "Large gender differences along the people/things dimension are observed in occupational choices and in academic fields: Such differences are reproduced within sub-fields (Thelwall et al., 2018). In particular, female participation is lower in sub-fields closer to physics, even within fields with their own cultures, such as 'physical and theoretical chemistry' within chemistry (Thelwall et al., 2018). This suggests that the people/things dimension plays a more relevant role than the different cultures of different fields." (p. 248)
Our assessment: The analysis by Thelwall and colleagues (2018) does not offer any substantial evidence that interest plays a greater role than culture.

Gibney (2017); Guarino and Borden (2017)
Strumia's account: "Furthermore, psychology finds that females value careers with positive societal benefits more than do males: (…). Indeed Gibney (2017) finds that women in UK academia report dedicating 10% less time than men to research and 4% more time to teaching and outreach, and Guarino and Borden (2017) finds that women in U.S. non-STEM fields do more academic service than men." (p. 248)
Our assessment: Here, Strumia links women's extra burdens with respect to teaching obligations and academic service to an argument about a female propensity to value careers with positive societal benefits. However, none of these factors are highlighted or examined as potential confounders in his own gender comparisons of publication and citation rates.

Lippa (2010); Su and Rounds (2015)
Our assessment: This is an erroneous interpretation of the literature. With the exception of Lippa (2010), none of the studies listed here directly relate their findings to biological sex differences. Indeed, Su and Rounds (2015) argue that "while the literature has consistently shown the influence of social contexts (e.g., parents, schools) on students' interest development, particularly the development of differential interests for boys and girls (…), little is known about the link between biological factors (e.g., brain structure, hormones) and interest development."

MISGUIDED ASSUMPTIONS
Strumia's questionable citing practices serve as an illustrative example of what sociologists and scientometricians refer to as "referencing as persuasion" (Gilbert, 1977; Latour, 1987).1 Paradoxically, Strumia's own empirical analysis builds on a completely different and more normative conception of what a citation is. In his paper, he claims that citation indicators represent a reliable proxy of scientific merit (i.e., "referencing as rewards": Kaplan, 1965; Merton, 1968). By so doing, Strumia disregards the vast literature demonstrating the drawbacks of using citations as quality indicators (for a recent review, see Aksnes, Langfeldt, & Wouters, 2019). There are very good reasons why Martin and Irvine (1983) chose to equate citations with impact, not merit or quality. Citations are noisy, social measures, and their distributions are skewed, not least due to cumulative effects (Merton, 1968). Many references are perfunctory (Moravcsik & Murugesan, 1975), and citing practices often have a social and persuasive function (as illustrated in Strumia's own paper). They are interesting as indices of symbolic capital in the science system (Bourdieu, 1988). In a tautological sense, they may be indicative of "high performance" to some, and they are certainly (mis)used in evaluative contexts, but it is a major delusion to treat citation indicators as a direct measure of merit. Strumia's use of citations is quite unusual, however, in that he builds an unsubstantiated chain of reasoning from citations to merit, and from merit to nonsensical claims about biological sex differences in physicists' cognitive capacities.

METHODOLOGICAL FLAWS
A foundation for Strumia's analysis is his strong belief in the value of what he calls "large amounts of objective quantitative data about papers, authors, citations, and hires" (p. 225). We are not so impressed with the amounts of data or their quality, let alone what they may be a proxy for.
Bibliometric data are not objective per se, as Strumia implies. They are generally noisy (i.e., faulty, biased, and incomplete; Schneider, 2013). Noise is additive. Thus, citation linkages introduce errors, and so do author and gender disambiguation. There is no reason to assume that such errors are random. Noisy data are rife in the social sciences, especially in areas where data are "big" and processed algorithmically. We should always interpret results derived from such data with caution, especially when the observed differences are small. Large samples may give "precise" estimates, but precise estimates can be systematically biased when the analysis builds on noisy data. Having worked with author and gender disambiguation ourselves (Andersen, Schneider, et al., 2019; Nielsen, Andersen, et al., 2017), we would be more cautious than Strumia in declaring supremacy of data quantity over data quality.
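The point that precision is not validity can be illustrated with a minimal, hypothetical simulation (our own construction, not based on Strumia's data; all numbers are invented): when every observation carries the same systematic measurement error, a very large sample produces a tight standard error around the wrong value.

```python
# Hypothetical sketch: a large sample yields a "precise" estimate that is
# nonetheless systematically biased. All numbers are invented for illustration.
import numpy as np

rng = np.random.default_rng(42)

true_mean = 10.0        # the quantity we would like to measure
systematic_bias = -1.0  # e.g., links or authorships systematically lost in matching
n = 1_000_000           # "big" data

# Every recorded value carries the same systematic error plus random noise.
recorded = true_mean + systematic_bias + rng.normal(0.0, 5.0, size=n)

estimate = recorded.mean()
std_error = recorded.std(ddof=1) / np.sqrt(n)

# The standard error is tiny, yet the estimate is off by roughly 1.0:
# more data shrinks the random error but leaves the systematic error intact.
print(f"estimate: {estimate:.3f} +/- {std_error:.3f} (true mean: {true_mean})")
```

Collecting more of the same noisy data shrinks the reported uncertainty but does nothing to the bias; only better data, or an explicit model of the error process, can.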
Like many other bibliometric studies, Strumia's analysis is data driven, and nowhere do we get the impression that a preanalysis plan has been specified or followed. A careful preanalysis plan decreases "researcher degrees of freedom" (Simmons, Nelson, & Simonsohn, 2011) in planning, running, analyzing, and reporting opportunistically, and provides reassurance that the findings are not just the outcome of extensive data mining. Ruling out data mining and data-dependent analysis is essential when studies pretend to be confirmatory with causal-like statements. Strumia's study is exploratory, and this has implications for what can be made of the results. The flexibility in sampling, processing, and analytical choices obviously implies that the results are conditional and that different choices could have produced different results. Without a preanalysis plan or a multiverse of potential alternative analyses (Steegen, Tuerlinckx, et al., 2016), selective reporting and confirmation bias seem likely. In such situations, statistical significance tests are uninterpretable (Gelman & Loken, 2013).

1. While Gilbert's "persuasion" concerns the use of "acknowledged" references to boost one's own work, Latour argues that citing authors often deliberately misrepresent and distort the works they allude to by twisting the meaning to suit their own ends. We believe that Strumia practices both forms of persuasion.
The hiring analysis presented in Strumia's Section 3.2 is based on "big" longitudinal data of questionable quality. Strumia claims that the HepNames database used in this part of the analysis offers "precise career information" (p. 229). However, a quick-and-dirty lookup of five renowned Danish fundamental physicists returned no useful information about "first hirings" that could go into such an analysis.2 Strumia seems aware that his data are flawed. He ends up with a sample of "about 10,000 first hires" and supplements these with a sample of "unbiased 'pseudohires'" (p. 229). The first of these samples is clearly a convenience sample plagued by selection bias. The "pseudohires" are indeed pseudo; if not, then we would assume that Strumia would have used only the "pseudohires" as a proxy.
2. For example, Nils Overgaard Andersen, Jens Hjorth, Flemming Besenbacher, Lene Vestergaard Hau, and Sune Lehmann. We also looked up some physicists for whom information was present, albeit not in standardized form (e.g., Benny Lautrup and Andrew Jackson).

Strumia has granted us access to the raw data used in the hiring analysis. Our inspection of this data set reveals that a large share of the listed authors do not have any publications or citations prior to being hired (see Figure 1). We estimate that first-hires without any publications or citations account for up to 40% of the listed authors in the early period of the sample. In other words, the hiring analysis is based on questionable data. Strumia's hiring analysis also includes suspicious yearly fluctuations in the average number of fractionally counted papers and individual citations at the first hiring moment. Further, it does not provide any annual baseline of how many men and women were hired (see Figure 4 and Figures S2 and S3 in Strumia's paper). The longitudinal perspective is also misleading given that a large share of the authors hired in the early period of the sample lack any recorded publications or citations prior to hiring.

Confounding is a major challenge in bibliometric research, and especially so in observational studies of hiring and selection. Strumia's analysis is no exception. The analytical approach is overly simplistic and atheoretical, and Strumia does not offer any convincing solutions for how to deal with the many potential confounders that plague the analysis. Indeed, inelegant attempts are made to rule out the influence of selected confounders (including institutional prestige, continent, and scientific age), but all of these confounders are examined in isolation (see Figures S2-S4 in Strumia's paper). From a social science perspective, this makes the hiring analysis unavailing.
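Why examining confounders only one at a time is insufficient can be made concrete with a small simulation (again our own illustrative construction; the variable names and effect sizes are invented and are not Strumia's data): two confounders jointly account for a group difference in the outcome, and adjusting for either one alone still leaves a sizeable spurious "group effect".

```python
# Hypothetical sketch: adjusting for confounders in isolation vs. jointly.
# Variable names and effect sizes are invented for illustration only.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

age = rng.normal(size=n)       # stand-in for scientific age
prestige = rng.normal(size=n)  # stand-in for institutional prestige

# Group membership is correlated with BOTH confounders...
group = (age + prestige + rng.normal(size=n) > 0).astype(float)

# ...but the outcome depends only on the confounders, not on the group.
outcome = 2.0 * age + 2.0 * prestige + rng.normal(size=n)

def group_coefficient(*covariates):
    """OLS of outcome on group (plus covariates); return group's coefficient."""
    X = np.column_stack([np.ones(n), group, *covariates])
    beta, *_ = np.linalg.lstsq(X, outcome, rcond=None)
    return beta[1]

print(f"no adjustment:        {group_coefficient():.2f}")
print(f"adjust age only:      {group_coefficient(age):.2f}")
print(f"adjust prestige only: {group_coefficient(prestige):.2f}")
print(f"adjust both jointly:  {group_coefficient(age, prestige):.2f}")  # near zero
```

Each one-at-a-time adjustment shrinks but does not remove the spurious effect; only the joint adjustment drives the group coefficient toward its true value of zero. Real hiring data involve many more, and worse-behaved, confounders than this toy setup.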

FAR-FETCHED CONCLUSIONS
We do not, as such, reject all of Strumia's empirical findings. The slight gender variations observed in the citation and publication distributions are compatible with the results of other bibliometric gender comparisons.3 Note here that in observational settings, such aggregate findings are extremely vulnerable to selection bias. What we do reject is Strumia's far-fetched interpretations of these findings. Here, we present selected statements from the study's conclusion and take issue with the most preposterous and nonsensical claims.
"While many social phenomena could produce different averages, producing different variances would need something that specifically disadvantages research by top female authors. Just to take one example of a social nature, a gender gap in research productivity could arise if better female authors receive more honors and leadership positions that drive them away from research. However, data also show an excess of young authors among those who produced top-cited papers: The excess is observed among both M and F authors. This suggests extending my considerations from possible sociological issues to possible biological issues." [p. 247]

"It is interesting to point out that the gender differences in representation and productivity observed in bibliometric data can be explained at face value (one does not need to assume that confounders make things different from what they seem), relying on the combination of two effects documented in the scientific literature: differences in interests and in variability." [p. 248]

The claims made here are speculative, empirically unsubstantiated, and founded on twisted assumptions. First, there is no reason to believe that differences in averages are more likely to stem from social factors than differences in variability, at least not when it comes to scientific performance. The argument presented here does not follow logically from the results. Second, Strumia's biologistic reading of the literature on gender differences in interests is misguided (see Table 1). Third, Strumia does not measure intelligence in his analysis. Thus, his assertion that sex differences in variability "explain" gender differences in productivity is both unreasonable and unwarranted. Fourth, extant research on intelligence and scientific productivity is scarce and does not suggest any direct relationship between the two (Bayer & Folger, 1966; Cole & Cole, 1974).
Fifth, Strumia's speculations about a higher male variability in fundamental physics have no empirical basis in the peer-reviewed scientific literature.