Gender bias in funding evaluation: A randomized experiment

Abstract Gender differences in research funding exist, but evidence of bias is elusive and findings are contradictory. Bias has multiple dimensions, but in evaluation processes it manifests in the outcome of the reviewers' assessment. Evidence from observational approaches is often based either on outcome distributions or on modeling bias as the residual, and causal claims are usually mixed with simple statistical associations. In this paper we use an experimental design to measure the effect of a cause: the effect of the gender of the principal investigator (PI) on the score of a research funding application. We embedded a hypothetical research application description in a field experiment. The subjects were the reviewers selected by a funding agency, and the experiment was implemented simultaneously with the funding call's peer review assessment. We manipulated the application item that described the gender of the PI, with two designations: female PI and male PI. The treatment was randomly allocated with block assignment, and the response rate was 100% of the population, avoiding problems of biased estimates in pooled data. Contrary to some research, we find no evidence that male and female PIs received significantly different scores, nor any evidence of same-gender preferences of reviewers regarding the applicants' gender.


INTRODUCTION: GENDER DIFFERENCES AND GENDER BIAS IN RESEARCH FUNDING
There is evidence that gender differences continue to exist in many dimensions of science activities and research outcomes in all countries (Ceci, Ginther et al., 2014). Women show lower application levels, success rates, and levels of research funding (Pohlhaus, Jiang et al., 2011; Suarez, Fiorentin, & Pereira, 2023); differences across genders exist in the proportion of female researchers in many STEM careers (Bello & Galindo-Rueda, 2020); the percentage of women at the top of careers as full professors or in highly prestigious universities is lower (Directorate-General for Research and Innovation (European Commission), 2021); and women have slower career advancement (Wang & Degol, 2017), get less prestigious special chairs (Treviño, Gomez-Mejia et al., 2018), show weaker publication and citation patterns (Mayer & Rathmann, 2018), tend to collaborate less (Aksnes, Piro, & Rørstad, 2019; Gaughan & Bozeman, 2016; Kwiek & Roszka, 2021), and get higher rejection rates (Fox & Paine, 2019). In some of these domains, though, there seems to be a trend of diminishing gaps (Cruz-Castro, Ginther, & Sanz-Menéndez, 2023). Additionally, every single factor interacts with the others; for example, publications could account for differences in funding (Ginther, Basner et al., 2018), or differences in funding could predict career advancement patterns (Bloch, Graversen, & Pedersen, 2014). Moreover, a lack of clear concepts, theories, and causal models is among the factors accounting for the contradictory findings. A focus restricted to outcome distributions and success rates misses the point of the reviewers as actors, and ignores the fact that bias is the product of human action.
Including the evaluators as an essential part in the analysis of bias is not only relevant for a better understanding of the processes causing gender bias; in fact, it has become an important element of the policy discourse and actions regarding gender balance among evaluators and within panels. This has contributed to a shift in analytical focus to include reviewers as units of analysis as well. In this regard, establishing causality is especially relevant for the design of sound policy interventions.
We acknowledge that some research has also focused on the activities of reviewers or panels and their features (gender composition, etc.), in line with the idea of identifying the causal mechanisms or at least the influential factors (Marsh, Jayasinghe, & Bond, 2011).
Underlying theories of the behavior of reviewers highlight different aspects; for example, some psychologists and sociologists have embraced gender role congruity theory (Eagly & Karau, 2002), homophily behavior (Lawrence & Shah, 2020; Murray, Siler et al., 2019), the matching hypothesis (J. R. Cole, 1979), gender similarity (Hyde, 2005), or the existence of gender stereotypes against women (Ellemers, 2018). These works are relevant to the present study because they highlight possible mechanisms.
Economists have followed different lines and assumptions, and have linked the differences in the evaluation behavior of women and men to other theories, for example related to gender differences regarding preferences (Croson & Gneezy, 2009), information availability (Tversky & Kahneman, 1974), or behavioral attitudes towards risk or competition (Niederle & Vesterlund, 2011).
Establishing the causal models and identifying the processes and mechanisms is an essential part of the explanation that is quite often missed (Cruz-Castro & Sanz-Menéndez, 2020;Traag & Waltman, 2022).
As reviewers and panelists are actors interacting with objects (applications, etc.) in the process, if bias exists it is relevant to analyze the evaluation process directly. In this regard, experimental approaches are well suited for measuring the effects of the causes by determining whether female- and male-PI-led funding applications are assessed differently. In principle, all other factors being identical, if gender differences in outcomes are found, it means that the "cause" is the gender of the applicant.
As our focus is on the use of peer review in the allocation of competitive funding and the effect of the applicant's gender, our research questions are: (1) Are female and male PIs assessed differently? (2) Do female and male evaluators differ in their assessment of male and female PI applicants?
In other words, can we identify differences in the rating of competitive funding applications that could be associated exclusively with the gender of the PI or the gender of the reviewer? The identification strategy in the research design should clearly allow us to check whether a particular gender of the PI is the cause of the differences in evaluation outcomes. Experimentation that involves control by the researchers, and includes manipulation and randomization, is a valuable source of evidence, but its use for the analysis of evaluation for funding has been scarce. This paper aims to contribute to this debate on gender bias in research funding evaluation by broadening its empirical foundations.

The remainder of the paper is organized as follows: First, we review what some previous research, based on different methodological approaches (observation and experimentation), has produced in terms of evidence regarding gender bias in the context of the use of peer review for research evaluation; second, we justify the experiment and its contribution; third, we explain the methodology and research design; fourth, we present the results; and finally, we discuss the findings and present some conclusions.

PREVIOUS RESEARCH AND EVIDENCE
Experimental studies of gender differences in research funding are limited, probably due to the difficulties of access to the actors in the evaluation context; however, other topics and methodological approaches related to peer review, such as reviewing papers for journal publication or examining CVs for hiring candidates, have been more extensively addressed with experimental approaches. The literature, however, is rather fragmented.
In this section, we examine previous research and evidence on gender bias emerging from peer review evaluation that can help to contextualize our study, contribute to identifying some analytical perspectives, and suggest some causal links.The previous research is presented according to the type of research design, highlighting its methodological approach vis-à-vis the findings.

Quantitative Observational Evidence
A few decades ago, research highlighted that the gender of researchers could be a factor influencing assessments of quality. Over the years, the evidence and conclusions about gender differences in funding allocation emerging from observational studies of research funding have been contradictory; at the same time, gender differences have narrowed (Cruz-Castro et al., 2023).
The claim that female PIs have their grants funded less often than male PIs, and that when funded they receive smaller amounts, is complicated by the available evidence: there is also evidence of no systematic bias against women in peer review for funding allocation (Ceci & Williams, 2011; Kahn & Ginther, 2018), and of no bias when experienced professional evaluators, with information about the applicant's competence, were involved (Ceci, 2018).
Since the classic study at the NSF by the Coles (Cole & Cole, 1979, 1981; Cole, Cole, & Simon, 1981; Cole, Rubin, & Cole, 1977, 1978), which found small differences in funding by gender but more related to rank or past performance, the controversy has continued. Evaluations of grant applications and success by gender in other countries have also yielded heterogeneous results. Apparent lower success rates of women in funding applications have motivated studies in different countries. The highly cited work by Wennerås and Wold (1997) at the Swedish Medical Research Council (SMRC) examined the scores in different evaluation dimensions and correlated them with some indicators of "merit" (mainly bibliometric) or "social connections"; they found that women needed a higher performance to get funding and inferred a strong bias against them1. However, their data were not available, the analysis was not replicated, and, therefore, the findings were not confirmed (Levy & Kimura, 2009, p. 260). In fact, Sandström and Hällsten (2008), with a similar design and also data from the SMRC, showed that women actually fared better than men. Jayasinghe, Marsh, and Bond (2003), with data from the Australian Research Council (ARC), found that gender differences in success were small and nonsignificant. Ley and Hamilton (2008) also found near-equal U.S. NIH funding success for men and women at all stages of their careers.
Comparative research is scarce, but in a classic meta-review of research in several disciplines and countries, Bornmann, Mutz, and Daniel (2007) reported gender bias in grant funding. However, the finding was later reversed by Marsh, Bornmann et al. (2009), who showed a lack of effect of gender generalized across disciplines, countries, and funding agencies.
Van der Lee and Ellemers (2015a), focusing on reviewers' scores and outcomes in a Dutch NWO funding program, suggested gender influence in the research funding and deduced the existence of gender bias; however, their findings have been methodologically questioned (Albers, 2015; van der Lee & Ellemers, 2015b, 2015c; Volker & Steenbeek, 2015).
More recently, Severin, Martins et al. (2020) examined whether the gender of applicants and peer reviewers and other factors influenced peer review of grant applications submitted to the Swiss National Science Foundation (SNSF). Male applicants received more favorable evaluation scores than female applicants, and male reviewers awarded higher scores than female reviewers, but in multivariable analysis, differences between male and female applicants were attenuated.
Recent contributions in the field of gender disparities in research funding have advanced in the introduction of more complex analyses (looking not only at the gender of the applicant but also at the gender composition of teams; or differentiating between scoring and approval), the use of mixed methods (combining regression models with linguistic analysis of review reports), and the consideration of intersections between gender and research content.
For instance, Bianchini, Llerena et al. (2022), examining the reviews of a pan-European funding scheme (EUROCORES) over more than a decade, linked the outcome of the grant proposal peer review with the gender representation in research consortia; they found a gender effect in the evaluation outcomes of both panel members and reviewers, as applications from consortia with a higher share of female scientists were less successful in panel selection and received lower scores from external reviewers. Interestingly, they also analyzed the evaluative language of written review reports: although the language suggests reviewers did not perceive female scientists as less competent, this was not reflected in the scoring, which remained lower for consortia with higher shares of women.
Relatedly, some studies with Dutch data have pointed out that lower scores do not automatically mean lower success rates. This is the finding of Mom and van den Besselaar (2022) and van den Besselaar and Mom (2022) with data from the European Research Council (ERC) starting grants and NWO, who report that women get systematically lower scores, but that this does not lead to overall bias in the outcomes (success rates); these findings were in line with Bol, de Vaan, and van de Rijt (2022).
In an interesting mixed-method approach, Larregue and Nielsen (2023) analyzed funded and unfunded social science applications submitted to a research council in Western Europe, exploring how applicants' disciplinary, thematic, and methodological orientations intersect with gender to shape funding opportunities. Their descriptive analysis shows that women's proposals were underfunded, with a relative gender difference of around 20%. They then use computational text and mediation analysis, and find that around one-third of this disparity may be attributed to gender differences in disciplinary focus, thematic specializations, and methodologies: proposal assessments appear to devalue qualitative methods and, more broadly, interpretive, descriptive, and exploratory approaches, areas in which women appear more specialized; this is in line with previous research about specialization (Leahey, 2006, 2007).
In summary, some of the differences in the reported results relate to the use of different concepts of gender bias and various operationalization methods, in addition to contextual (country, funding agency, type of program) or sampling effects (Cruz-Castro & Sanz-Menéndez, 2020). Typically, analyses were carried out using single-level models and analytic techniques such as correlation, analysis of variance, tests for proportions, or multiple regression; additionally, observational research is always subject to the standard problems of unobserved heterogeneity or endogeneity, and it does not always include clear causal models. Our aim, however, is not to dismiss the contribution of these approaches, which have been predominant, but rather to advocate for more pluralistic methodological perspectives by showing the value of the experimental method for broadening the empirical bases of the study of evaluation bias.

Natural Experiments
There is a class of observational research that is inappropriately called "natural experiments" (Titiunik, 2021). Natural experiments, sometimes labeled as "quasi-experimental designs" (Shadish, Cook, & Campbell, 2001), claim to take advantage of embedding or contextualizing the analysis in real-life situations. The idea of experiments is associated with control of two dimensions, namely randomization and manipulation (Barrera, Gerxhäni et al., 2023); natural experiments correspond to a type of observational research with no control by the researcher, or at best only include some form of randomization (Deaton & Cartwright, 2018).
Most of this research refers to gender bias in hiring and has focused on a relevant policy topic: the effects of the (gender) composition of the evaluation panels (controlling for quality and proximity), and their changes. The classic "matching hypothesis" regarding the applicant's and reviewer's same department (Cole, 1979) provided the basis for the homophily or same-gender preference claims (Murray et al., 2019); expectations were that increasing the number of women in evaluation panels would favor female scores or success rates. However, empirical results are far from conclusive.
Whether the gender composition of recruiting committees for university professorships matters is a question that has been addressed mostly in the context of country case studies. Taking advantage of the opportunity provided by "natural experiments" where the allocation of reviewers to specific evaluating committees was random (a lottery), Bagues and colleagues, first in Spain (Zinovyeva & Bagues, 2015) and later in Italy (Bagues, Sylos-Labini, & Zinovyeva, 2017), analyzed how a larger presence of female evaluators affected committee decision-making. Their results revealed that having a larger number of women on evaluation committees increased neither the quantity nor the quality of the female candidates who qualified. This work is relevant to the present study because information from individual evaluation reports revealed that female evaluators were not significantly more favorable toward female candidates.

Witteman, Hendricks et al. (2019) used a Canadian Institutes of Health Research funding program, which was divided into two new grant programs, to "differentiate" the effect of the intrinsic quality of the proposal from the merits of the candidate on evaluation outcomes. They argue that gender differences in evaluation (in highly competitive funding programs) were less relevant when the proposal, and not the "caliber" or the CV of the applicant, was the focus of the assessment. Albeit interesting, their analysis did not control for the quality of the PI (e.g., with a measure of past performance); missing such a variable precludes ruling out competing explanations.
In regard to mechanisms, psychology research has pointed to the activation of stereotypes (Fiske, Cuddy et al., 2002).For instance, gender-role congruity theory (Eagly & Karau, 2002) proposes that perceived incongruity between the female gender role and leadership roles leads to two forms of prejudice: perceiving women less favorably than men as potential occupants of leadership roles and evaluating behavior that meets the prescriptions of a leader's role less favorably when it is shown by a woman.
Stereotype-based expectations may also be related to the prior segregation of the area subjected to evaluation (occupations, jobs, or fields), whereby, in male-dominated or female-dominated areas, reinforcement is expected, whereas in neutral areas, similar evaluations across genders will emerge. In this domain, Koch, D'Mello, and Sackett (2015) did a meta-analysis of research findings about workplace decisions according to the gender distribution of jobs. They found that men were preferred for male-dominated jobs (i.e., gender-role congruity bias), whereas no strong preference for either gender was found for female-dominated or integrated jobs; additionally, male evaluators exhibited greater gender-role congruity bias than did female evaluators for male-dominated jobs.
These theoretical perspectives suggest that status and statistical discrimination may operate in evaluation processes and often assume that male professors are more biased by gender-typical stereotypes than their female counterparts (Solga, Rusconi, & Netz, 2023).
The findings related to gender-role congruity and gender homophily (McPherson, Smith-Lovin, & Cook, 2001; Murray et al., 2019) and some empirical anomalies (van den Besselaar & Mom, 2022) have prompted thinking on other potential explanations related to the impact of "social desirability" behavior (Krumpal, 2013), whereby when there are fewer women in a field, they would be favored in evaluations, and where there are more women in the area, it would be men who are favored. We could probably also expect some effects of the gender equality policies in place in particular contexts (Stewart & Valian, 2018).

Laboratory and Survey Experiments
It is relevant to distinguish among different qualities of experimental approaches (Deaton & Cartwright, 2018). Potentially, laboratory and survey experiments make use of both randomization and manipulation. As mentioned, there is not much literature trying to measure and explain the possible existence of gender bias in the context of research funding evaluation, but there is some relevant research addressing the role of gender in the outcomes of evaluation for hiring decisions and acceptance of journal papers.
Of particular relevance for the present study is the work of Moss-Racusin, Dovidio et al. (2012), who conducted a randomized study (n = 127) to investigate experimentally whether science faculty exhibited a bias against female students in a hiring process. Science faculty from research-intensive universities in the United States rated the application materials of a student, randomly assigned either a male or a female name, for a laboratory manager position. Faculty evaluators rated the male applicant as significantly more competent and hirable than the (identical) female applicant. The gender of the faculty participants did not affect responses, so female and male faculty were equally likely to exhibit bias against the female student.

With results to the contrary, Williams and Ceci (Ceci & Williams, 2015; Williams & Ceci, 2015) developed a series of randomized experiments on 873 tenure-track faculty (439 men, 434 women) from biology, engineering, economics, and psychology at 371 US universities and colleges, evaluating applications for tenure-track assistant professorships. Applicants' profiles were systematically varied so that identically qualified women and men could be compared. Results revealed a 2:1 preference for women by faculty of both genders across both math-intensive and non-math-intensive fields, with the single exception being male economists, who showed no preference for one gender or the other.
Experiments have expanded to other countries, and more recently Carlsson, Finseraas et al. (2021) examined the role of bias in academic recruitment by conducting a large-scale survey experiment among faculty in various disciplines from universities in Iceland, Norway, and Sweden. The faculty respondents rated CVs of hypothetical candidates, randomly assigned either a male or a female name, for a permanent position as an associate professor in their discipline. Their results also contradicted some previous findings (Moss-Racusin et al., 2012), because, despite the underrepresentation of women in all fields, the female candidates were viewed as both more competent and more hirable than their male counterparts. They concluded that biased evaluations of equally qualified candidates do not seem to be the key explanation of the persistent gender gap in academia in the Nordic region. However, the participants were the faculty in general, not those deciding or actually involved in the hiring decision-making process.
Also relevant for our work, a recently published paper (Solga et al., 2023) uses a large factorial survey experiment with German university professors, and studies whether male and female committee members evaluate female and male applicants for professorships differently. They found neither differences between male and female professors nor the presence of a Matilda effect2, but some advantage for female applicants in the invitation phase. They also considered that the findings are probably related to the gender equality policy of having a substantial female quota in selection committees. The overall methodological approach is in line with ours, but their focus is on hiring, not on funding.
These previous works have in common the construction of two groups of evaluators who received the same materials, and the randomization of the assignment of the gender of applicants to those evaluators, thereby building a group receiving female applications and a group receiving male applications. In this paper, we take a similar approach. This approach is different from studying gender blinding, where what is tested is the effect of concealing information about the gender of applicants versus otherwise3.
Papers evaluated for journals or conferences have also long been the subject of experimental manipulation, but again, the results are partially contradictory, mainly as a consequence of the research designs (Lloyd, 1990;Paludi & Bauer, 1983).
For example, classic studies reported gender effects against female authors, but others show small or no gender bias. Borsuk, Aarssen et al. (2009) manipulated a published article to reflect different author designations. The article was then reviewed by referees of both genders at various stages of scientific training and experience. Name changing did not influence acceptance rates or quality ratings. However, female postdoctoral researchers were the most critical referees, regardless of the author name provided. Additionally, there was no evidence of same-gender preferences. This study strongly suggests that more experienced women may apply different expectations to peer review, as others found in observational studies (Cruz-Castro & Sanz-Menéndez, 2021).
As we have mentioned, there is limited experimental evidence on gender effects in peer review for research funding, but the NIH in the United States has attracted some attention. Looking at the interaction between gender and race, Forscher, Cox et al. (2019) used 48 NIH R01 grant proposals and modified the PI names to create separate versions of each proposal (White male, White female, Black male, and Black female). They found little to no race or gender bias in initial R01 evaluations. Focusing on race, Nakamura, Mann et al. (2021) tested the specific effects of anonymization (concealing the race and identity) on the scores of Black and White applicants to the US NIH, with the aim of investigating bias against the former. Designed as a test of whether blinded review reduces racial disparities in peer review, they found, interestingly, that it changed the scores of White PIs' applications for the worse, but did not, on average, impact the scores of Black PIs' applications. Although statistically small, differences remained in favor of White PIs' scores, but anonymization reduced that difference.
As becomes evident from the data presented, careful examination of the experimental findings is also needed, not just because the findings usually relate to small n and have limited external validity, but mainly because most of the differences in findings come from the specifications of the research design and the inductive causal analysis.
One of the standard criticisms of laboratory or survey experiments is related to the nonrealistic nature of participants, a common feature in the majority of laboratory experiments; in field experiments or factorial surveys, what is often criticized is the absence of a real context related to the task. We aim to contribute to the literature by addressing some of these shortcomings.

JUSTIFICATION OF THE FIELD EXPERIMENT AND ITS CONTRIBUTION
The standard problems of confounding, lurking, or unobserved heterogeneity have not always been properly addressed in previous observational approaches to research funding and peer review. Research addressing the links between research funding and gender bias has often faced problems of validity, both external validity (as it usually referred to a single country, funding agency, or funding instrument) and internal validity, more related to the research design and the causality approach.
Acknowledging that experiments (control via manipulation of treatments and randomization) will not solve all problems of causality either (Deaton & Cartwright, 2018;Knight & Winship, 2013), we believe that, if properly designed, they can contribute to testing the existence of some regularities when assessing the bias of reviewers in favor of or against a particular PI gender, and to understanding the causal processes (Bendiscioli, Firpo et al., 2022).
At the very least, the influence of the gender of the PI within the peer review practice should be tested to ensure that the general assumptions about peer review objectivity are well founded.With an "experimental" approach, we can focus mainly on the measurement issue and be less dependent on theories and assumptions.
A need for better conceptualization has recently been highlighted in relation to peer review (Derrick, 2019;Hug, 2022), but behind the problems regarding the evidence we have identified, there is also a lack of explicit models of causality (Cruz-Castro & Sanz-Menéndez, 2020;Traag & Waltman, 2022;van den Besselaar, Mom et al., 2020).
For the sake of clarity, by gender disparity we mean a difference in the outcome of interest between male and female applicants, whereas by gender bias we mean any difference between male and female applicants that is directly causally affected (and directly measured) by their gender. A gender disparity may be the result of an indirect causal pathway from someone's gender to a particular outcome and may be affected by differences in merit, but a gender bias is a direct causal effect of the action of reviewers.
Even when concepts are clearly established and are part of rigorous analytical models, and the focus is on the "causes" of "gender bias," most research treats it as a "nonobservable" factor.
To highlight the differences between the observational and experimental analytical approaches, we sketch them in stylized form.
In Figures 1 and 2 we represent two simplified examples of the underlying causal models in the observational and experimental approaches, and the process of investigation included in the research design.
In typical observational research, the question of whether an applicant's gender explains differences in evaluation results is addressed by taking into consideration differences in merit by gender (measured in different forms). Bias, which is often implicit and nonobservable, is considered in this example as the residual that is not explained by the observable variables introduced in the model (unobserved heterogeneity).
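This observational logic can be illustrated with a minimal simulation (a sketch with invented numbers, not data from any study discussed here): scores are generated from merit alone, and a regression of scores on merit plus a gender dummy recovers the gender coefficient, which is the quantity observational studies typically read as the bias "residual".

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Simulated applicant pool: merit (e.g., a publication index) and gender.
female = rng.integers(0, 2, size=n)          # 1 = female PI, 0 = male PI
merit = rng.normal(5.0, 1.5, size=n)

# Scores depend on merit only: by construction there is NO gender bias.
score = 2.0 + 0.8 * merit + rng.normal(0.0, 1.0, size=n)

# Observational-style model: regress score on merit and a gender dummy.
# The gender coefficient plays the role of the residual "bias" estimate.
X = np.column_stack([np.ones(n), merit, female])
beta, *_ = np.linalg.lstsq(X, score, rcond=None)
intercept, b_merit, b_female = beta

print(f"merit coefficient:  {b_merit:.2f}")   # close to the true 0.8
print(f"gender coefficient: {b_female:.2f}")  # close to 0, as generated
```

The sketch also shows the fragility of the design: if an unobserved variable correlated with gender were added to the score equation, the gender coefficient would absorb it, which is exactly the unobserved-heterogeneity problem described above.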
Monitoring whether the gender of the applicant influences the outcome or the scoring usually includes a control for merit (e.g., publications), and, if differences in success or in the levels of merit needed to get the grant are found, as for example in the pioneering work of Wennerås and Wold (1997), the standard conclusion is the existence of gender bias. In Figure 2, due to a research design that equalizes merit and randomizes the gender of applicants, the relation between the gender of the PI and the scoring as the outcome of the assessment can be seen as direct; if differences in scoring are found between genders, these are logically assumed to represent the existence of bias.
In the typical experimental approach, merit is made identical for all applicants and it does not vary by gender (this is why merit is absent in the figure).Therefore, the occurrence of bias can be inferred from differences in scoring of the randomly allocated male-and female-led applications.
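The experimental logic can be sketched the same way (hypothetical reviewer identifiers, block sizes, and score parameters are illustrative only, not the study's actual data): merit is held identical by sending every reviewer the same application, the PI-gender label is randomly assigned within blocks, and the treatment effect is estimated as the difference in mean scores between the two groups.

```python
import random
import statistics

random.seed(42)

# Hypothetical reviewer pool (names and size chosen only for illustration).
reviewers = [f"reviewer_{i:03d}" for i in range(200)]

# Block random assignment: within each block of two reviewers, one receives
# the female-PI version and one the male-PI version of the SAME application.
assignment = {}
for i in range(0, len(reviewers), 2):
    pair = reviewers[i:i + 2]
    random.shuffle(pair)
    assignment[pair[0]] = "female_PI"
    assignment[pair[1]] = "male_PI"

# Simulated scores: the application is identical, so variation comes from
# reviewer noise only; by construction the gender label has zero effect.
scores = {r: random.gauss(7.0, 1.0) for r in reviewers}

female_scores = [scores[r] for r in reviewers if assignment[r] == "female_PI"]
male_scores = [scores[r] for r in reviewers if assignment[r] == "male_PI"]

# Under randomization, this difference in means estimates the causal effect
# of the PI-gender designation on the score.
effect = statistics.mean(female_scores) - statistics.mean(male_scores)
print(f"estimated treatment effect: {effect:+.2f}")
```

Because merit is constant and assignment is random, no merit control is needed: any systematic difference between the two means would be attributable to the manipulated gender designation alone.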
To make it clearer, we will paraphrase the title of a paper (Dawid & Musio, 2022): although observational researchers focus on the "causes of effects," that is, reconstructing how different explanatory variables contribute to a certain observed outcome (e.g., what are the causes of the differences), experimentalists are interested in the "effects of causes," that is, the causal effects of an intervention. In principle, the main goal and contribution of experimental approaches is to provide an explanation for previously established social regularities and to answer questions to test hypotheses from theoretical models (Barrera et al., 2023; Gërxhani & Miller, 2022). As a first rationale, the models themselves contain the causality links and determine what the experimenter needs to manipulate; a second rationale of experimental approaches is to empirically establish the existence or not of that regularity, a purpose that is of special relevance in our topic of interest.
The ideas of control and of intervention or manipulation of a variable (the cause) are at the core of the experimental approach. Basically, in an experiment we modify one or more independent variables and then measure the changes produced in the dependent variable of interest.
Although some experimental research designs have tried to replicate peer evaluation procedures, these approaches have been criticized for their lack of realism, and experimental studies in funding agencies are rare.
To address the empirical research question of whether reviewers in funding agencies assess and score male and female applicants differently, in this work we advocate for field experiments to be embedded in research funding contexts that resemble reality as much as possible, allowing researchers to directly manipulate, allocate, observe, and test whether the gender of the applicant plays any role in the scores given to funding applications by the reviewers.
With this goal in mind, we present a field experiment implemented during the evaluation process, with the same evaluators that the evaluation agency selected for the assessment task, and with the same evaluation criteria and scores used by the funding agency. Embedded experiments, wherein theoretically relevant variables are systematically manipulated in the field, have important benefits for improving causal inference, a critical component in the development of any research field.
We seek to contribute to a literature that suffers from a high degree of inconclusiveness and contradictory evidence. We are not aware of field experiments implemented in real time and context in the evaluation for the allocation of research funding.

DESIGNING AN EXPERIMENTAL APPROACH IN A REAL FUNDING AGENCY
Randomized studies developed in the real world are usually called field experiments, where the word field comes from its original use in agricultural research. But field in social research refers to "setting," and the setting or "place" is just one relevant criterion for assessing the experimentation. Other factors determine the "degree of fieldness" along various dimensions; some of the most important are (Gerber & Green, 2012):
1. authenticity of the treatments: whether the treatment used in the study resembles the intervention of interest in the real world;
2. realism of the participants: whether the participants in the experiment resemble the actors that usually participate in this type of process;
3. genuineness of the context: whether the context within which subjects receive the treatment resembles the context of interest; and
4. truth of the outcome measures: whether the outcome measures resemble the actual outcome of theoretical or practical interest.
The strength of experiments lies in internal validity, but this "naturalistic" approach has also been presented as a way to deal with some unforeseen threats to validity and inference, mostly related to external validity and generalizability (Cook & Campbell, 1979; Shadish et al., 2001), which arise when drawing inferences from laboratory settings. Of course, generalizability also depends on other relevant factors associated with the institutional and cultural context, in addition to population or sample sizes.
From what we know, the implementation of field experiments in research-funding organizations has been limited, usually because of the complexities of dealing with ethical issues in experimentation (Hansen & Tummers, 2020) and in the interest of not affecting the fairness of the allocation processes (Rayzberg, 2019). This is probably why many experimental approaches to research funding have been set in somewhat "unrealistic" or artificial contexts (Eden, 2017).
Our research questions are:
1. Are female and male PI applications for funding assessed differently?
2. Do male and female evaluators differ in their assessment of male and female PI applicants?
To answer these questions empirically, we implemented the field experiment through a web-based factorial survey (Auspurg & Hinz, 2015) administered to all evaluators appointed by a funding agency (realism of participants) to assess the applications to a research funding instrument.
In our field experiment, we took advantage of the overall organization of the evaluation process to implement the experiment in the same period in which the reviewers were evaluating the real submitted applications to the call (genuineness of the context).
The experiment was embedded in the process of the evaluation of, and simultaneously with, a funding call for university research groups (Consolidated and emerging research groups of the Galician University System) of the Galician Regional Government in Spain, set by the General Secretariat for Universities (SXU) (note 4). The evaluation process was arranged and organized by ACSUG, the Galician Agency for the Quality and Accreditation of the Regional University System (note 5).
We embedded the experiment in a survey to all reviewers in June 2022, the same month in which they were doing the real evaluation work (note 6). The general objective of the survey was to analyze the opinion of reviewers about the evaluation process and the appropriateness of the criteria for merit assessment defined in the call (note 7).

4. Information about the call can be found at https://www.edu.xunta.gal/portal/es/node/36119 (accessed April 8, 2023). For more information about the description of the funding program of the SXU, the criteria and weighting for evaluation of applications, and the evaluation procedure, see the Supplementary material (SM 1).
5. For the Evaluation Unit (ACSUG) ethical standards, see https://www.acsug.es/ (accessed April 8, 2023).
6. After careful consideration of the ways in which it could affect the evaluation process, we ruled out (in agreement with the funding agency) the possibility of introducing a "fictitious" additional application among the real set of applications assigned to each reviewer. Instead, we proceeded with the built-in-survey experiment. Respondents were aware that the description of the application attributes they were rating in the factorial survey was a "hypothetical one."
In the experiment, we asked them to score a hypothetical application (note 8) based on the description of some attributes relevant for the assessment, with the same definition of the evaluation criteria, using an evaluation template identical to the one used by the funding agency (truth of outcome measures). Our application consisted of three main parts: first, a description of the group composition, structure, interdisciplinarity, and gender balance, including whether leadership was female or male, with no names provided (note 9); second, a quantitative summary of the group curriculum vitae (CV) from the last 3 years, including a number of past-record items, such as publications (number, type, and impact indicators), talent (PhD training and attraction of ERC grantees), and scientific and transfer activity (funded projects, contracts, income, patents, spinoffs); and third, a statement of the group strategy, where quality and feasibility were to be assessed (note 10).
It is important to emphasize that the aim of the program was to competitively provide research groups with basic funding; it did not fund specific research projects, which is why the main basis of the evaluation was the CV of the group, its past record, and the group strategy statement. The scoring sheet of the experiment was the same as the one used in the program. See the weighting of the evaluation criteria in the Supplementary material (SM 1, Table S1).
The experiment design resembles some "classic" ones implemented for hiring candidates (Ceci & Williams, 2015; Moss-Racusin et al., 2012; Swim, Borgida et al., 1989; Williams & Ceci, 2015), in the sense that a single, unique application (and its merits), characterized by specific attributes related to the evaluation criteria, is embedded into a population survey experiment (Mutz, 2011) or factorial survey (Auspurg & Hinz, 2015).
7. ...describing the substantive issue of interest in the survey. Methodologically, it is true that if the subjects of the experiment (a field survey in our case) were aware of the substantive topic (a quite sensitive one, as gender bias is), they could change their behavior (implicit bias or stereotyped beliefs) (Krumpal, 2013) to adapt to more socially desirable behavior (Walzenbach, 2019), or simply opt for self-selection by not answering (Leeper, 2017). It is well known that failing to consider this in the research design undermines the robustness of the conclusions. This is why we embedded our randomized experiment in a general study of evaluation practices in academia, specifically implemented in one evaluation agency and focused on one funding instrument design. In factorial surveys, the randomization of treatments (of some questions or parts of the questionnaire) has become standard. More information about reporting on the experiment is provided in the Supplementary material (SM 2).
8. Herein, experimental application.
9. There is a factor related to the funding instrument under study (research group funding) that is worth highlighting: the gender composition of the group was included as an evaluation item in the template; as mentioned before, this precluded an experiment based on gender blinding, but at the same time it allowed us to use this feature to design the measurement of the effect of gender.
10. The experimental application resembled the real ones that the reviewers were evaluating in parallel for the agency, but was not completely realistic, for two reasons. First, as the experiment was embedded in a survey, there were space and time limitations for completion; second, and most importantly, stratifying by field was not feasible due to the size of the population of evaluators. Therefore, for the experimental application to be generic and to fit a variety of potential scientific fields, most of the information in the group CV had to be presented as quantitative indicators, although some indicators of the quality and impact of their publications were introduced, such as publications in Q1 journals, the position of papers in the citation distribution, or the number of prestigious grants (e.g., from the ERC). Overall, the experimental application was shorter and contained mostly quantitative information, but we do not believe that this relative lack of authenticity compromised the validity of the experiment.
There were two versions of the identical experimental application, with a description of the merit items (past performance), strategy, and group and PI characteristics to be assessed, which varied only in the designation of the gender of the group leader or PI. To test whether reviewers assessed male and female PIs differently, one group randomly received a female-led application (treatment), whereas the other group received a male-led application (control); we use the terms treatment and control in the standard way in the experimental literature.
One of the challenges for the experimental design is to rule out the possibility of a "bad randomization" allocation that could produce treatment and control groups whose background attributes differ in some relevant way; such a situation can produce errors of inference if the observations are pooled.
Considering that we had a fairly small population of 74 selected reviewers (note 11) with an unequal gender distribution, although within the boundaries of what the law defines as a gender-balanced composition of evaluation panels and reviewers, it was not advisable to use complete random assignment, where each unit (reviewer) is assigned to the treatment (female applicant) or control group (male applicant) with equal probability. To ensure equal probability of assignment, we used blocked assignment (see the flow diagram in Figure 3), where observations are first divided into two distinct strata (male and female reviewers) and the subjects within each stratum are allocated randomly to treatment and control groups. With this approach, male and female reviewers have the same probability of receiving either of the two designated PI genders, and each block can also be analyzed as an independent experiment (note 12).
In summary, block randomization guarantees that a specific proportion of a subgroup of the population will be assigned to the treatment and control groups (see Figure 3). This design feature is important because, unless the probability of assignment to the treatment group is identical for every block, pooling observations across blocks will produce biased estimates of the overall average treatment effect (ATE). Its advantages also relate to practical or ethical imperatives and to statistical precision, which is very important in small populations, as is the case here.
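The point about pooling can be made concrete with a minimal sketch: the overall ATE is the size-weighted average of within-block ATEs. The function and the score data below are hypothetical illustrations, not the study's data or code (the paper's estimates were produced with Stata).

```python
def blocked_ate(blocks):
    """Overall ATE as the size-weighted average of within-block ATEs.
    Pooling raw observations instead is unbiased only when the treatment
    probability is identical across blocks."""
    total = sum(len(b["treat"]) + len(b["control"]) for b in blocks.values())
    ate = 0.0
    for b in blocks.values():
        n_b = len(b["treat"]) + len(b["control"])
        ate_b = (sum(b["treat"]) / len(b["treat"])
                 - sum(b["control"]) / len(b["control"]))
        ate += (n_b / total) * ate_b  # weight block ATE by block share
    return ate

# Hypothetical 0-5 scores by reviewer-gender block (illustrative only).
blocks = {
    "female_reviewers": {"treat": [3, 4, 2, 5], "control": [4, 4, 3, 4]},
    "male_reviewers":   {"treat": [4, 3, 4, 4, 3], "control": [4, 4, 3, 4, 3]},
}
ate = blocked_ate(blocks)
```

Because each block receives exactly its designated share of treated units, this weighted estimator is identical to the pooled difference in means only when the two blocks have the same treatment probability, which is the design condition discussed above.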
For the randomization, subjects were listed in a spreadsheet in two groups according to gender and assigned a random number in each group. Assignment to the two versions (designation of a Female PI- or Male PI-led team) was randomized, in a blocked assignment by reviewer gender, using a pre-established randomization algorithm produced by a third-party source.

11. We need to clarify that this is not a small-sample study (typically known as a small-N design) (Smith & Little, 2018). We are not drawing a random sample from an unknown population but working with the whole small population of a funding agency case. This difference has implications because our design concentrates its experimental power at the agency case level and provides high-powered tests of effects at that level. Working at the organizational and program level has advantages in terms of precise measurement, experimental control, and replication. We believe that in environments or contexts that can be explored at the individual level (as in the case of remote reviews), and when our focus (bias) is on aggregate results for our population, studies with a relatively limited number of participants are less prone to criticism regarding size.
12. Block randomization ensures that equal numbers of male and female evaluators are assigned to each experimental condition. This design was implemented to address some practical statistical concerns. First, it reduces sampling variability relative to a complete or simple randomization procedure, considering that male and female evaluators did not represent the same proportion of the population; with block randomization, the subjects in each block (men and women) have similar potential outcomes. Second, block randomization also makes subgroups available for separate analysis; for example, in our gender analysis we might be interested in comparing the ATE (average treatment effect) among male reviewers with the ATE among female reviewers (Imbens & Rubin, 2015). For a snapshot description of our experiment, see the Supplementary material (SM 3).
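To illustrate, the blocked assignment can be sketched as follows. This is a hypothetical reimplementation, not the actual procedure (which used a pre-established third-party algorithm): reviewers are split by gender, shuffled within each stratum, and half of each stratum is allocated to each PI-gender condition. The stratum sizes (32 female and 42 male reviewers) follow from the group sizes reported later in the paper.

```python
import random

def block_randomize(reviewers_by_gender, seed=12345):
    """Blocked assignment: within each gender stratum, allocate half of the
    reviewers to the treatment (female-PI application) and half to the
    control (male-PI application)."""
    rng = random.Random(seed)  # in practice, a pre-registered third-party seed
    assignment = {}
    for gender, reviewers in reviewers_by_gender.items():
        shuffled = reviewers[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        for r in shuffled[:half]:
            assignment[r] = "female_PI"  # treatment
        for r in shuffled[half:]:
            assignment[r] = "male_PI"    # control
    return assignment

# Stratum sizes from the experiment: 32 female and 42 male reviewers (N = 74).
blocks = {"female": [f"F{i}" for i in range(32)],
          "male":   [f"M{i}" for i in range(42)]}
assignment = block_randomize(blocks)
```

By construction, exactly 16 female and 21 male reviewers receive each condition, which is what makes the two blocks analyzable as independent experiments.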
We are measuring the existence of bias in the evaluation of applications of female and male PIs. Results will be examined in terms of the scores (0 to 5 on the relevant item) that the specific item of "group gender composition and leadership" received from male and female evaluators in cases in which the PI of the application was shown as male or female. With the applications identical in all other respects, if mean scores (ATE) differ significantly between male- and female-designated PIs, we can empirically confirm the existence of some kind of gender bias, caused by the gender of the PI.
In sum, we believe we add to the existing literature by:
1. implementing the experiment in a genuine evaluation context in a funding agency;
2. studying a real population of evaluators selected by the funding agency; and
3. using the whole population involved and not a sample.

MAIN RESULTS OF THE FIELD EXPERIMENT
In this section, we present the main results of the field experiment conducted in the funding agency. We focus first on the reviewers' ratings of the evaluation item that includes the designation of the gender of the PI. Second, we analyze how reviewers of both genders assessed the applications from female and male PIs.

Are Female and Male PIs Assessed Differently?
When we analyze the evaluation item that includes identification of the gender of the PI, we find some relevant results.
As Figure 4 shows, evaluators tend to score male PIs higher, although the differences are small and not statistically significant at the 95% confidence level. Therefore, when reviewers are randomly assigned to the treatment, there are no significant differences in the rating of the application leadership item for male and female PIs. The mean score for female PIs is 3.446 (SD 1.5174), and the mean score for male PIs is 3.716 (SD 0.9541). This visual conclusion is also confirmed if we treat our cases as a sample of an unknown population, with various tests to check the levels of uncertainty of our measures (note 13) (Cox, 2020): first, the t-test assuming equal variances as a parametric tool; second, the Kruskal-Wallis test as a nonparametric one; and, finally, bootstrap intervals for effect size (note 14).
The t-test of the mean differences in the scores given by the evaluators to the leadership item shows no significant differences between male and female PIs (see Table 1).
The p-value associated with the t-test is not below 0.05, so the contrast is not significant, whether two-sided or one-sided (left or right); this means that the dependent variable (the scores taken as a continuous variable) does not differ in mean between the two groups at the 95% confidence level. Thus, we cannot reject the null hypothesis (H0) that no differences exist between the groups.
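For readers who want to reproduce the contrast from the summary statistics alone, the pooled two-sample t statistic can be computed as below. This is an illustrative stdlib sketch, not the authors' code (their estimates were produced with Stata); the means and SDs are those reported above.

```python
import math

def pooled_t_from_stats(m1, s1, n1, m2, s2, n2):
    """Two-sample t statistic assuming equal variances (pooled SD)."""
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return (m1 - m2) / se

# Summary statistics reported in the paper (leadership item, 0-5 scale).
t = pooled_t_from_stats(3.446, 1.5174, 37,   # female-PI applications
                        3.716, 0.9541, 37)   # male-PI applications
# |t| is about 0.92 with 72 degrees of freedom, well below the two-sided
# 5% critical value (about 1.99), consistent with the nonsignificant result.
```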
Whereas p-values are used to assess the statistical significance of a result, we also need an additional estimate of the effect size, which assesses its practical significance. The effect size measures most commonly used for group differences belong to the d family, which includes estimators such as Cohen's d, Hedges's g, and Glass's Δ (see Table 2).
For female PIs, the effect size for scores of the leadership item is 0.21 standard deviations (SD) lower than for male PIs. This is usually considered a small effect size, as it represents around 0.2 standard deviations (Cohen, 1988); in another possible interpretation, based on the U-statistics, it means 15% nonoverlap and 85% overlap in the distributions, or that 54% of the group of male PIs exceeds the 50th percentile of the female PI group.

13. ...null hypothesis, statistical significance, and the confidence intervals (Cox, 2020; Fraser, 2017), but we used them as a way of addressing our research questions.
14. For all the estimates presented in the paper we have used Stata 17 (StataCorp, 2021).
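The d-family estimates can be recovered from the same summary statistics; the sketch below (an illustration, not the authors' Stata code) computes Cohen's d with the pooled SD, plus Hedges' small-sample correction.

```python
import math

def cohens_d_from_stats(m1, s1, n1, m2, s2, n2):
    """Cohen's d with pooled SD; Hedges' g applies a small-sample correction."""
    sp = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    d = (m1 - m2) / sp
    g = d * (1 - 3 / (4 * (n1 + n2) - 9))  # Hedges' correction factor
    return d, g

d, g = cohens_d_from_stats(3.446, 1.5174, 37,   # female-PI applications
                           3.716, 0.9541, 37)   # male-PI applications
# d is about -0.21: female-PI scores sit roughly 0.21 pooled SDs below
# male-PI scores, matching the small effect size reported in the text.
```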
As the scoring could also be treated as categorical, and the distribution is not completely normal, a nonparametric test was used: a Kruskal-Wallis H test was conducted to confirm, under different assumptions, whether the rating of the leadership item differed between (a) Female PI (n = 37) and (b) Male PI (n = 37). The test showed that there was not a statistically significant difference in scores between the two groups, χ2(1) = 0.112, p = 0.7375.
A significance level of 0.05 indicates a 5% risk of concluding that a difference exists when there is no real difference: if the p-value is less than or equal to the significance level, we should reject the null hypothesis and conclude that not all population means are equal. As the p-value here is much larger than the significance level, the null hypothesis cannot be rejected with this test either.
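As an illustration of this nonparametric check, a two-group Kruskal-Wallis H statistic (with midranks for ties, the standard tie correction, and a p-value from the chi-square approximation with one degree of freedom) can be coded from scratch; the scores passed in at the end are hypothetical, since the raw ratings are not reproduced here.

```python
import math

def kruskal_wallis_two_groups(a, b):
    """Kruskal-Wallis H for two groups (equivalent to a Mann-Whitney test),
    with midranks for ties; p-value from the chi-square approximation, df = 1."""
    pooled = sorted(a + b)
    ranks = {}
    i = 0
    while i < len(pooled):                 # assign midranks to tied values
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        ranks[pooled[i]] = (i + 1 + j) / 2  # average of ranks i+1 .. j
        i = j
    n = len(a) + len(b)
    r_a = sum(ranks[x] for x in a)
    r_b = sum(ranks[x] for x in b)
    h = 12 / (n * (n + 1)) * (r_a**2 / len(a) + r_b**2 / len(b)) - 3 * (n + 1)
    ties = {}
    for x in pooled:
        ties[x] = ties.get(x, 0) + 1
    h /= 1 - sum(t**3 - t for t in ties.values()) / (n**3 - n)  # tie correction
    h = max(h, 0.0)                        # guard against floating-point negatives
    p = math.erfc(math.sqrt(h / 2))        # chi-square survival function, df = 1
    return h, p

# Hypothetical 0-5 ratings (illustrative only, not the study's raw data).
h, p = kruskal_wallis_two_groups([4, 5, 2, 3, 4], [4, 3, 4, 5, 3])
```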
Simulation studies have shown that bootstrap confidence intervals may be preferable to confidence intervals based on the noncentral t distribution when the variable of interest does not have a normal distribution (Algina, Keselman, & Penfield, 2006; Kelley, 2005) (note 15).
We also performed a nonparametric bootstrap estimation, resampling the cases. The results of the bootstrap statistics (see Table 3) are consistent with the other effect size tests implemented.
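A percentile-bootstrap interval for Cohen's d can be sketched as follows; this is an illustrative stdlib implementation with hypothetical score vectors, not the Stata procedure behind Table 3.

```python
import math
import random

def bootstrap_ci_cohens_d(a, b, n_boot=5000, alpha=0.05, seed=7):
    """Percentile-bootstrap confidence interval for Cohen's d, resampling
    each group with replacement (nonparametric bootstrap)."""
    rng = random.Random(seed)

    def cohens_d(x, y):
        nx, ny = len(x), len(y)
        mx, my = sum(x) / nx, sum(y) / ny
        vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
        vy = sum((v - my) ** 2 for v in y) / (ny - 1)
        sp = math.sqrt(((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2))
        return (mx - my) / sp if sp > 0 else 0.0  # guard degenerate resamples

    reps = sorted(
        cohens_d([rng.choice(a) for _ in a], [rng.choice(b) for _ in b])
        for _ in range(n_boot)
    )
    lo = reps[int(n_boot * alpha / 2)]
    hi = reps[int(n_boot * (1 - alpha / 2)) - 1]
    return cohens_d(a, b), (lo, hi)

# Hypothetical 0-5 ratings (illustrative only, not the study's raw data).
female_pi = [3, 4, 2, 5, 3, 4, 1, 5, 3, 4, 2, 3, 5, 4, 3]
male_pi   = [4, 3, 4, 5, 3, 4, 2, 4, 3, 5, 4, 4, 3, 4, 4]
d, (lo, hi) = bootstrap_ci_cohens_d(female_pi, male_pi)
```

An interval that straddles zero, as one would expect for small effects like those reported here, indicates that the effect size is not distinguishable from no effect at the chosen level.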

Do Female and Male Reviewers Differ in Their Assessments of Male and Female PI Applicants?
We also tested whether there are differences in the patterns of evaluation by female and male evaluators. In our experiment, male and female reviewers are assigned a female- or male-led application with the same probability.
As groups, male and female reviewers do not score significantly differently; in aggregate, female reviewers score slightly lower (3.48, SD 1.35) than male reviewers (3.65, SD 1.21), but the differences are small and, again, not statistically significant. The assumption that the dependent variable is normally distributed is met; however, as the p-value associated with the homoscedasticity test statistic (homogeneity of variances) is not greater than 0.05, this assumption is violated, and it is not recommended to proceed from a parametric point of view alone.
As a nonparametric alternative, the Kruskal-Wallis test showed that there was not a statistically significant difference in scores between the two groups of evaluators, χ2(1) = 0.150, p = 0.6985.
But the interesting question we can address thanks to the block assignment is the interaction between reviewer gender and PI gender: Do female reviewers and male reviewers score differently when assessing female and male PIs? As we observe in Figure 5, the difference in mean values among the groups is mainly related to a different pattern in the distribution of marks: the marks assigned by female reviewers to female PIs show a higher dispersion than those assigned to male PIs (who are rarely given very low scores by female reviewers). This is not the case for male reviewers: analyzing the mean values of the scores assigned by male reviewers to the randomly allocated applications (male or female PI), we observe similar median values and a lower dispersion of marks (see Figure 5).
As noted, the differences in the aggregate mean values of scores arise mainly from the different way in which female reviewers assess female and male PIs, especially at the extreme of the distribution.
The advantage of the stratified experimental design or blocked assignment is that we can also monitor the scoring patterns of female and male reviewers and treat them as independent experiments.
Figure 6 represents the mean scores of the treatment, differentiated by the gender of the reviewer. Analyzing the mean values, we observe that female reviewers assign higher scores to male PIs (3.81, SD 0.8539) than to female PIs (3.16, SD 1.6705); in comparison, the mean ratings of female and male PIs by male reviewers are almost the same (3.67 and 3.64, with SDs of 1.3904 and 1.0385, respectively).

Male reviewers rate male and female PIs with almost the same mean scores, and female reviewers rate female PIs worse than male PIs, who are favored.
As in the first analysis, we conducted some additional tests to assess the uncertainty of our measures. Again, treating our population as a sample, the differences in the two-sample t-test with equal variances are not statistically significant, but there are some important nuances (see Table 4).
For both female and male reviewers taken as groups, the p-value associated with the t-test is not below 0.05, so the contrast is not statistically significant, whether two-sided or one-sided (left or right). We conclude that the dependent variable does not present differences in means between the two groups tested (female and male reviewers scoring female and male PIs) at the 95% confidence level.
As in the analysis of the scores of the PI leadership item, we will repeat the nonparametric analysis, considering the distribution of the ratings, for the four different groups of interest (evaluators' gender and PIs' gender).
A Kruskal-Wallis H test was conducted to determine whether the rating of the leadership item differed across the four groups: (a) Female Reviewers Assessing Female PIs (n = 16); (b) Female Reviewers Assessing Male PIs (n = 16); (c) Male Reviewers Assessing Female PIs (n = 21); and (d) Male Reviewers Assessing Male PIs (n = 21). The test showed that there was not a statistically significant difference in scores between the four groups, χ2(3) = 1.398, p = 0.7059. Therefore, we cannot reject the null hypothesis (H0) that no differences exist between the evaluation of female and male PIs by female and male reviewers.
Although we could not reject the null hypothesis, we observe a larger effect size in the comparison of female reviewers' scores of female and male PIs than in the corresponding comparison for male reviewers (see Table 5).
Comparing the effect size (Cohen's d) of the gender of the PI for female and male reviewers, we observe important differences. Whereas the effect size of the differences in female reviewers' scores of female and male PIs reaches almost 0.50 standard deviations, the effect size of the corresponding differences for male reviewers is marginal: 0.02 standard deviations.
An effect size of 0.5 means that the score allocated by female reviewers to male PIs is 0.5 standard deviations higher than the average female PI and, hence, male PI scores exceed the scores of 49% of the female PIs.
These analyses confirm that the small difference identified in ratings across PI gender originates in the scores assigned by male and, especially, female reviewers; although the differences in mean values are small (and not statistically significant), they come mainly from the different way in which female evaluators, on average, assess applications with male and female PIs.

SUMMARY AND CONCLUSIONS
In this paper, we implemented a survey field experiment in the Galician university research funding and evaluation agencies at the time the evaluation of the real applications to the call was taking place. The experiment focused on identifying the effects of the gender of the application's PI alone, and in interaction with another factor (the gender of the reviewer).
The main purpose of the experiment was not so much to confirm theories (with additional empirical evidence) as to confirm or disconfirm the existence of the regularities regarding gender bias. The experiment, built as a hypothetical funding application evaluation form in a factorial survey, was implemented in the same period as the real evaluation, using the same evaluation criteria and weighting in the evaluation form, and with the real reviewers selected by the funding agency.
Regarding our first research question, we found only small differences in the scores of the randomly assigned applications, with male PI scores higher than female PI scores, but these differences were not statistically significant. Thus, the results do not allow the null hypothesis of no differences in scores to be rejected. Estimating the effect size, we found small magnitudes.
Regarding the second research question, the differences in scoring arise mainly from how female evaluators assign scores to female and male PIs, with better scores for the latter. Although the differences are not statistically significant, estimating the effect size of the differences for reviewers of different genders, we found that, for female reviewers, the difference between female PI and male PI scores is around 0.5 standard deviations, whereas for male reviewers the effect size of the difference is negligible (0.02 SD).
Observational evidence about the existence of gender bias in research funding has yielded contradictory results, partly due to conceptual imprecision, but also because of methodological and measurement shortcomings, in addition to contextual or sampling effects that preclude controlling for unobserved heterogeneity. As a consequence, this type of observational research has looked for the causes of effects, adopting an approach in which gender bias (unobserved) is often treated as the residual of the gender disparity not explained by the observed variables, including merit.
In contrast, the experimental literature to which our study aims to contribute introduces randomization and/or manipulation to search for the effects of causes and to confirm (or reject) the existence of gender differences in assessment that could support the idea of bias against one gender or the other. The study by Moss-Racusin et al. (2012), albeit analyzing a hiring decision, shares some methodological similarities with ours. In contrast with their results, we did not find that male applicants were rated significantly higher than female ones. In this sense, our findings are more in line with previous results (Carlsson et al., 2021; Ceci & Williams, 2015; Forscher et al., 2019; Williams & Ceci, 2015) claiming that there is no systematic bias against women in peer review evaluation. Nor did we find in our case that women were favored.
More recent studies based on German factorial surveys find that the evaluation favored women (Solga et al., 2023), and attribute the finding to the potential effect of "gender equality policies" in hiring procedures.
However, academic hiring is different from research funding, and with our data from the assessment of a set of highly experienced reviewers selected by the Galician Regional evaluation agency, only small differences were found; this connects with Ceci's claim (Ceci, 2018) about the relevance of the experience factor in research evaluation.
Previous experimental research has often been criticized for the lack of genuineness of the laboratory context, the nonauthenticity of the treatments, or the unrealistic character of the subjects, and experimental studies in funding agencies are rare. We have added to the literature a field study with a research design that used blocked randomization of the treatments and achieved a 100% response rate from the population of evaluators involved in an ongoing research funding process, thus providing more robust evidence in terms of causal inference.
Of course, the context of the regional agency, its practices for selecting evaluators, and the behavioral effects of more than 15 years of gender equality policies could affect the evaluation (and can be expected to affect it over time), but determining the effect of the regulation over time was not the research question guiding our experiment. Our experiment's main empirical result is that the gender of the applicant cannot be seen as a direct cause of a higher or lower rating of the funding application (we believe that the randomization of the treatment, if properly implemented, is a direct response to the effects-of-causes approach). Therefore, we found no support for an effect of the gender of the PI on peer review outcomes, at least with our group of real reviewers.
In much of the previous experimental research, the gender of the reviewing participants has either not been a focus in the analysis or, when it has been, no effects on the responses were reported (Forscher et al., 2019;Moss-Racusin et al., 2012;Solga et al., 2023).
In our case, female and male reviewers differ in their assessments; however, these effects are not in line with the matching hypothesis (or with the claim that reviewers hold same-gender preferences or make gender-role congruity associations) (Jayasinghe et al., 2003; Marsh et al., 2009, 2011). Nevertheless, some of our results show that female reviewers give male PIs higher scores than female PIs, by up to 0.5 SD; this leadership "bonus" in the scoring of male PIs could suggest that stereotypes about gender attributes associated with leadership roles may operate (Eagly & Karau, 2002), at least among the female reviewers in our study. Empirically, female reviewers' ratings of female PIs show a higher dispersion than their ratings of male PIs. The results emerging from our experimental design are robust, but we acknowledge that the differences in female reviewers' assessments of male and female PIs could stem from other nonobservable attributes of our population, such as related evaluation experience or the disciplines or fields of research; the idea that more senior or excellent female academics may apply more demanding expectations in peer review has already been suggested by empirical findings elsewhere (Borsuk et al., 2009; Cruz-Castro & Sanz-Menéndez, 2021). Factors such as the distribution of reviewers among disciplines and research fields could also influence the aggregate assessment values, given the different ways in which reviewers interpret or give meaning to the description of the application included in the experiment.
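The standardized mean difference referred to above (e.g., the up-to-0.5 SD gap) is conventionally expressed as Cohen's d: the difference in group means divided by the pooled standard deviation. A minimal sketch follows; the scores are invented for illustration on an assumed 0-10 scale and are not the study's data.

```python
from statistics import mean, stdev

def cohens_d(scores_a, scores_b):
    """Effect size of a mean difference in pooled-SD units (Cohen's d)."""
    na, nb = len(scores_a), len(scores_b)
    va, vb = stdev(scores_a) ** 2, stdev(scores_b) ** 2  # sample variances
    # Pooled standard deviation across the two groups:
    pooled_sd = (((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)) ** 0.5
    return (mean(scores_a) - mean(scores_b)) / pooled_sd

# Invented example scores (not from the experiment):
male_pi = [7.5, 8.0, 8.5, 7.0, 8.0]
female_pi = [7.0, 7.5, 7.0, 6.5, 8.0]
d = cohens_d(male_pi, female_pi)
```

Expressing differences in SD units, rather than raw score points, is what makes effect sizes comparable across calls and rating scales.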
The findings have policy implications regarding the rationale for increasing the number of female reviewers on panels as a policy action to raise female funding success rates or ratings; they are in line with previous empirical research showing a lack of effect of the gender composition of committees on the number of successful female candidates (Bagues et al., 2017; Zinovyeva & Bagues, 2015).
As for the caveats of our research, we should acknowledge the following. First, the evidence presented, albeit robust, is limited in terms of generalizability to a broader community of reviewers, mostly because of the small number of subjects in the experiment and the local context and assessment practices.
In fact, in the real contexts of small countries or regional funding agencies, the number of applications and reviewers is too limited to allow statistical generalization of the results; in our case, because all reviewers involved in the evaluation participated and there were no dropouts, the usual threats to internal validity were absent. Moreover, in favor of the case, we should note that the funding agency under study implements a policy of identifying and selecting reviewers from outside the region, targeting experienced scholars and scientists across several Spanish research institutions. Nevertheless, larger numbers would be needed to qualify the results by scientific field; a large-scale strategy of replicating the experiment in other funding agencies could be a direction for further research.
Second, the type of instrument (group funding) and the limited level of competitiveness of the call (with success rates higher than 60%) may have contributed to the outcomes, as there is some observational evidence that women tend to be disfavored in more competitive contexts (Ors, Palomino, & Peyrache, 2013).
Finally, we have analyzed a policy instrument that mentions in the call the commitment to gender equality in science and academia, and this may have had a moderating impact on the gender effect, possibly linked to socially desirable behavior or rational adaptation to a changing policy environment.
To sum up, more comparative research is needed in other funding agencies to control for context-specific factors of this and other types; in this regard, replication of the experiment in other funding agencies would strengthen the findings.

Figure 1. Observational design: Example of competing causal mechanisms of the effect of gender. *Nonobservable.

Figure 2. Experimental design: Example of causal mechanisms of the effect of gender.

Figure 3. Flow diagram of the field experiment.

Figure 4. Mean scores (ATE) for the randomly assigned gender of the PI, with 95% CI.

Figure 5. Box plot distribution of scores of Female PI and Male PI by gender of the reviewers. Note: From left to right: Female reviewer to Female PI (1), Female reviewer to Male PI (2), Male reviewer to Female PI (1), Male reviewer to Male PI (2).

Figure 6. Mean scores (ATE) of PI gender by gender of the reviewers, with 95% CI.

Table 1. T-test (two-sample t-test with equal variances) of scores by PI gender: mean values, SE, SD, and 95% CI of the scoring.

Table 2. Effect size based on mean comparison of scores by PI gender.

Table 4. Two-sample t-test with equal variances of score differences by PI gender and gender of the reviewers.

Table 5. Effect size based on mean comparison of female reviewers' scores by PI gender.