Indicators of research quality, quantity, openness, and responsibility in institutional review, promotion, and tenure policies across seven countries

Abstract The need to reform research assessment processes related to career advancement at research institutions has become increasingly recognized in recent years, especially to better foster open and responsible research practices. Current assessment criteria are believed to focus too heavily on inappropriate criteria related to productivity and quantity as opposed to quality, collaborative open research practices, and the socioeconomic impact of research. Evidence of the extent of these issues is urgently needed to inform actions for reform, however. We analyze current practices as revealed by documentation on institutional review, promotion, and tenure (RPT) processes in seven countries (Austria, Brazil, Germany, India, Portugal, the United Kingdom and the United States). Through systematic coding and analysis of 143 RPT policy documents from 107 institutions for the prevalence of 17 criteria (including those related to qualitative or quantitative assessment of research, service to the institution or profession, and open and responsible research practices), we compare assessment practices across a range of international institutions to significantly broaden this evidence base. Although the prevalence of indicators varies considerably between countries, overall we find that currently open and responsible research practices are minimally rewarded and problematic practices of quantification continue to dominate.


INTRODUCTION
The need to reform research assessment processes related to career advancement at research institutions has become increasingly recognized in recent years, especially to better foster open and responsible research practices. In particular, it is claimed that current practices focus too much on quantitative measures over qualitative measures (Colavizza, Hrynaszkiewicz et al., 2020; Malsch & Tessier, 2015), with misuse of quantitative research metrics, including the Journal Impact Factor, among the most pressing issues both for equitable research assessment generally and for assessment that aims to foster open and responsible research in particular. Therefore, recent years have seen a focus on attempts to understand how principles and practices of openness and responsibility are currently valued in the reward and incentive structures of research-performing organizations, especially by direct examination of organizations' review, promotion, and tenure (RPT) policies. Such studies have heretofore focused on specific contexts, however. Work led by Erin McKiernan and Juan Pablo Alperin examined policies in place across a range of types of institutions in the United States and Canada (Alperin, Schimanski et al., 2020; McKiernan, Bourne et al., 2016; Niles, Schimanski et al., 2020). Rice, Raffoul et al. (2020) studied criteria used across a range of countries, but only within biomedical sciences faculties. Hence, further work is needed to describe types of criteria in place across a range of institutional types internationally.
This paper aims to fill this gap. Our primary research question can be formulated as "What quantitative and qualitative criteria for review, promotion, and tenure are in use across research institutions in a purposive sample of seven countries internationally?" Subquestions include "How prevalent are criteria related to open and responsible research in these contexts?," "How prevalent are potentially problematic practices (e.g., use of publication quantity or journal impact factors)?," and "What trends can be observed across this sample?" To answer these questions, we investigate the prevalence of qualitative and quantitative indicators in RPT policies across seven countries: Austria, Brazil, Germany, India, Portugal, the United Kingdom, and the United States. This involved manually collecting 143 RPT policy documents from 107 institutions. These documents were then systematically coded for the inclusion of language related to 17 elements (including those related to qualitative or quantitative assessment of research, service to the institution or profession, and open and responsible research practices) using a predefined data-charting form. By directly comparing the indicators and criteria in place at such a range of international institutions, we aim to broaden the evidence base on the practices currently in place (see footnote 4).

Research Assessment and Researcher Motivations
Institutional policies regarding RPT typically focus on three broad areas: research, teaching, and service (both to the profession and the institution). The relative importance of each varies across institutions and has also changed over time (Gardner & Veliz, 2014; Youn & Price, 2009). In the European context, a recent survey of researchers investigated indicators widely used at EU institutions for review, promotion, and tenure. The most common factors used in research assessment were (according to survey respondents): number of publications (68%), patents and securing funds (35%), teaching activities (34%), collaboration with other researchers (32%), collaboration with industry (26%), participation in scientific conferences (31%), supervision of young researchers (25%), awards (23%), and contribution to institutional visibility (17%) (European Commission, Directorate General for Research and Innovation, 2017).
When it comes to the assessment of research contributions, reflecting the common idiom "publish or perish," publication in peer-reviewed venues remains central. Primary publication types vary across disciplines. Although journal articles dominate in Science, Technology, Engineering, and Mathematics (STEM) subjects, monographs or edited collections have greater importance in the Humanities and Social Sciences (Adler, Ewing, & Taylor, 2009; Alperin et al., 2020). Within Computer Science, meanwhile, publication in conference proceedings is the most important factor (McGill & Settle, 2011). However, irrespective of which type of publication is favored, institutions tend to position productivity (often quantified via metrics) as a defining feature in RPT policies (Gardner & Veliz, 2014). The ways in which this emphasis on productivity and quantification influences academics' focus and shapes behaviors, often in detrimental ways, are worth expanding upon to understand how current trends in RPT policies may be limiting the uptake of open and responsible research.
Institutional committees tasked with determining whether research contributions are sufficient for promotion, review, or tenure face something of a dilemma. Although the quantity of publications is comparatively easy to assess, measuring their quality is a more difficult challenge. Ideally, committees would be able to read each of the contributions themselves to make their own firsthand judgments on the matter. However, the mass of material created, as well as increased research specialization drastically reducing the number of experts that possess the required expertise for such quality judgments, mean that usually proxy indicators for quality are sought. Here, two factors are particularly popular: publication venue and citation counts.
Footnote 4: Results from this study were previously made available via the ON-MERRIT project report "D6.1 Investigating Institutional Structures of Reward & Recognition in Open Science & RRI" (Pontika, Klebel et al., 2021). This paper presents enhanced analysis based on a slightly modified data set (corrected to eliminate minor inconsistencies in data charting, as explained in footnote 7, Section 3.3). In addition, the data underlying this study are also incorporated into the data paper (Pontika, Gyawali et al., 2022a).
In perceptions of the prestige of academic journals, the Journal Impact Factor has assumed a particularly pernicious role. Created by Eugene Garfield of the Institute for Scientific Information, the Journal Impact Factor calculates an average of citations per article within the last 2 years to provide a metric of the relative use of academic literature at the journal level. Originally created to assist library decisions regarding journal subscriptions, the Journal Impact Factor soon came to be used as a proxy for relative journal importance by research assessors and researchers themselves (Adler et al., 2009; Walker, Sykes et al., 2010). Various criticisms have been levelled at the Journal Impact Factor, most prominently that relatively few outlier publications with many citations skew distributions such that most publications in that journal fall far below the mean. Additional criticisms include that differences in citation practices between (and even within) fields make the Journal Impact Factor a poor tool for comparison, that it is susceptible to gaming by questionable editorial practices, and that it suffers from a lack of transparency and reproducibility (Fleck, 2013). Nonetheless, its use as a proxy for research quality in research assessment became commonplace (Gardner & Veliz, 2014; McKiernan, Schimanski et al., 2019). McKiernan et al. (2019) studied RPT documents and found that 40% of North American research-intensive institutions mentioned the Journal Impact Factor or closely related terms. Accordingly, researchers commonly list a journal's impact factor as a key factor they take into account when deciding where to publish. Citation counts at the article level are also often used as a proxy for research quality within RPT processes (Adler et al., 2009; Brown, 2014). Indeed, Alperin et al. (2019) found that such indicators were mentioned by the vast majority of institutions. However, citations have been widely criticized for being too narrow a measure of research quality (Curry, 2018; Hicks, Wouters et al., 2015; Wilsdon, Allen et al., 2015). The application of particularistic standards is especially perilous for early-career researchers who have yet to build their profile. When citation metrics are used to evaluate research contributions, initial positive feedback leads to the self-reinforcing loop known as the Matthew Effect (Wang, 2014).
Moreover, indicators such as the h-index are highly reactive (Fleck, 2013) and therefore risk reifying monopolization of resources (prestige, recognition, money) in the hands of a select elite. The h-index was designed as a measurement tool to showcase the consistency of the cited researchers but creates a disadvantage for early-career researchers and neglects the diversity of citation rates across scientific disciplines and subdisciplines (Costas & Bordons, 2007).

Research Assessment and Open and Reproducible Research
Multiple initiatives in the last decade have sought to raise the alarm on the overuse of quantitative indicators and highlight the need to consider a broader range of practices (beyond publications). For instance, the San Francisco Declaration on Research Assessment (DORA) specifically criticized use of the Journal Impact Factor in research assessment. The 10 principles of the Leiden Manifesto for Research Metrics (Hicks et al., 2015) sought to reorient the use of metrics by critiquing their "misplaced concreteness and false precision," arguing that quantitative evaluation should be used as a support for "qualitative, expert assessment," with strict commitments to transparency.
Such critiques of overquantification have developed alongside movements to foster open and reproducible research. These two trends meet where advocates of Open Science or Responsible Research & Innovation (RRI) identify concern among researchers that uptake of open and responsible research practices will negatively impact their career progress (Adler et al., 2009; Migheli & Ramello, 2014; Peekhaus & Proferes, 2015; Rodriguez, 2014; Wilsdon et al., 2015).
As a result, recent research has investigated if and how criteria relating to open and responsible research practices are rewarded in RPT policies. In particular, the influential "Promotion, Review, and Tenure" project headed by Erin McKiernan and Juan Pablo Alperin has examined these issues in depth by studying a corpus of RPT policy documents from 129 universities in the United States and Canada. This project found that aspects related to open and reproducible research were rare or undervalued. Alperin et al. (2019) found, for example, that only 6% of RPT policies mentioned "Open Access," often in a negative way. Public engagement, although mentioned in a large number of policies, was nonetheless undervalued by associating it with service, rather than research work. Meanwhile, 40% of policies from research-intensive institutions mentioned the Journal Impact Factor in some way, with the overwhelming majority of those (87%) supporting its use in at least one RPT document and none heavily criticizing it.
Similarly, Rice et al. (2020) examined the presence of "traditional" (e.g., publication quantity) and "nontraditional" (e.g., data-sharing) criteria used for promotion and tenure in biomedical sciences faculties. In that context, the authors found that mentions of practices associated with open research were very rare (data-sharing in just 1%, with Open Access publishing, registering research, and adherence to reporting guidelines mentioned in none). Most prevalent were traditional criteria including peer-reviewed publications (95%), grant funding (67%), national or international reputation (48%), authorship order (37%), Journal Impact Factor (28%), and citations (26%).
Although general trends, including prevalence of (sometimes problematic) quantitative measures and lack of recognition for open and responsible research practices, can be observed across these two groups of work, there are nonetheless important nuances we should take into account. In the biomedical context, Rice et al. (2020) saw "notable differences" in the availability of guideline documents across continents and "subtle differences in the use of specific criteria" across countries. In the United States/Canada context, meanwhile, differences across types of institutions were observed, with, for instance, "research-intensive" institutions being more likely to encourage use of the Journal Impact Factor (McKiernan et al., 2019). These differences, across institutional types and national boundaries, require further investigation.
This current study complements and extends this work. Such work is crucially important, especially as reform of rewards and recognition processes is now a policy priority, particularly in Europe. Vanguard institutions such as Utrecht University in the Netherlands are already implementing such reforms (Woolston, 2021). The Paris Call on Research Assessment, announced at the Paris Open Science European Conference (organized by the French Presidency of the Council of the European Union) in February 2022, calls for evaluating the "full range of research outputs in all their diversity and evaluating them on their intrinsic merits and impact" (Paris Call on Research Assessment, 2022). The Paris Call also sought the formation of a "coalition of the willing" to build consensus and momentum across institutions. The European Commission is currently building such a coalition (Research and Innovation, 2022). This paper further contributes to the evidence base to inform such reform.

METHODS
We assembled and qualitatively analyzed RPT policy documents from academic institutions in seven countries (Austria, Brazil, Germany, India, Portugal, the United Kingdom, and the United States).

Sampling
In selecting countries, we used purposive sampling, restricting candidates to countries whose primary language was covered by the research team (i.e., English, German, or Portuguese). Although automated translations can go a long way in basic understanding, the task at hand required knowledge of the policy landscape of the studied countries, as well as the ability for precise reading of source materials. We first identified four target countries, based on our European focus and our team's familiarity with the language (English, German, Portuguese) and policy landscapes of specific countries. In addition, the United States was included as a representative of a leading research country and to allow comparisons with previous research. Furthermore, we included India and Brazil as examples of large "low- and middle-income countries" based on gross national income per capita as published by the World Bank (see footnote 6). They play a growing role in research, and broaden our scope to include Asia and South America. We acknowledge that our sample of countries cannot be considered random or representative of the situation globally. However, given the current lack of knowledge of RPT criteria in place across national contexts, we nonetheless believe that our sample adds richly to current knowledge.
To include representative numbers of institutions of perceived high and low prestige, we used the Times Higher Education World University Rankings (WUR) 2020 to select institutions. Institutions from each selected country were sorted based on their relative WUR performances in the categories "Research" and "Citations." We then divided each category into three equally sized subcategories, defining them as "High-," "Medium-," and "Low-" performing institutions. Next, we calculated the median of each subcategory and selected the institutions that were closest to the median as representatives of this category. We included both the "Research" and "Citations" fields from the WUR, as both of these are research-related indicators. Duplicate entries of the same institution appearing in both categories were replaced by the next available institution in the "Citations" category. Our sampling procedure resulted in a sample of 107 institutions across seven countries (Table 1).
Footnote 6: World Bank: https://www.worldbank.org/en/home.
Although selection of institutions based on university rankings is an often-used strategy (e.g., Rice et al. (2020) use the Leiden Ranking in a similar approach), it is not without flaws. First, university rankings have been criticized for their reliance on biased and unreliable reputational survey data (Waltman, Calero-Medina et al., 2012) and issues of gaming and selective reporting (Gadd, 2021). Second, rankings such as the WUR only include the most prominent institutions and leave out many institutions based on partly arbitrary criteria (e.g., how many yearly publications they need to be included). The reported groups of "high," "medium," and "low"-ranked universities are only relative to the set of institutions included in the ranking, and not academia as a whole. We see our use of the WUR as a pragmatic approach to reproducibly sampling universities. This does not negate the deficiencies of rankings for guiding prospective students to choose institutions or informing policy decisions.
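As an illustration, the selection step for a single WUR category could be sketched as below. This is a minimal Python sketch, not the procedure's actual implementation: the function names, the toy scores, the rule for cutting groups of uneven size, and the parameter k (how many institutions are taken per subcategory) are all assumptions that the text does not specify, and the replacement of duplicates across the two categories is omitted.

```python
from statistics import median


def split_into_tertiles(ranked):
    """Split a best-to-worst list into three near-equal groups.

    The cut points for group sizes not divisible by three are an assumption.
    """
    n = len(ranked)
    cut1, cut2 = n // 3, 2 * n // 3
    return {
        "High": ranked[:cut1],
        "Medium": ranked[cut1:cut2],
        "Low": ranked[cut2:],
    }


def closest_to_median(group, k):
    """Return the k (name, score) pairs whose score is nearest the group median."""
    if not group:
        return []
    mid = median(score for _, score in group)
    return sorted(group, key=lambda item: abs(item[1] - mid))[:k]


def sample_institutions(scores, k=1):
    """Pick k institutions per tertile for one ranking category.

    scores: dict mapping institution name to its score in that category.
    k is a hypothetical parameter; the text does not give the number selected.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return {
        label: [name for name, _ in closest_to_median(group, k)]
        for label, group in split_into_tertiles(ranked).items()
    }


# Toy data: nine hypothetical institutions with made-up scores.
scores = {
    "U1": 90.0, "U2": 80.0, "U3": 70.0,
    "U4": 60.0, "U5": 50.0, "U6": 40.0,
    "U7": 30.0, "U8": 20.0, "U9": 10.0,
}
sample = sample_institutions(scores, k=1)
```

With the toy scores, each tertile contributes the institution sitting at its median. Running the same function on the second category and then replacing duplicates would complete the procedure as described.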

Data Collection
Policy documents were collected using a shared search protocol. We used Google to search for the institution name along with various constellations of keywords. Table 2 shows the set of keywords identified and used for policy identification in the three languages: English, German, and Portuguese.
Institutions often have both institution-wide and department-specific RPT policies. Due to difficulties in finding specific departmental policies in the United Kingdom and United States, we collected only institution-level policies. To ensure a consistent set of policies, we defined the following exclusion rules:
1. We did not collect advertisements for job descriptions, even though these could include some insightful requirements applicable to the RPT policies.
2. We included RPT policies only and not other policies such as Ethics, Diversity, and OA, where similar concepts could appear.
Policies could apply to any post-PhD researcher career stage. The collected policies evaluated various research-related positions. For example, in the United Kingdom, some institutions have separate policies for associate professors, full professors, and readers. In the United States, there are separate policies for tenured and nontenured staff. In Austria, there are policies for habilitation (qualification for teaching, needed for promotion to professor) and qualification agreements for tenure track (associate professors), but no promotion to full professor exists. In India, we could often not find specific policies, but rather the evaluation forms that researchers use to apply for promotion. In these cases, we therefore analyzed the evaluation forms instead. Some institutions have separate policies for all researcher categories (i.e., separate policies for lecturers, assistant professors, associate professors, professors, and so on), but others have a uniform policy covering all positions. Hence, the number of institutions is smaller than the total number of policies collected (Table 1). Where more than one policy was identified for an institution, we assessed the indicators separately for each policy and counted an indicator as "fulfilled" when it appeared in at least one policy.
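The "fulfilled in at least one policy" rule amounts to a logical OR over an institution's policies. A minimal Python sketch follows; the record layout, the institution names, and the indicator names are hypothetical and chosen only for illustration.

```python
def aggregate_by_institution(policies):
    """Combine policy-level codes (0/1) into institution-level codes.

    An indicator counts as fulfilled (1) for an institution when it is
    coded 1 in at least one of that institution's policies.
    The input layout (dicts with 'institution' and 'indicators') is assumed.
    """
    result = {}
    for policy in policies:
        merged = result.setdefault(policy["institution"], {})
        for name, present in policy["indicators"].items():
            # max() over 0/1 codes behaves as a logical OR.
            merged[name] = max(merged.get(name, 0), present)
    return result


# Toy data: two policies for one institution, one policy for another.
policies = [
    {"institution": "Univ A", "indicators": {"patents": 0, "citations": 1}},
    {"institution": "Univ A", "indicators": {"patents": 1, "citations": 0}},
    {"institution": "Univ B", "indicators": {"patents": 0, "citations": 0}},
]
by_institution = aggregate_by_institution(policies)
```

In the toy data, "Univ A" ends up with both indicators fulfilled even though neither of its policies mentions both.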
We were sometimes unable to obtain policy documents for target institutions. Specifically, where access was restricted to members of the institution only, data collectors emailed the institution's human resources department to ask for a copy of the policy. If no response was received within 10 days, data collectors recorded this information and sampled the next institution from the list for that country and stratum until a sufficient number of policies was obtained. Table 3 shows the institutions that did not have a public policy per country. An initial round of document collection occurred during the period November 2019 to March 2020. The sample was then further extended between March and April 2021. As some institutions had several distinct policies relating to different career stages, 143 total RPT policy documents were collected for analysis from the 107 institutions.

Data Charting
Data were extracted from the policy documents using a standardized data-charting form. The form was devised in multiple rounds of iteration. Key indicators for inclusion were identified from various sources, including the MoRRI indicators (MoRRI, 2018) and a group of studies performed in the North American context by Alperin et al. (2019), as well as from the surveyed literature. We collected and examined 17 different indicators (Table 4), including "traditional" assessment indicators relating to quantification and quality of publications, and a set of "alternative" indicators relating to open and responsible research, and related issues such as gender equality and Citizen Science. In addition, information was gathered on the date policies came into effect, the academic positions (e.g., tenure track, professor, lecturer, senior lecturer) or types of processes (e.g., promotion, review, tenure) they governed.
Five coders were involved, all with competence in English, three in German, and one in Portuguese. Policies were assigned based on language competences. They coded the presence (1) or absence (0) of each indicator in each policy, and copied the sentence mentioning the indicator and the ones before and after. Each document was coded by one individual. To assess intercoder reliability, an independent coder (TRH) performed a reviewer audit of a random sample of 10% of the total number of institutions. Comparing this second round of review to the first responses revealed a high intercoder reliability of 96.78%.
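The text does not state how the reliability figure was computed. Assuming it is simple percent agreement between the first coding and the audit (an assumption on our part), the calculation could be sketched in Python as follows, with made-up codes.

```python
def percent_agreement(first, second):
    """Share of coding decisions (in percent) on which two coders agree.

    first, second: equal-length sequences of 0/1 codes for the same
    policy-indicator pairs. Using plain percent agreement is an assumption;
    chance-corrected measures would give different values.
    """
    if len(first) != len(second):
        raise ValueError("Both coders must rate the same items.")
    if not first:
        raise ValueError("At least one coded item is required.")
    matches = sum(a == b for a, b in zip(first, second))
    return 100.0 * matches / len(first)


# Toy data: eight hypothetical coding decisions with one disagreement.
original_codes = [1, 0, 1, 1, 0, 0, 1, 0]
audit_codes = [1, 0, 1, 0, 0, 0, 1, 0]
agreement = percent_agreement(original_codes, audit_codes)
```

Percent agreement is easy to interpret but does not correct for agreement expected by chance, which matters when most codes are 0.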
Before carrying out the analysis, several steps were taken to ensure data integrity and consistency. Data were originally collected via spreadsheets and subsequently collated using R to avoid copy-paste errors. We checked that every indicator was present for each policy; that in cases where an indicator had been found (coded as 1) a text excerpt was present; and that in cases where no indicator had been found (coded as 0) also no text excerpt was present. The inconsistencies found were checked and resolved by TK.
Footnote 7: For the Indian case, most institutions did not have specific policies. To retain India in the sample, we therefore coded five evaluation forms, two policy documents and five documents that included both a policy and an evaluation form.
Table 4. Overview of data-charting form main elements ("Are the following mentioned as being taken into account in the promotion/evaluation procedures as stated in the policy?")
To facilitate the review of our results and the reuse of the data, we translated all non-English excerpts to English in a two-step procedure. First, we used DeepL (see footnote 9) to obtain an initial translation of the excerpt. A native speaker then checked the translation, revising to ensure that meaning, context, and use of special terms mirrored the original. The validated translation was then recorded alongside the original text for subsequent analysis.
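The three consistency checks on the charted data (every indicator present for each policy, an excerpt wherever the code is 1, no excerpt wherever the code is 0) could be expressed as below. The study collated its data in R; this Python version is only an illustrative sketch, and the record layout and field names are assumptions.

```python
def find_inconsistencies(records, expected_indicators):
    """Return a list of (policy, indicator, problem) tuples.

    records: list of dicts with keys 'policy', 'indicator', 'code' (0 or 1)
    and 'excerpt' (text or None). This layout is hypothetical.
    """
    problems = []
    seen = {}
    for rec in records:
        seen.setdefault(rec["policy"], set()).add(rec["indicator"])
        has_excerpt = bool((rec.get("excerpt") or "").strip())
        if rec["code"] == 1 and not has_excerpt:
            problems.append((rec["policy"], rec["indicator"], "coded 1 without excerpt"))
        elif rec["code"] == 0 and has_excerpt:
            problems.append((rec["policy"], rec["indicator"], "coded 0 with excerpt"))
    # Every policy must carry a code for every expected indicator.
    for policy, indicators in seen.items():
        for missing in sorted(set(expected_indicators) - indicators):
            problems.append((policy, missing, "indicator missing"))
    return problems


# Toy data: one clean record, two inconsistent ones, and one missing indicator.
records = [
    {"policy": "P1", "indicator": "patents", "code": 1, "excerpt": "Patents are considered."},
    {"policy": "P1", "indicator": "citations", "code": 1, "excerpt": ""},
    {"policy": "P2", "indicator": "patents", "code": 0, "excerpt": "Stray text."},
]
issues = find_inconsistencies(records, ["patents", "citations"])
```

Flagged rows would then be resolved by hand, as the text describes.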

Data Analysis
All data analysis was conducted using R (R Core Team, 2021), with the aid of many packages from the tidyverse (Wickham, Averick et al., 2019), including ggplot2 for visualizations (Wickham, 2016). Computational reproducibility of the analysis is ensured through the use of the drake package (Landau, 2018). The analyses presented in this paper are all exploratory and have not been preregistered. To enable the comparison of the indicators' presence against each other, we rely on Multiple Correspondence Analysis (MCA) (Greenacre & Nenadic, 2018; Nenadic & Greenacre, 2007). Correspondence analysis and its extension MCA are similar to principal component analysis (PCA) in mapping the relationships between variables to a high-dimensional Euclidean space. The goal of the method is then "to redefine the dimensions of the space so that the principal dimensions capture the most variance possible, allowing for lower-dimensional descriptions of the data" (Blasius & Greenacre, 2006, p. 5). The obtained dimensions can therefore be inspected for their alignment to specific variables, enabling conclusions about the main trends found in the data. MCA thus offers a visual representation of contingency tables and is well suited for the categorical data collected in this study. Furthermore, MCA allows us to investigate the relationship between indicators and countries jointly. Several considerations apply when analyzing data via MCA.
First, we apply MCA in a strictly exploratory fashion. Inspecting its visual output facilitates interpretation of the relationship between indicators and how they relate to countries, but we do not conduct any testing of hypotheses. Second, the graphical solution offered by MCA maximizes deviations from the average, allowing for statements of the prevalence of indicators relative to one another. For statements about absolute frequencies of indicators across countries we rely on MCA's numerical output (see Supporting information), as well as cell frequencies found in the corresponding contingency tables. All supporting data and required code are available via Zenodo (Pontika, Klebel et al., 2022b).

RESULTS
Assessing the prevalence of traditional and alternative (especially open/responsible research-related) criteria across policies of 107 institutions, we find substantial differences in their prevalence (Figure 1). While 72% of institutions mention "service to the profession," no institution mentions data sharing or Open Access publishing. Overall, traditional indicators, related to the profession or to scientific publications, are much more common than indicators related to open and responsible research.
In terms of more traditional indicators, by far the most common indicator mentioned in the policies was service to the profession, which includes activities such as organizing conferences or mentoring PhDs (72%). Extending the concept of professional service, almost half of the policies also mention peer review & editorial activities (47%). A second important aspect among the sampled policies is that of scientific publications, with frequent mentions of the number of publications, or publication quality. Although a call to rate quality over quantity is not uncommon in the policies, problematic practices were still worryingly prevalent. For example, journal metrics such as the Journal Impact Factor were mentioned in at least a quarter of the policies, while sheer productivity, as measured by quantity of publications, was present in around a fifth of cases.
Footnote 9: https://www.deepl.com/translator.
Indicators relating to open and responsible research were very rare. We discovered no mentions of data sharing or Open Access publishing. Creation of software was quite well represented (13% of cases), due to its prevalence in policies in Brazil, where it is mentioned at 75% of institutions (Figure 3; see Section 4.2). Mentions of RRI elements were more encouraging, as the RRI-related aspects of interactions with industry (37%), engagement with the public (35%), and engagement with policy makers (22%) were relatively well represented. However, issues relating to gender were mentioned only in 6-9% of cases.

Relationship Between Indicators
Institutions rely on a distinct combination of indicators and criteria to assess researchers. To investigate how these indicators are related (i.e., which aspects are commonly mentioned in tandem), we rely on MCA. This method relates criteria against each other and allows us to investigate deviations from the average, as well as which indicators commonly appear together. To further substantiate these findings, we provide bivariate correlations between all indicators in the Supporting information (Figure S2). Note that these analyses are exploratory and based on a data set of moderate size.
The first apparent aspect from analyzing the variables jointly is that the studied indicators tend to be cumulative (Figure 2). In broad terms, there is a basic divide between institutions that mention many criteria and those that mention few to none (see also Table 5 on how this relates to countries). Investigating relationships further, the first dimension (horizontal axis) draws heavily from engagement beyond academia: with industry, the public, and policy makers. Institutions commonly mention them together, with bivariate correlations of about .5 between the three indicators (see Figure S2). The same institutions also mention contributions to review & editorial activities, service to the profession, and publication quality more often than the average institution. On the other end of the spectrum (righthand side) are institutions that mention engagement beyond academia and service to the profession less frequently than the average. Although the first distinction (between institutions mentioning many indicators and engagement beyond academia in particular) is strongest, the second dimension (vertical) provides additional insight on the interrelatedness of the indicators. The divergence along this dimension revolves around institutions that mention publication quality and reliance on citations on one side, with institutions mentioning software, patents, and journal metrics, as well as citizen science, on the other side. It is noteworthy that publication quality and citations (often seen as a proxy indicator for publication quality), are mentioned jointly at an above average rate. Mentions of publication quality, on the other hand, are unrelated to mentions of journal metrics (such as the Journal Impact Factor, r = −0.01, 95% basic bootstrap CI [−0.21, 0.17]), which are considered a much more problematic indicator of research quality. 
Figure 2. Relationship between indicators for review, promotion, and tenure. The figure is a graphical representation of the relationships between indicators when considering their multivariate relationships. The figure's origin (0, 0) represents the sample average. "++" means that an indicator is present, "-" that it is not present. Engagement is abbreviated with "E." The horizontal axis (Dimension 1) accounts for 66.7% of variation in the data. This dimension mainly contrasts institutions that mention instances of engagement (with the public, industry, or policy makers) and citizen science, as well as service to the profession, with institutions that do neither. The vertical axis (Dimension 2) accounts for 8.4% of variation in the data. This dimension mainly contrasts institutions that value publication quality and rely on citations with institutions that value patents and journal metrics, as well as software, on the other end of the spectrum (bottom). Citizen science is found near the bottom of the axis but does not contribute strongly to this dimension. The indicators "Data sharing" and "Open Access publishing" are not included in the model, as both were not found in any of the policies. Furthermore, variables relating to gender were not included, because they relate to the composition of review panels rather than research assessment criteria per se.
Table 5. Low cell frequencies and empty cells prohibit the use of common chi-square metrics for contingency tables.
Citations and journal metrics are represented on opposite sides of the vertical spectrum, despite the criteria being weakly correlated (r = .22, [0.02, 0.44]). This is driven by the fact that journal metrics are moderately related to patents (r = .32, [0.14, 0.52]), but unrelated to publication quality. It should be noted that the concepts of "journal metrics" and "publication quality" might overlap in how they are applied in practice. While we coded text phrases such as "High quality scholarly outputs with significant authorship contributions" as pertaining to publication quality, this might in practice be assessed via journal metrics (e.g., Journal Impact Factor or ranking quartiles).
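The bivariate correlations in this section are reported with 95% basic bootstrap confidence intervals. As a rough illustration of how such an interval can be obtained for two binary (present/absent) indicators, the following is a minimal sketch and not the code used in this study. The indicator vectors, the function names (`pearson_r`, `basic_bootstrap_ci`), the number of resamples, and the seed are all hypothetical choices made for the example.

```python
import random
import statistics


def pearson_r(x, y):
    """Pearson correlation; for two 0/1 indicators this equals the phi coefficient."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None  # undefined when one indicator is constant
    return sxy / (sxx * syy) ** 0.5


def quantile(sorted_vals, q):
    """Quantile of an already sorted list, using linear interpolation."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def basic_bootstrap_ci(x, y, n_boot=2000, level=0.95, seed=1):
    """Return (estimate, lower, upper) using the basic bootstrap interval."""
    estimate = pearson_r(x, y)
    if estimate is None:
        raise ValueError("correlation is undefined for a constant indicator")
    rng = random.Random(seed)
    n = len(x)
    reps = []
    while len(reps) < n_boot:
        # Resample institutions (rows) with replacement, keeping pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        r = pearson_r([x[i] for i in idx], [y[i] for i in idx])
        if r is not None:  # skip degenerate resamples
            reps.append(r)
    reps.sort()
    alpha = 1 - level
    # Basic bootstrap: reflect the bootstrap quantiles around the estimate.
    lower = 2 * estimate - quantile(reps, 1 - alpha / 2)
    upper = 2 * estimate - quantile(reps, alpha / 2)
    return estimate, lower, upper


# Hypothetical data: 1 = indicator mentioned in an institution's policy.
journal_metrics = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0]
patents = [1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0]

r, lower, upper = basic_bootstrap_ci(journal_metrics, patents)
print(f"r = {r:.2f}, 95% basic bootstrap CI [{lower:.2f}, {upper:.2f}]")
```

Resampling whole institutions keeps the pairing between the two indicators intact, which is what allows the interval to reflect uncertainty in their joint distribution.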

Country Comparison
When comparing countries, we find differences in terms of the overall prevalence of indicators, but also their relative importance. The absolute number of indicators per country varies because we sampled more institutions for larger countries than for smaller ones. Importantly, however, the relative number of indicators also varies considerably (Table 5). Although just under a third of the analyzed indicators were identified in policies in Austria, Brazil, Germany, Portugal and the United Kingdom, the figures were 18% for the United States and 16% for India. The lower numbers in the United States and India may reflect the nature of the documents examined in those cases. We only examined institution-wide policies and in the United States it may be the case that detailed criteria are more often contained at departmental or faculty-level policies; in India (as stated) assessment forms were also analyzed, as few institutions had official policy documents (see Table 3).
Regarding the relative prevalence of the various indicators, we find considerable differences between countries. Figures 3 and 4 rely on the same data and enable a thorough examination of the differences present in our data across countries. In the following we summarize results for each country in detail.
• Austria: A very high share of sampled institutions mention the number of publications (67%), and half also mention journal metrics. Service to the profession, while the most common concept across countries, is mentioned in only 50% of institutions in Austria. A major distinction between institutions from German-speaking countries (i.e., Austria and Germany) and all other countries is that the former frequently mention concepts of gender, with four out of six Austrian universities mentioning gender equality, while such mentions are not found in any of the other countries.
• Brazil: All Brazilian institutions mention service to the profession, and three out of four mention patents, review & editorial activities, and software, while mentions of software are uncommon in other countries. Similar to India and Austria, journal metrics are mentioned quite frequently (42%). Sampled policies from Brazil are similar to policies from the United Kingdom in frequently mentioning service and engagement beyond academia but diametrically opposed in also frequently mentioning patents and software, both of which are very rare in the United Kingdom. Finally, both country profiles are relatively far from the sample average, indicating configurations that are less common among other countries.
• Germany: Policies from German universities are very similar to their Austrian counterparts, which suggests similarities based on shared cultural and academic traditions and influences. For example, policies from both Austria and Germany commonly mention gender equity. However, the concepts of patenting and review & editorial activities appear considerably more frequently in German policies than in Austria, with the fewest mentions of service to the profession across the sample also found in Germany.
• India: Contrary to all other countries, we find no evidence of policies referring to review & editorial activities as a criterion for promotion, and very few cases that refer to the number of publications that a given researcher has produced. On the other hand, mentions of journal metrics were very common in the policies sampled from Indian institutions (67%, n = 8), while less common among the other countries.
• Portugal: All sampled universities mention engagement with the public, which is a strong exception in the sample. Furthermore, many institutional policies mention service to the profession, engagement with industry, patents, as well as the number of publications. Indicators that are less common across the sample, such as citizen science, software, and citations, as well as engagement with policy makers, are not found at all in Portugal.
• United Kingdom: All sampled universities mention service to the profession, and four-fifths mention publication quality. Equally, institutions from the UK mention all three dimensions of engagement beyond academia (industry, public, policy makers) considerably more frequently than the average of the sample. Finally, policies mention patents and the number of publications considerably less frequently than institutions from other countries.
• United States: In line with the overall finding of a low propensity of indicators across universities from the United States (Table 5), all indicators were found at a slightly lower rate than in other countries. Given that the sample deliberately included more institutions from the United States, institutional policies from the United States are quite close to the average across the whole sample (Figure 4). The biggest deviations from the sample average are found with engagement with industry, which is mentioned least frequently in the United States compared with all other countries.
Figure 3. To allow for an investigation of which criteria are more common in a given country than in the rest of the sample, we project country profiles into this space. These "supplementary variables" do not have an influence on the layout of the indicators. The countries' positions are to be interpreted as projections onto the respective axes by examining their distance to indicators that are central to the respective dimension (see Section 3.1 and Figure 3 for the interpretation of the axes).
Overall, we find a low uptake of alternative evaluation criteria covering open and responsible research. However, there is substantial variation between countries (Figure 5). Summarizing eight core criteria ("Citizen science," "Data," "Engagement with industry," "Engagement with policy makers," "Engagement with the public," "Gender equality," "Open Access," "Software"), we find the highest uptake of alternative criteria in Brazil, Portugal, and the United Kingdom, with 1.9, 1.8, and 1.8 alternative criteria per university on average. Uptake is lower in Austria and Germany, and particularly low in the United States (0.7 criteria on average) and India (0.3).
Figure 5. Uptake of alternative indicators. Here we display how frequently alternative indicators are found in the policies. We consider the following eight indicators: "Citizen science," "Data," "Engagement with industry," "Engagement with policy makers," "Engagement with the public," "Gender equality," "Open Access," "Software". Dots represent the mean across all universities of a given country, with bootstrapped confidence intervals (95%).
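The per-country summary of alternative criteria (a mean count per university with a bootstrapped 95% confidence interval) can be illustrated with a short sketch. This is only an assumption-laden example: the text does not state which type of bootstrap interval was used, so a simple percentile interval is shown here, and the university records, function names, and resampling settings are hypothetical.

```python
import random
import statistics

# The eight alternative indicators named in the text.
ALTERNATIVE = [
    "Citizen science", "Data", "Engagement with industry",
    "Engagement with policy makers", "Engagement with the public",
    "Gender equality", "Open Access", "Software",
]


def count_alternative(policy_indicators):
    """Number of alternative indicators found in one university's policies."""
    return sum(1 for name in ALTERNATIVE if name in policy_indicators)


def mean_with_percentile_ci(counts, n_boot=2000, level=0.95, seed=1):
    """Mean count with a percentile bootstrap interval (an assumed CI type)."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(counts, k=len(counts)))
        for _ in range(n_boot)
    )
    alpha = 1 - level
    lower = means[int((alpha / 2) * (n_boot - 1))]
    upper = means[int((1 - alpha / 2) * (n_boot - 1))]
    return statistics.fmean(counts), lower, upper


# Hypothetical coded policies: one set of indicators per university.
universities = {
    "Country A": [
        {"Software", "Engagement with industry", "Patents"},
        {"Engagement with the public", "Service to the profession"},
        {"Engagement with industry", "Engagement with the public", "Software"},
    ],
    "Country B": [
        {"Journal metrics"},
        {"Engagement with policy makers", "Citations"},
        set(),
    ],
}

for country, policies in universities.items():
    counts = [count_alternative(p) for p in policies]
    mean, lower, upper = mean_with_percentile_ci(counts)
    print(f"{country}: mean = {mean:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```

With very few universities per country, as in this toy input, bootstrap intervals are coarse; the sketch is meant to show the computation rather than to suggest realistic interval widths.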

Comparison with Citation Ranking
Previous research has found no evidence of an association between universities' ranking positions and the prevalence of traditional or alternative criteria when controlling for geographic region (Rice et al., 2020). Here we conduct a similar analysis, examining differences in the citation ranking and its relationship to the set of criteria, while controlling for country. Removing the influence of countries is meaningful in this context, because an institution's location and its citation ranking are clearly linked (Figure S4).
After controlling for country, we find only small differences in the prevalence of indicators with respect to an institution's ranking (Figure 6). Institutions with a low as well as with a medium citation ranking are very close to the sample average on both dimensions. Both are characterized by slightly above-average mentions of the dimension of engagement beyond academia, as well as the dimension of service. Highly ranked institutions are characterized by slightly lower than average mentions of service, review & editorial activities, and engagement, but slightly higher mentions of publication quality, citations, and journal metrics (see also Figure S5).
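Controlling for country here means that ranking groups are formed within each country rather than across the whole sample. The sketch below shows one way such groups could be assigned. It assumes equal-sized thirds ("high," "medium," "low") within each country, which is an assumption for illustration only, as the exact cut points are not given in this section. The institution names, ranks, and the function name `within_country_groups` are hypothetical.

```python
from collections import defaultdict


def within_country_groups(institutions, labels=("high", "medium", "low")):
    """Assign each institution to a ranking group relative to its own country.

    `institutions` is a list of (name, country, rank) tuples, where a smaller
    rank number means a better position in the citation ranking.
    """
    by_country = defaultdict(list)
    for name, country, rank in institutions:
        by_country[country].append((rank, name))

    groups = {}
    for members in by_country.values():
        members.sort()  # best-ranked institution of the country first
        n = len(members)
        for position, (_, name) in enumerate(members):
            # Split the country's institutions into len(labels) ordered bins.
            groups[name] = labels[position * len(labels) // n]
    return groups


# Hypothetical sample: country "X" institutions all outrank those of "Y".
sample = [
    ("X1", "X", 1), ("X2", "X", 2), ("X3", "X", 3),
    ("X4", "X", 4), ("X5", "X", 5), ("X6", "X", 6),
    ("Y1", "Y", 100), ("Y2", "Y", 300), ("Y3", "Y", 200),
]

groups = within_country_groups(sample)
for name, _, rank in sample:
    print(name, rank, groups[name])
```

In this toy input, X6 lands in the "low" group and Y1 in the "high" group even though X6 has the better global rank, which is the intended effect of grouping within country.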

"Numbers Help"? Journal Metrics, Publication Quantities, and Publication Quality
We next look further into the ways in which two problematic practices (use of journal-level metrics and numbers of publications as indicators of quality and productivity respectively) are expressed in policies, as well as how policies discuss publication quality.
More than a quarter of the policies we examined mention the Journal Impact Factor or some other measure of journal/venue prestige as an assumed proxy for the quality of research published there. This was highest in India (67%) and Austria (50%). In the latter, unambiguous use of the Journal Impact Factor was found: "The evaluation is based on the journal rankings according to the impact factors from the unchanged ranking lists of the Institute of Scientific Information (ISI)" (Medical University of Vienna, AT_4a). Brazil (42%) also relied heavily on journal-level metrics, specifically the "QUALIS-CAPES classification," the Brazilian official system of journal classification (Pinto, Matias, & Moreiro González, 2016). Use of such metrics was least visible in the United Kingdom, where the 14% of policies that mention them also tend to be more circumspect in their language (e.g., "Excellence might be evidenced […] (in part) by proxies such as journal impact factors" (Teesside University, GB_6)).
Figure 6. Relationship between indicators with superimposed citation ranking groups. The relationships between criteria displayed in this figure are the same as in Figure 3. To allow for an investigation of which criteria are more common in a given ranking group than in the rest of the sample, we project the respective profiles into this space. These "supplementary variables" do not have an influence on the layout of the indicators. Ranking categories are calculated within-country to control for the influence of country on an institution's citation ranking. The ranking positions are to be interpreted as projections onto the respective axes, by examining their distance to indicators which are central to the respective dimension (see Section 3.1 and Figure 3 for the interpretation of the axes).
Numbers of publications as a criterion are present in around one in five policies, invoked in various ways. This criterion is especially common in Austria (67%; e.g., "The list of publications of a habilitation candidate must include at least 16 scientific publications in international relevant journals with peer review procedures, which have been published in the last 12 years," Medical University of Innsbruck, AT_3). Such quantification is sometimes used in the context of strict formulas that also use journal-level metrics (as at the aforementioned Medical University of Vienna (AT_4a): "The basic requirement for a habilitation is 14 points, with 1 point for a standard paper and 2 points for a top paper"). In the United States, quantity of publications is mentioned in 17% of cases, but usually emphasized as just one factor amongst others (e.g., "Quantity can be a consideration but quality must be the primary one" (University of Missouri-St Louis, USA_25)). In the striking words of one US institution, however, "numbers help" [emphasis ours] when reporting "the total number of peer-review articles or other creative and research outputs" (University of Nevada, Las Vegas, USA_31). In Germany, only 25% of institutions focused on publication numbers, and these often emphasized the intention "not to set a fixed minimum number, but rather an approximate guideline" (TU Dortmund, DE_12). However, in Austria and Germany we also found that many institutions ask for full publication lists as a matter of course as part of their criteria. We did not code these as explicitly supporting publication quantity as an indicator for assessment. In practice, however, the length of publication lists may be used as an unofficial factor in decisions.
As with journal metrics, UK policies only very rarely mention publication numbers as a factor (just 4%). This is in stark contrast to the number of UK policies mentioning publication quality as an important criterion (79%). Here, the influence of initiatives such as DORA and the United Kingdom's Forum for Responsible Metrics, and the way these have translated into the UK national assessment exercise, the Research Excellence Framework (REF), is clearly visible. We found, for example, exhortations to produce "high quality" work "that is judged through peer review as being internationally excellent or better in terms of originality, significance and rigour" (University of Sheffield, GB_9a). As we discuss below, this language is highly similar to that of the REF itself, suggesting that institutions have adapted their assessment policies to REF criteria.

DISCUSSION
The need to reform reward and recognition structures for researchers to mitigate effects of overquantification and incentivize uptake of open and responsible research practices is well understood. Our results show just how far there is to go.
We found that policies for assessing researchers for review, promotion, and tenure among an international sample of 107 institutions in seven countries largely relied on traditional criteria (service to the profession, review & editorial activities, publication quality, and patents). Alternative criteria related to open and responsible practices were much less prevalent. Here, considerations related to Responsible Research and Innovation, such as engagement with industry, the public, and policy makers, fared better, being present in between 22% and 37% of policies. Gender elements (including commitments to gender equity and gender balance of reviewers) were only present in between 6% and 9% of cases, and only found in Austria and Germany. Criteria related to Open Science were very rare: sharing of data and Open Access publishing were not found in any policy. These general findings across countries hence largely confirm previous findings, which have focused on US/Canadian institutions (Alperin et al., 2020; McKiernan et al., 2019; Niles et al., 2020) or a particular discipline (Rice et al., 2020).
Regarding differences between the countries studied, we found substantial variation alongside some common patterns. Overall, we found very few criteria in the policies from India and the United States. India likely constitutes a special case, with institutions often seeming to lack written policies beyond those implied by the required criteria in application forms. For the case of the United States, the distinction between general RPT policies and those from specific departments and schools is crucial. The analyzed policies represented general policies, which in many cases laid out general principles but did not include more specific criteria, which were to be defined by each school or department. This is in contrast to policies from Austria or Germany, where the policies applied equally university-wide; however, these policies were very specific and we did not find evidence of further policies at faculty or department level.
Criteria relating to engagement beyond academia (industry, policy makers, public) appeared together very often (correlations around r = 0.5), and most commonly in Brazil, Portugal, and the United Kingdom. The high share of UK institutions mentioning this type of outreach can be related to the influence of the REF as an organizing principle. Twenty-five per cent of the profile for an institutional score in the REF is attributed to "Impact," defined as "an effect, change or benefit beyond academia" (Sutton, 2020), and the prevalence of broader impact criteria in current institutional policies seems to reflect this importance. Similarly, the effect of the REF on definitions of quality of outputs and the diminished use of journal metrics as a proxy for quality can clearly be seen in the United Kingdom, where journal metrics were barely mentioned and many policies foregrounded the quality of publications themselves, sometimes literally using the REF definition of "Quality of research in terms of originality, significance and rigour," as in the case of the University of Sheffield (GB_9a) quoted above.
There are high similarities between the United Kingdom, Portugal, and Brazil, with a strong emphasis on service, review & editorial activities, and the dimension of engagement beyond academia. However, the United Kingdom is very distinct from Portugal and Brazil in terms of patents and publication quality. While patents are very frequently mentioned in Brazil and Portugal, but almost not at all in the United Kingdom, publication quality was found very often in the United Kingdom but not at all in Brazil, and only in one out of six universities in Portugal.
Relationships between the presence or absence of specific indicators and an institution's relative level of "prestige" (imperfectly captured here via their citation ranking) are weak. Our findings align with results by Rice et al. (2020), who reported statistically insignificant coefficients for universities' ranking positions on the uptake of traditional and alternative indicators, after controlling for country. Institutional policies explicitly rewarding high levels of citations or publications in journals with a high Journal Impact Factor do not seem to translate easily to an institution's increased success in this regard. Hence, not only are such policies problematic in incentivizing gaming of metrics and potentially fostering bad practices (Higginson & Munafò, 2016;Ioannidis, 2005), but they do not even necessarily work as desired to raise an institution's position in such rankings.
In addition, the finding that institutional prestige is less of a factor than we might have expected highlights the extent to which local or regional norms of research assessment need further study. We suggest that a main takeaway from our findings is that although the overall trends (predominance of traditional and quantitative indicators and general lack of newer metrics of open and reproducible practices) are visible across the countries, substantial differences in the emphasis on specific sets of criteria exist. This has implications for reform of reward and recognition.
Current RPT policies result from a complex network of factors, including diverging evaluation cultures, differing levels of institutional autonomy, and institutional preferences (Adler et al., 2009; Brown, 2014; Coonin & Younce, 2009; Gardner & Veliz, 2014; King, Acord, & Earl-Novell, 2010; McGill & Settle, 2011; Seipel, 2003; Walker et al., 2010). The best route forward on reforming these local assessment cultures is thus to be decided in light of historical and contextual considerations. Strict, one-size-fits-all reforms would be insufficient in bringing about desired outcomes across the complex web of differing evaluation cultures.
In addition, open and responsible research practices are still moving into the mainstream, and at different rates in differing countries, regions, and types of institutions. Factors such as levels of resources mean local contexts will have different levels of preparedness to adopt open and reproducible research practices. Because, we assert, it would be unfair for an institution to expect these practices before being able to adequately support their implementation (through training, services, and infrastructure), it is essential that reforms are built upon adequate institutional foundations for performing open and reproducible research (Ross-Hellauer, Reichmann et al., 2022). In addition, particular open and responsible practices are of different relevance across disciplines, and hence reform must respect disciplinary cultures. Indeed, two recent surveys of research institutions by the European University Association (EUA) highlight several barriers to change in research assessment from the institutional point of view (Morais & Borrell-Damian, 2018; Saenen, Morais et al., 2019). Primary among these is the sheer complexity of the issue, which (as already stated) must account for differences in disciplines and career stage but also the various levels at which rewards and incentives can be structured, such as the level of research groups, departments, faculties, and institutions, as well as (cross-)national actors such as governments and research funders. Other factors identified by the EUA surveys include lack of capacity, need to align policies with national or international agendas, resistance to reform from researchers or management, worries about increased costs, and lack of evidence on benefits (Morais & Borrell-Damian, 2018; Saenen et al., 2019).

CONCLUSION
Current assessment of research and researchers forms a major barrier to the uptake of open and reproducible research, retaining too much focus on inappropriate indicators, productivity as determined by quantity, and individual achievements rather than collaborative open research practices and the socioeconomic impact of research. In this paper, we have quantitatively demonstrated that, across countries, inclusion of criteria that reward open and responsible research practices remains rare. Although outreach to stakeholders beyond academia (public, industry, policy makers) is somewhat better represented, practices related to Open Science are certainly not. Our sample is unique in addressing many countries across disciplines, and as such demonstrates that although general trends of overquantification and undervaluing of open and responsible research can be observed, important differences between countries can be seen. In seeking reform, care must be taken to respect the historical and contextual reasons for such divergences.

Limitations and Future Work
The collection of policies was conducted under time constraints and in two rounds (1 year apart). In addition, obtaining policies was difficult as they were often internal documents. Hence, there may have been some time lag in that the policies we examined were not the most recent versions.
The search protocol for the construction of the sample implied looking for the general RPT policy for the whole institution. This left out departmental policies and other policies such as those on Ethics, Diversity, and Open Access. In the Portuguese case, most universities have separate Open Access policies that in some cases are tied to promotion criteria; these were therefore not included in our analysis. The focus on general policies might also explain the overall low rate of indicators found in policies from the United States.
All analyses presented in this paper are of exploratory nature. Our paper strives to explore the landscape of RPT policies to gather initial evidence on the prevalence of specific concepts. Future studies that focus on particular aspects and analyze them with a highly targeted approach (with preregistered hypotheses of particular relationships) would be a meaningful extension of our work.
This work only identifies the prevalence of concepts in documents. It does not further analyze the particular contexts of use. Future qualitative work would provide this. In addition, this work does not consider how these policies were actually put into practice, and future survey or interview work would help in understanding these broader contexts. As an example, many policies mention that candidates should submit a full CV, including a complete list of publications. We did not consider these to be instances that quantify research output (indicator "Number of publications"). However, it is reasonable to assume that all requested materials will be considered to some extent, and that longer publication lists might be helpful.
A further factor to consider is, of course, the degree to which policies actually guide practice in review, promotion, and tenure evaluations. As one of our anonymous reviewers astutely notes, some studies (e.g., Langfeldt, 2001) indicate that reviewers are often highly selective in adhering to such criteria. If, for example, the removal of explicit reference to Journal Impact Factors is not also followed by a cultural change whereby assessors are educated about the reasons they are not a good indicator of individual performance, then such factors may continue to play a role, even unofficially. Future work may build upon existing work (e.g., Hammarfelt, 2017; Hammarfelt & Rushforth, 2017) to further explore how criteria are implemented and weighted across disciplines.
Future work could also try to collect a larger number of RPT policies from a broader variety of geographical areas that are not included in our research. A greater sample would enable enhanced understanding of local contexts. Indeed, randomized sampling of a much broader range of countries might enable stronger claims about the state of criteria in use for RPT globally.