Abstract
Open Access (OA) facilitates access to research articles. However, authors or funders often must pay the publishing costs, preventing authors who do not receive financial support from participating in OA publishing and gaining citation advantage for OA articles. OA may exacerbate existing inequalities in the publication system rather than overcome them. To investigate this, we studied 522,411 articles published by Springer Nature. Employing correlation and regression analyses, we describe the relationship between authors affiliated with countries from different income levels, their choice of publishing model, and the citation impact of their papers. A machine learning classification method helped us to explore the importance of different features in predicting the publishing model. The results show that authors eligible for article processing charge (APC) waivers publish more in gold OA journals than others. In contrast, authors eligible for an APC discount have the lowest ratio of OA publications, leading to the assumption that this discount insufficiently motivates authors to publish in gold OA journals. We found a strong correlation between the journal rank and the publishing model in gold OA journals, whereas the OA option is mostly avoided in hybrid journals. Also, results show that the countries’ income level, seniority, and experience with OA publications are the most predictive factors for OA publishing in hybrid journals.
1. INTRODUCTION
The unrestricted availability of Open Access (OA) publications is linked to the goal of granting all interested parties free access to scientific knowledge and ensuring greater equality of access (Munafò, Nosek et al., 2017). This view is strongly related to the consumers of scholarly knowledge, who then would not have to pay for access. However, when taking the authors of those articles into account, they are affected by OA in two different ways: when choosing a publication model for an article and when receiving citations (and hence reputation) for articles that have been published via a certain model (usually described as citation advantage; see, for example, Langham-Putrow, Bakker, and Riegelman (2021)). Those two aspects of OA may introduce significant biases and inequity into the scholarly publication and reputation system because they may restrict participation in OA in particular ways (Bahlai, Bartlett et al., 2019).
First, the OA publishing model generally shifts the publishing costs from readers to authors or their institutions and funders by introducing article processing charges (APCs). This can be a severe constraint for those authors who cannot afford these costs or do not receive any financial support. To overcome this issue, most publishers have implemented an APC waiver/discount policy for authors from, for example, low-income countries (Lawson, 2015). However, it is an open question as to how the different options for OA publishing and waivers/discounts are considered and adopted by researchers with various characteristics, such as their countries’ income level and also their seniority and gender—factors that are also often associated with the decision to publish OA (Iyandemye & Thomas, 2019; Olejniczak & Wilson, 2020; Simard, Ghiasi et al., 2021; Smith, Merz et al., 2021; Zhu, 2017). Rouhi, Beard, and Brundy (2022) discussed the waiver issues from the perspectives of the publisher, institutions, and developing countries. They mentioned the potential unfairness that authors are confronted with, which may be caused by APC-based models. They argued that waiver programs have yet to address this problem successfully. They suggested that meeting the equity standard requires a cross-functional approach involving publishers, funders, research institutions, individual researchers, libraries, and service providers.
To accommodate OA publishing costs, three funding options have emerged over time. First, diamond OA journals are funded by public institutions, such as libraries, which enable free reading and publishing for all researchers. Second, transformative agreements between public institutions and publishers have been introduced that include reading and publishing contracts and which are also funded by the institutions. In this case, there are no direct fees for authors, but their institutions pay the APCs as part of a consortium. Access to publishing and access to publications is limited to participating organizations only. Third, APCs could also be paid by the authors or their institutions themselves. The first option leads to gold OA at the journal level. Transformative agreements allow authors to publish in either gold OA or hybrid journals (which—for a fee—allow publishing individual articles as an OA-variant). The third option is often associated with hybrid journals. All other publishing models for journals usually require funding via subscriptions, resulting in closed-access (CA) articles that can only be read after paying the article or journal fee.
The publishing model is also strongly associated with the visibility of authors and articles. For many researchers, it makes a difference in which journals they publish (e.g., considering discipline-specific journal rankings). If they want to be noticed by others and/or seek promotion, it can be crucial to publish in reputable journals, especially for early-career researchers. To achieve this, not only do financial hurdles and APCs have to be overcome, but also, for example, English language skills and technical skills are needed, as well as institutions that can help with legal advice or infrastructure support. Against this background, researchers have to decide which publishing model to choose and whether OA is not only an altruistic but a feasible option at all.
The second possible source of bias and inequity is related to paying for access: It has already been shown that articles published as OA variants are more visible, leading to higher citation counts and altmetrics (Evans & Reimer, 2009; Fraser, Momeni et al., 2020; Lewis, 2018; McKiernan, Bourne et al., 2016; Ottaviani, 2016). Moreover, the Matthew effect shows that researchers who are already well known and widely cited receive even more citations (Farys & Wolbring, 2021), which directly affects rewards for publication in prestigious journals, for prominence, and citations. For researchers, publications play a central role in their daily practice and the reputation system in which they operate. Publications enable researchers to build on the body of knowledge and refer to those findings by citing the publications (which accumulate reputation in this way). Hence, access to publications is crucial for the progress of science and the building of reputation, both of which can be impeded by a lack of access to OA publishing options and the risk of CA articles not being cited as frequently as OA articles.
From that, we hypothesize that researchers with better access to financial resources have better access to publications—both in terms of access to read openly and in terms of access to publish openly. Associated with that may be an even stronger citation advantage for those researchers (usually WEIRD: Western, educated, industrialized, rich, and democratic (Henrich, Heine, & Norenzayan, 2010)) with extensive OA-publishing options. As such, OA may carry the risk of perpetuating already existing inequalities rather than resolving such marginalization in the scholarly communication system (Fox, Pearce et al., 2021).
2. RELATED WORK
Related work also indicates a strong association between economic factors, OA, and citation advantages. The scientific output of countries is associated with their economic evolution because scientific progress needs governments’ financial support. Samimi (2011) used a Granger Causality Test to examine the causal relationship between scientific output and GDP in 176 countries and found a two-way positive relationship between them. King (2004) compared published papers and their citation impacts across countries and found that only 31 countries contributed to 98% of the world’s highly cited papers and that the remaining 161 countries contributed less than 2%.
OA publishing is also highly influenced by the authors’ country of affiliation, because it determines APC waiver/discount policies or the availability of transformative agreements with publishers. Some publishers offer general waivers or have a discount policy for all of their journals for eligible authors, and the country’s income level mainly determines eligibility. Lawson (2015) studied the waiver policy of the 32 most prominent publishers and found that 68% of them grant APC waivers. Simard et al. (2021) found that low-income countries publish and cite OA more than upper-middle and high-income countries. The positive correlation between OA citing and publishing is 1.3 times weaker for high-income countries than for other countries. Similarly, Iyandemye and Thomas (2019) showed that biomedicine researchers from low-income countries have the highest percentage of OA publications. In contrast, Smith et al. (2021) reported proportionately fewer OA articles in Elsevier’s journals from low-income countries, despite their eligibility for APC waivers.
Olejniczak and Wilson (2020) studied the articles published by faculty members at research universities in the United States and found that male and senior authors are more likely to publish in OA form. Zhu (2017) conducted a survey of over 1,800 researchers at 12 Russell Group universities1 to find differences in OA publishing by discipline, seniority, and gender. The results revealed disciplinary differences in OA publishing (Medical and Life Scientists are most likely to publish in gold OA journals) and a stronger tendency toward OA publishing among senior authors and among men.
The journal rank is a decisive factor in submitting the article in addition to its business model. Schroter, Tite, and Smith (2005) conducted a survey study with 28 international authors who submitted to the British Medical Journal and found that for authors, the journal’s ranking is more important than the availability of OA.
Many studies have investigated the OA citation outcome, and most found a citation advantage for OA articles (Evans & Reimer, 2009; Fraser et al., 2020; Lewis, 2018; McKiernan et al., 2016; Ottaviani, 2016). However, regarding biases (e.g., quality bias, self-selecting, mandating, self-archiving), different sampling and controlling data make it difficult to conclude that receiving more citations is only the effect of OA. Momeni, Mayr et al. (2021) studied the citation impact of flipping journals from CA to OA and generally found a slightly higher growth in receiving citations compared to journals in the same discipline and the impact factor’s range. However, they did not observe this trend in all scientific fields. Momeni, Mayr, and Dietze (2022) examined the correlation between different factors and the future authors’ h-index and found a positive but weak correlation between them.
One issue that is often discussed together with OA publishing and APCs is the problem of predatory publishing. Predatory publishers take advantage of the OA movement but work against good scientific practice. Ross-Hellauer, Reichmann et al. (2021) did a systematic review to study the threat to equity in science via open science implementations. They concluded that less well-resourced researchers, researchers from non-English-speaking countries, and early-career researchers are particularly affected by the predatory publishing problem.
3. RESEARCH QUESTIONS
We conduct our study on the association between publishing models, the economic background of researchers, and other author-specific and structural factors along three major research questions:
RQ1: What is the relationship between the income level of researchers’ affiliation countries and their publication behavior (do they prefer OA or CA)?
RQ2: What is the relationship between the income level of researchers’ affiliation countries and their publication behavior (OA or CA) with their citation impact?
To answer these questions, we categorize corresponding authors based on the income level of their affiliation country and compare the access status of articles they have published and their citation impact. Whereas the first two RQs are rather descriptive and aim at quantifying the extent to which access to publish openly and access to read openly (and along with it to make them easier/more likely to cite) are related to the economic background of authors, the third RQ takes a variety of factors into account that have been shown to be strongly associated with tendencies to publish OA (Iyandemye & Thomas, 2019; Olejniczak & Wilson, 2020; Simard et al., 2021; Smith et al., 2021; Zhu, 2017).
RQ3: What factors (e.g., journals, articles, authors, or their countries) are associated with selecting the business model of publications (OA against CA)?
Here we aim to give a detailed view of the factors associated with OA publishing using correlation, regression, and machine learning analyses. To this end, structural features, such as APC waivers, are considered alongside author-specific properties, such as gender or years of publishing activity (see Table 2). We will also look closely at the different access forms of publications, such as gold OA, hybrid, and CA. At the journal level, the relationships between journal rankings, APCs, and research fields (Health Sciences, Life Sciences, Physical Sciences, Social Sciences, and multiple fields) will be examined. In addition, possible country-related influencing factors will be investigated, such as countries’ income level, the existence of transformative agreements, or opportunities for researchers to obtain APC discounts or waivers. At the journal article level, the ratio of OA to CA citations in an article and the number of authors involved are examined. Other author-specific influencing factors can be gender and age, the ratio of OA to CA publications in the past, or the proportion of international coauthors.
4. DATA AND METHODOLOGY
To conduct our study, information on the business model, author characteristics, and article impact are needed, and several approaches and databases must be linked to receive a complete data set.
4.1. Data Selection
For the business model of journals (OA, hybrid, CA) it is possible to crawl the information from the journal’s or publisher’s website or to look up sources such as the Directory of Open Access Journals (DOAJ) and Unpaywall, which both include OA information. But information about the history of the business model of journals is rarely available. In recent years, many journals have converted (flipped) from CA to OA and vice versa, but often there is not enough information about the exact date on which the new access model started. The Open Access Directory (OAD), a wiki hosted by the School of Library and Information Science at Simmons University2, is the only resource containing a list of a few flipped journals and the date of flipping. The OA start date of journals was available in the DOAJ dataset until 2020. Bautista-Puig, Lopez-Illescas et al. (2020) and Momeni et al. (2021) used the OAD and DOAJ for their studies about flipping journals. Unfortunately, the DOAJ has now stopped collecting that information: “As time progressed, open access models became more complicated … It has become harder to find the right answer to that seemingly simple question: when did open access start for this journal?”3 Matthias, Jahn, and Laakso (2019) employed different snapshots of data sets that have OA status (Scopus, DOAJ, Ulrichsweb, publishers’ websites, etc.) and some other resources to identify reverse flips (conversions from OA back to CA) and verified them manually. For bibliometric analyses related to OA, it is necessary to know the access status of journals for the period in which the effect of OA is studied. Obtaining this information coherently requires looking into different journals’ business models and harmonizing them to make them comparable. In addition, every publisher has its own rules for APC exemptions to foster publishing in OA format. For example, eligibility for APC waivers for publishing in Elsevier’s journals is based on the “Research4Life program”4 and for Springer Nature on the “World bank classification.” Various transformative agreements with publishers and the period of their contracts are other influential factors that should be considered in studying the publishing behavior of each publisher separately.
Due to these varying APC-related rules for different publishers, we focused on one major publisher. To analyze papers for various disciplines and countries, we chose Springer Nature, the largest publisher of academic journals (more than 2,900 journals5) with worldwide authors from various disciplines, which provides us with a large amount of data and data diversity for more accurate results. Also, compared to Elsevier, the second most prominent publisher of scholarly journals (over 2,700 journals6), this publisher has a higher OA uptake (Sotudeh, Ghasempour, & Yaghtin, 2015; Sullo, 2016), resulting in less data skewness.
We downloaded the list of journals and their access status from the snapshot from the year 2019, which is available on the publisher’s website7. Three publishing models exist for these Springer Nature (SN) journals: Gold OA, Hybrid (with the open access option: Open Choice), and CA. Figure 1 displays the distribution of journals and their publishing models.
Figure 1. Distribution of Springer Nature’s journals by (a) publishing model and (b) field and publishing model.
For the bibliometric analyses, we employed Scopus8. We matched the list of SN journals with journals in Scopus via title and ISSN. From 3,138 SN journals, we could match 2,757 journals, which we used for further analyses. Because of the problems regarding journals’ flipping mentioned above, we limited our data to two years, 2017 and 2018, to reduce the errors related to detecting the journals’ and articles’ business models. This resulted in 522,411 articles.
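The matching step can be illustrated with a minimal sketch. The text states only that journals were matched via title and ISSN; the normalization rules, the ISSN-first-then-title order, and the record keys `title` and `issn` below are assumptions made for illustration, and the journal names are invented.

```python
import re


def normalize_issn(issn):
    """Return an ISSN as 8 characters without hyphen, or None if malformed."""
    if not issn:
        return None
    cleaned = re.sub(r"[^0-9Xx]", "", issn).upper()
    return cleaned if len(cleaned) == 8 else None


def normalize_title(title):
    """Lowercase a title and collapse punctuation and whitespace."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def match_journals(publisher_journals, index_journals):
    """Match publisher journals to index records, by ISSN first, then by title.

    Both arguments are lists of dicts with the hypothetical keys
    'title' and 'issn'. Returns {publisher title: matched index record}.
    """
    by_issn, by_title = {}, {}
    for record in index_journals:
        issn = normalize_issn(record.get("issn"))
        if issn:
            by_issn[issn] = record
        by_title[normalize_title(record["title"])] = record

    matches = {}
    for journal in publisher_journals:
        issn = normalize_issn(journal.get("issn"))
        hit = by_issn.get(issn) if issn else None
        if hit is None:  # fall back to the normalized title
            hit = by_title.get(normalize_title(journal["title"]))
        if hit is not None:
            matches[journal["title"]] = hit
    return matches


# Invented example records.
publisher = [
    {"title": "Journal of Examples", "issn": "1234-5678"},
    {"title": "Annals of Testing & Trials", "issn": None},
    {"title": "Unindexed Letters", "issn": "0000-0000"},
]
index = [
    {"title": "JOURNAL OF EXAMPLES", "issn": "12345678"},
    {"title": "Annals of testing & trials.", "issn": "2222-333X"},
]
matched = match_journals(publisher, index)
unmatched = [j["title"] for j in publisher if j["title"] not in matched]
```

Journals left in `unmatched` correspond to the journals that could not be linked and were therefore dropped from further analysis.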
To detect the publishing model of articles in hybrid journals, we employed Unpaywall9 (the snapshot of 2019), a service to find the available version of articles. We obtained the publishing model of articles in hybrid journals from metadata in this data set.
We obtained the APC amount in U.S. dollars for 1,741 hybrid journals and 297 gold OA journals from the website of Springer Nature10. There was no fixed APC for 147 gold OA journals (only 5% of investigated articles belong to these journals); the exact amount would have had to be looked up on each journal’s website. Therefore, we set the APC amount for these journals to null (empty) and excluded them from the data for the classification task.
To detect the gender status of authors, we utilized a combined name and image-based approach introduced by Karimi, Wagner et al. (2016), which categorizes gender into male and female. Following this method, we first tried to detect gender using the API at Genderize.io11. For names whose gender the API could not identify, we searched for the names on the web and detected gender using image-based recognition algorithms, which increases recall and accuracy compared to Genderize.io alone (Karimi et al., 2016). We acknowledge that a person’s gender is not a binary variable; gender identities beyond male and female cannot be identified with this approach and are therefore left out of the analysis. Using the Scopus author ID, we found 381,074 unique corresponding authors for the investigated articles; 10,614 of them (about 3%) had only initials or no first name, so we could not detect their gender.
Overall, we identified the gender status for 49% of authors. Therefore, we excluded 254,044 articles (about 49%) for which we could not detect the gender status of the corresponding author from the data used in the regression analysis and classification task. One possible reason for the low rate of gender identification is the large share of authors affiliated with Asian countries (136,591; above 35%)12 and probably originally from these countries. Previous studies tested gender detection tools for authors of different nationalities and found them less effective for Asian names (Karimi et al., 2016; Santamaría & Mihaljević, 2018). Table 1 shows, across scientific fields, the number and percentage of OA and CA publications for which the gender status of the corresponding author was detected. The percentage of publications with a detected author gender is 4 percentage points higher for OA than for CA publications.
Table 1. Number and proportion of articles among scientific fields and publishing model for which we detected the gender status of their corresponding author

Field | CA model (%) | OA model (%) |
---|---|---|
Health Sciences | 31,642 (53) | 20,534 (49) |
Life Sciences | 23,011 (54) | 10,032 (57) |
Physical Sciences | 74,742 (48) | 9,927 (50) |
Social Sciences | 9,210 (40) | 2,020 (41) |
Multiple fields | 38,507 (52) | 48,742 (58) |
Total | 177,112 (50) | 91,255 (54) |
4.2. Features and Definitions
To investigate the factors that are associated with higher rates of OA publishing, we defined some features presented in Table 2. Figure 2 presents an overview of data collection and preparation steps. The final analyzed data is available in a Git repository13.
Table 2. Features used to study the factors associated with OA publishing

Feature type | Feature | Description |
---|---|---|
Journal | journal_ranking | h-index ranking of the journal in the related discipline (for multidisciplinary journals, the average ranking among disciplines). |
Journal | journal_APC | The cost of APC to publish OA in the journal (US dollars). |
Journal | field | Field of journal: Health Sciences, Life Sciences, Physical Sciences, Social Sciences, or multiple fields (if the journal has more than one field). |
Country | country_income | Income level (GDP per capita) of the country in which the corresponding author is affiliated. |
Country | OA_agreement | If the corresponding author’s country of affiliation has an OA agreement with the publisher, it equals 1, otherwise 0. |
Country | discount_eligible | If the corresponding author’s country of affiliation belongs to the lower-middle income group, it equals 1, otherwise 0. |
Country | waiver_eligible | If the corresponding author’s country of affiliation belongs to the low-income group, it equals 1, otherwise 0. |
Paper | OA_cite | Ratio of citing OA against CA in this paper. |
Paper | authors_count | Number of authors. |
Author* | gender | Equals 0 for females and 1 for males. |
Author* | age | Years since first publication. |
Author* | OA_publish | Ratio of OA publications against CA in the past (number of previous OA publications divided by the number of CA publications). |
Author* | international_coauthors | Proportion of international coauthors** to all coauthors in this paper. |

* Corresponding author.
** An international coauthor is a coauthor who has a different affiliation country than the corresponding author.
To compare publishing and citation behavior across countries, we classified countries by income based on the World Bank classification14 into four groups: low, lower middle, upper middle, and high-income economies. The income level of a country is evaluated every year, and its history is available15. Of the 218 countries listed by the World Bank, we excluded 20 whose income level changed between 2015 and 2018. Springer Nature offers an APC waiver to articles whose corresponding author is from a low-income country and an APC discount to those whose corresponding author is from a lower middle income country (as classified by the World Bank)16.
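As an illustration of how the country-level flags in Table 2 could be derived, the following sketch drops countries whose income group changed between 2015 and 2018 and sets `waiver_eligible` and `discount_eligible` for the remaining ones. The function names, the data layout, and the placeholder countries are hypothetical; only the rules (stable group required, waiver for low income, discount for lower middle income) come from the text.

```python
YEARS = (2015, 2016, 2017, 2018)


def stable_income_group(groups_by_year, years=YEARS):
    """Return the income group if it is identical in all years, else None."""
    observed = {groups_by_year.get(year) for year in years}
    if len(observed) == 1 and None not in observed:
        return observed.pop()
    return None


def eligibility_flags(income_group):
    """Map an income group to the two binary eligibility features."""
    return {
        "waiver_eligible": int(income_group == "low"),
        "discount_eligible": int(income_group == "lower middle"),
    }


# Placeholder classification history, not real World Bank data.
history = {
    "Country A": {2015: "low", 2016: "low", 2017: "low", 2018: "low"},
    "Country B": {2015: "lower middle", 2016: "lower middle",
                  2017: "upper middle", 2018: "upper middle"},
    "Country C": {2015: "lower middle", 2016: "lower middle",
                  2017: "lower middle", 2018: "lower middle"},
    "Country D": {2015: "high", 2016: "high", 2017: "high", 2018: "high"},
}

features = {}
for country, groups in history.items():
    group = stable_income_group(groups)
    if group is not None:  # countries with a changing group are excluded
        features[country] = eligibility_flags(group)
```

In this toy input, "Country B" changes group in 2017 and is therefore excluded, mirroring the exclusion of the 20 countries described above.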
From the Transformative Agreement Registry website provided by ESAC17, we found three organizations with an open access agreement with this publisher during the investigated years 2017 and 2018 (KEMOE/FWF in Austria, Max Planck Society in Germany, and Bibsam consortium in Sweden) and two organizations (VSNU-UKB in the Netherlands and FinELib consortium in Finland) in 2018. We obtained the list of institutions involved in the agreements by asking the KEMOE/FWF, Bibsam, and FinELib organizations. The list of institutions participating via VSNU-UKB was available on the website of SN18. We assumed that publications with the corresponding author affiliated with institutions included in a transformative agreement are free of APC charges. To find Max Planck institutions, we used disambiguated institutional addresses for German institutions (Rimmert, Schwechheimer, & Winterhager, 2017) available on Scopus-KB. We manually looked up the participating institutions for the other four countries. We found 12,323 such articles and used them to set the value of the feature OA_agreement.
Figure 3 shows the number of articles published by Springer Nature, grouped by the income group of the corresponding author’s country of affiliation. Sixty-seven articles had a corresponding author with multiple affiliation countries, and we excluded them from the analyses. Publication distributions by country and income level are available on GitHub19.
Figure 3. Number of papers published by Springer Nature grouped by income level of countries.
We needed to identify authors and their publications to obtain the ratio of authors’ previous OA publications. Scopus Author Id enabled us to get each author’s published article list. For the variable Country income, we consider average GDP per capita in 2017 and 2018 obtained from the World Bank group20. We used the year of the first publication of authors indexed in Scopus to calculate their career age as a measurement of seniority.
To evaluate and rank the quality of journals, we employed the journal’s h-index, which Hodge and Lacasse (2011) suggested as a better measurement for ranking journals than the five-year impact factor in social science that has been used in previous studies (Barner, Holosko, & Thyer, 2014; Xia, 2012). We calculated the h-index of all journals in Scopus classified in 27 subject categories21 between the years 2011 and 2016.
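For readers unfamiliar with the measure, a journal's h-index is the largest h such that h of its articles have at least h citations each. The sketch below computes it from a list of per-article citation counts and ranks journals within one subject category. The tie handling (tied journals share a rank) and the example counts are assumptions, as the text does not specify them.

```python
def h_index(citation_counts):
    """Largest h such that h articles have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for position, citations in enumerate(ranked, start=1):
        if citations >= position:
            h = position
        else:
            break
    return h


def rank_by_h_index(journal_citations):
    """Rank journals of one subject category by descending h-index.

    journal_citations maps a journal name to its per-article citation counts.
    Rank 1 is the highest h-index; tied journals share the same rank.
    """
    scores = {name: h_index(counts) for name, counts in journal_citations.items()}
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    ranks = {}
    previous_score, previous_rank = None, 0
    for position, (name, score) in enumerate(ordered, start=1):
        rank = previous_rank if score == previous_score else position
        ranks[name] = rank
        previous_score, previous_rank = score, rank
    return ranks


# Invented citation counts for three journals in one subject category.
category = {
    "J1": [10, 8, 5, 4, 3],
    "J2": [25, 8, 5, 3, 3],
    "J3": [1, 1, 0],
}
ranks = rank_by_h_index(category)
```

Note that "J2" has the single most-cited article but a lower h-index than "J1", which is the property that makes the h-index less sensitive to a few outliers than a mean-based indicator.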
4.3. Methodology
4.3.1. Normalizing the citation impact
We employed a percentile-based normalizing approach to present the citation impact of articles. Because the citation count is confounded by time since publication, we consider the citations received during a time window of 2 years after publication, as in previous studies (Jannot, Agoritsas et al., 2013; Piwowar, Priem et al., 2018). Next, we categorized the articles into groups with the same subject category and publishing year and assigned each article a percentile rank (PR) from 0 to 100 based on the citations it received. We define a PR of 50 (the citation median) as the threshold for highly cited articles: an article is highly cited if its PR in its group is above 50, meaning that it has received more citations than half of the articles in the same subject category and publishing year. For articles belonging to multiple subject categories, we used the weighted PR (wPR) given in Eq. 1, where sc_i is the ith subject category of the article, n_sc_i is the number of articles in this subject category, and PR_sc_i is the PR of the article in it.
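The normalization can be sketched as follows. Two points are assumptions rather than statements from the text: the PR is computed here as the percentage of articles in the group with strictly fewer citations, and the weighted PR is taken to be the mean of the per-category PRs weighted by the number of articles in each category, which is one reading consistent with the symbols defined for Eq. 1. All citation counts are invented.

```python
def percentile_rank(citations, group_citations):
    """Percentage (0 to 100) of articles in the group with fewer citations."""
    below = sum(1 for c in group_citations if c < citations)
    return 100.0 * below / len(group_citations)


def weighted_percentile_rank(pr_and_size):
    """Article-count-weighted mean of per-category PRs (assumed form of Eq. 1).

    pr_and_size is a list of (PR in category, number of articles in category).
    """
    total = sum(size for _, size in pr_and_size)
    return sum(pr * size for pr, size in pr_and_size) / total


def is_highly_cited(pr, threshold=50.0):
    """An article counts as highly cited if its PR is above the threshold."""
    return pr > threshold


# Invented 2-year citation counts for two subject categories, same year.
category_a = [0, 1, 2, 4, 10]
category_b = [0, 0, 1, 4, 5, 6, 7, 30, 40, 50]

article_citations = 4  # an article assigned to both categories
pr_a = percentile_rank(article_citations, category_a)
pr_b = percentile_rank(article_citations, category_b)
wpr = weighted_percentile_rank(
    [(pr_a, len(category_a)), (pr_b, len(category_b))]
)
```

In this toy case the article is above the median in the small category but below it in the large one, and the larger category dominates the weighted result.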
4.3.2. Correlation analysis
To find the association between OA publishing and any feature defined in Table 2 we conducted a correlation analysis. The first variable in calculating the correlation is OA publishing, a dichotomous variable (a case of categorical variable). To assess the association with field, which is a categorical variable, we selected Cramer’s V coefficient. Cramer’s V is based on the chi-squared test and measures the strength of association between two variables. Its value ranges from 0 (no association) to 1 (complete association). The association with binary variables (OA_agreement, discount_eligible, waiver_eligible, gender) was examined with the phi coefficient (Ekström, 2011). This correlation coefficient ranges from −1 to +1 and shows the strength of the positive or negative correlation between two dichotomous variables. To measure the association with other numerical or continuous variables, we applied the point-biserial correlation coefficient, which is used instead of the Pearson correlation when a variable is dichotomous (LeBlanc & Cox, 2017) and can range from −1 to +1.
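The three coefficients can be illustrated with a small, dependency-free sketch. It relies on two standard identities: the phi coefficient equals the Pearson correlation of two 0/1 variables, and the point-biserial coefficient equals the Pearson correlation of a 0/1 variable with a numeric one. Cramér's V is computed without bias correction. The software actually used in the study is not stated, and the data below are invented.

```python
import math
from collections import Counter


def pearson(xs, ys):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def phi_coefficient(a, b):
    """Phi for two 0/1 variables (identical to Pearson on the 0/1 codes)."""
    return pearson(a, b)


def point_biserial(binary, values):
    """Point-biserial correlation of a 0/1 variable with a numeric variable."""
    return pearson(binary, values)


def cramers_v(xs, ys):
    """Cramér's V from the chi-squared statistic (no bias correction)."""
    n = len(xs)
    joint, rows, cols = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    chi2 = 0.0
    for r in rows:
        for c in cols:
            expected = rows[r] * cols[c] / n
            chi2 += (joint[(r, c)] - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * (min(len(rows), len(cols)) - 1)))


# Invented data for eight articles.
oa = [1, 1, 1, 0, 0, 0, 0, 0]                # publishing model (OA = 1)
waiver = [1, 1, 0, 0, 0, 0, 0, 0]            # binary feature
career_age = [20, 15, 12, 3, 5, 8, 2, 4]     # numeric feature
field = ["Health", "Health", "Life", "Life",
         "Physical", "Physical", "Social", "Social"]  # categorical feature

phi = phi_coefficient(oa, waiver)
r_pb = point_biserial(oa, career_age)
v = cramers_v(oa, field)
```

For a 2 x 2 table, Cramér's V reduces to the absolute value of phi, which gives a convenient consistency check between the two functions.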
4.3.3. Regression analysis
We used multivariate logistic regression to find the relationship between various variables (defined in Table 2) and OA publishing. This is a common method for modeling the relationship between the dichotomous dependent variable and multiple independent variables. It allows us to understand the association of the dependent variable with an independent variable in the presence of other independent variables in the data.
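To make the interpretation concrete, the sketch below fits a logistic regression with two binary predictors on an invented data set and converts a coefficient into an odds ratio. Plain batch gradient ascent stands in for the estimation routine of a statistics package, which the text does not name; the learning rate, number of epochs, and data are illustrative assumptions.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(rows, labels, learning_rate=0.5, epochs=2000):
    """Fit intercept and coefficients by gradient ascent on the log-likelihood.

    Returns a list whose first element is the intercept.
    """
    n, n_features = len(rows), len(rows[0])
    weights = [0.0] * (n_features + 1)
    for _ in range(epochs):
        gradient = [0.0] * (n_features + 1)
        for x, y in zip(rows, labels):
            linear = weights[0] + sum(w * v for w, v in zip(weights[1:], x))
            error = y - sigmoid(linear)
            gradient[0] += error
            for j, value in enumerate(x, start=1):
                gradient[j] += error * value
        weights = [w + learning_rate * g / n for w, g in zip(weights, gradient)]
    return weights


# Invented cells: (OA_agreement, gender, OA articles, CA articles).
cells = [
    (1, 0, 3, 1),
    (1, 1, 3, 1),
    (0, 0, 1, 3),
    (0, 1, 1, 3),
]
rows, labels = [], []
for agreement, gender, n_oa, n_ca in cells:
    rows += [(agreement, gender)] * (n_oa + n_ca)
    labels += [1] * n_oa + [0] * n_ca

weights = fit_logistic(rows, labels)
odds_ratio_agreement = math.exp(weights[1])  # effect of OA_agreement
odds_ratio_gender = math.exp(weights[2])     # effect of gender
```

In this toy data the odds of OA publishing are 3:1 with an agreement and 1:3 without, independent of gender, so the fitted odds ratio for `OA_agreement` approaches 9 while the one for `gender` stays at 1. This is what "association in the presence of other independent variables" means in practice.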
4.3.4. Classification method
We employed a machine learning method to estimate the likelihood of choosing the publishing model. To this end, we categorized the publishing model of articles into two groups, OA and CA. Then, we utilized the value of defined features in Table 2 to predict the publishing model. This process is a classification task in machine learning.
To estimate the publishing model of articles, we use a supervised machine learning method, random forest (RF): a common tool for classification tasks (Behr, Giese et al., 2020; Kumar, Mukhopadhyay et al., 2019; Roy, Chopra et al., 2020; Yamak, Saunier, & Vercouter, 2016). We utilize this tool for binary classification (OA = 1 or CA = 0) and use the features introduced in Table 2 as independent variables. We implement the algorithm for hybrid journals in which authors can choose their paper’s business model. We used a k-fold cross-validation (k = 10) procedure to train and test the model.
Due to the skewed distribution of the target variable (91% CA and 9% OA publishing), we balance the classes by resampling the data via SMOTE (synthetic minority oversampling technique), which has been shown to be a suitable method for handling class imbalance (Spelmen & Porkodi, 2018).
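The resampling idea can be sketched without external libraries. SMOTE creates a synthetic minority sample by picking a minority point, choosing one of its k nearest minority neighbours, and interpolating between the two. Several choices below are assumptions not stated in the text: the folds are stratified, oversampling is applied to the training part of each fold only, k is 5, and the two-dimensional data are randomly generated. The random forest itself is omitted; a comment marks where it would be fitted.

```python
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create synthetic samples by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base_index = rng.randrange(len(minority))
        base = minority[base_index]
        others = [p for i, p in enumerate(minority) if i != base_index]
        # Sort the other minority points by squared distance to the base point.
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position on the segment base -> neighbour
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


def stratified_k_fold(labels, k=10, seed=0):
    """Yield (train indices, test indices), dealing each class across folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        for position, index in enumerate(members):
            folds[position % k].append(index)
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


# Randomly generated toy data with the 91% CA / 9% OA split from the text.
rng = random.Random(42)
samples = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(91)]   # CA = 0
samples += [(rng.gauss(2, 1), rng.gauss(2, 1)) for _ in range(9)]   # OA = 1
labels = [0] * 91 + [1] * 9

balanced_counts = []
for train_idx, test_idx in stratified_k_fold(labels, k=10):
    train_major = [samples[i] for i in train_idx if labels[i] == 0]
    train_minor = [samples[i] for i in train_idx if labels[i] == 1]
    synthetic = smote(train_minor, len(train_major) - len(train_minor))
    balanced_counts.append((len(train_major), len(train_minor) + len(synthetic)))
    # A classifier (a random forest in the study) would be fitted here on
    # train_major + train_minor + synthetic and scored on the test fold.
```

Restricting the oversampling to the training part keeps synthetic points, which are derived from real minority samples, out of the data used for evaluation.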
5. RESULTS
In this section, we first present some descriptive statistics about the publishing model of articles across four country groups and address RQ1. Next, we display their differences in terms of citation impact among different models to answer RQ2. Then we focus on RQ3 and present the correlation coefficient between the publishing model and features defined in Table 2 and multivariate logistic regression to show the relationship between variables. Also, we demonstrate the performance of estimating the publishing model of articles in hybrid journals and the importance of defined features in the estimation task to reveal the influential factors in selecting the OA model for publishing.
5.1. Countries’ Income Level of Corresponding Authors and Their Publishing Model
Figure 4 shows the distribution of articles categorized by publishing model and the country income level of the corresponding authors. Authors with affiliations in countries with the lowest income level and eligible for the APC waiver have the highest proportion of gold OA publications. In contrast to this, authors from lower middle income countries who are eligible for the APC discount have the lowest percentage in gold OA publishing.
Figure 4. Distribution of articles published in journals with three publishing models across four groups of countries. The access status of hybrid articles has been identified from Unpaywall (cases 2 and 3). For case 4 (hybrid, no access status), we could not find hybrid journals’ articles in Unpaywall.
5.2. Countries’ Income Level of Corresponding Authors and Their Citation Impact
Figure 5 shows the ratio of highly cited articles with different publishing models across country groups for the investigated articles. Generally, we observe a higher percentage of highly cited papers for corresponding authors from countries with higher income levels.
Figure 5. Percentage of highly cited papers published in different models. Hybrid Open Access/Closed Access refers to articles published as OA/CA in hybrid journals.
Across all country groups, the ratio of highly cited articles is higher for the gold and hybrid OA models than for the other models. Between these two, the ratio is higher for gold OA articles, which indicates the better citation impact of articles published in gold OA journals. The only exception is countries with a low income level, which have more highly cited papers in the hybrid OA model. Compared with articles in CA journals, CA articles in hybrid journals are more often highly cited, except for countries with a high income level.
5.3. Influential Factors on the Publishing Model
First, we conducted a correlation analysis to find the associations between OA publishing and the features. Table 3 shows the correlation coefficient between the publishing model (1 for open access, 0 otherwise) and the features in Table 2. We separated the data into two sets: set 1 for articles published in OA or CA journals (nonhybrid journals) and set 2 for articles in hybrid journals. Set 1 reveals the association of discount and waiver policies with OA publishing, whereas set 2, in which OA publishing is optional, displays more author-specific factors related to OA publishing.
The weak negative correlation with gender indicates that the tendency toward gold OA publishing is slightly stronger for women than for men, which disagrees with previous findings (Olejniczak & Wilson, 2020; Zhu, 2017). Given that we observed the lowest proportion of OA publishing for countries with a lower middle income level in Figure 4, the negative correlation for discount_eligible (alongside a positive value for waiver_eligible) in Table 3 suggests that the discount policies are insufficient to motivate authors from these countries to publish in gold OA.
Table 4 displays the relationship between the publishing model and the features in Table 3 when all features are considered together in a multivariate logistic regression. The results confirm the signs of the associations found in the correlation analysis, with one exception: discount_eligible has a positive association with the publishing model in the regression, whereas its correlation coefficient is negative. The highest odds ratios for Social Sciences among the fields in Table 4 indicate the highest odds of OA publishing in this field. This field has experienced a dramatic growth of OA journals since 2009 (Liu & Li, 2018).
The strong positive correlation between journal_ranking and the publishing model for the first set suggests that the journal’s rank is the dominant factor in choosing a gold OA journal for publication. Therefore, we estimate the publishing model for articles in set 2 (hybrid journals) to discover feature categories other than journal-specific factors that influence the authors’ decision for an OA option. Moreover, because the OA model is optional in hybrid journals, this set better reveals the characteristics leading to the OA model.
Table 3. Correlation coefficient between independent variables and the target variable. The value of the target equal to 1 (0) means the paper has been published in the OA (CA) model.
Feature | Correlation test | Coefficient, Set 1 (nonhybrid) | Coefficient, Set 2 (hybrid)
---|---|---|---
journal_ranking | Point-biserial | 0.70 | 0.07
journal_APC | Point-biserial | – | 0.10
field | Cramer’s V | 0.69 | 0.09
country_income | Point-biserial | 0.28 | 0.16
OA_agreement | Phi | 0.08 | 0.30
discount_eligible | Phi | −0.08 | –
waiver_eligible | Phi | 0.06 | –
OA_cite | Point-biserial | 0.42 | 0.13
authors_count | Point-biserial | 0.09 | 0.07
gender | Phi | −0.08 | −0.01
age | Point-biserial | −0.08 | 0.02
OA_publish | Point-biserial | 0.46 | 0.41
international_coauthors | Point-biserial | 0.17 | 0.11
Sample size | | 192,498 | 329,913
Table 4. Results of the logistic regression. The target variable is the publishing model, equal to 1 for OA and 0 for CA publishing. The outputs are odds ratios, exp(β); (exp(β) − 1) × 100 gives the percentage change in the odds of OA publishing per unit increase in an independent variable. An odds ratio greater (less) than 1 therefore indicates a positive (negative) association between the variables.
Variable | Set 1: Odds ratio | Set 1: 95% CI | Set 2: Odds ratio | Set 2: 95% CI
---|---|---|---|---
Intercept | 0.002*** (−72.4) | 0.001 to 0.002 | 0.00*** (−87.7) | 0.00 to 0.00
Independent variables | | | |
journal_ranking | 1.98*** (10.38) | 1.74 to 2.25 | 110.7*** (86.5) | 99.5 to 100.23
journal_APC | 1.00*** (8.05) | 1.0001 to 1.0002 | – | –
field | | | |
Health Sciences | reference | reference | reference | reference
Life Sciences | 1.01 (0.31) | 0.94 to 1.08 | 0.67*** (−9.55) | 0.62 to 0.73
Physical Sciences | 0.97 (−0.91) | 0.91 to 1.07 | 0.20*** (−44.29) | 0.18 to 0.21
Social Sciences | 1.90*** (13.81) | 1.73 to 2.08 | 3.49*** (12.2) | 2.86 to 4.27
multiple fields | 1.25*** (8.5) | 1.19 to 1.32 | 3.4*** (30.87) | 3.17 to 3.71
country_income | 1.00*** (33.88) | 1.000 to 1.000 | 1.000*** (16.18) | 1.00 to 1.00
OA_agreement | 14.9*** (65.07) | 13.78 to 16.22 | 0.93 (−0.78) | 0.78 to 1.11
discount_eligible | – | – | 1.7*** (9.17) | 1.52 to 1.90
waiver_eligible | – | – | 20.19*** (5.53) | 8.29 to 77.5
OA_cite | 0.55*** (−12.97) | 0.500 to 0.600 | 1.55*** (8.4) | 1.39 to 1.71
authors_count | 1.003 (0.80) | 0.99 to 1.01 | 1.17*** (33.15) | 1.16 to 1.18
gender | 0.94** (−2.8) | 0.90 to 0.98 | 0.93* (−2.5) | 0.88 to 0.98
age | 1.05*** (29.63) | 1.05 to 1.1.054 | 0.97*** (−15.36) | 0.96 to 0.98
OA_publish | 196.79*** (105.65) | 178.46 to 217.09 | 23.86*** (50.58) | 21.1 to 26.99
international_coauthors | 1.17*** (18.21) | 1.15 to 1.19 | 1.03 (1.34) | 0.99 to 1.06
McFadden’s pseudo R² | 0.25 | | 0.60 |
Sample size | 96,674 | | 162,773 |
Significance: *p < 0.05, **p < 0.01, ***p < 0.001. z-values of coefficients in parentheses. CI: Confidence interval.
Table 5 shows the performance of the RF classifier for the second set (hybrid journals). Figure 6 displays the permutation importance of the features employed to predict the publishing model for this set. The permutation importance of a feature is the decrease in model performance when that feature’s values are randomly shuffled while the values of the other predictors remain unchanged. A higher value for a feature indicates more predictive power in the proposed model.
The highest importance values for country_income and age in Figure 6 indicate that the most predictive factors in selecting an OA model are the income level of countries and seniority. The lowest value, for the variable gender, indicates that gender has a lower impact on the authors’ decision for the OA model than the other factors. OA_agreement is one of the weakest features in predicting the publishing model, and the correlation analysis also shows a weak correlation between them. One possible reason for the weak effect is that only 2.3% of papers have been involved in transformative agreements. In addition, the income level of countries is the most important feature, and given the positive correlation of this feature with OA publishing, authors from high-income countries are more likely to publish in the OA model even without a transformative agreement. This may also attenuate the association of the agreement with OA publishing.
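The permutation importance procedure described above can be sketched as follows. To keep the example self-contained, the trained random forest is replaced by a hypothetical fixed rule (`toy_model`) that uses only the first feature; the data, the model, and the function names are assumptions made for illustration and are not the study's code.

```python
import random

def accuracy(model, X, y):
    """Share of rows for which the model's prediction matches the label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when one feature column is shuffled and all
    other columns are left unchanged, computed for each feature in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [value] + row[j + 1:]
                      for row, value in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Toy data (invented): feature 0 determines the label, feature 1 is noise.
data_rng = random.Random(42)
X = [[i / 20, data_rng.random()] for i in range(20)]
y = [1 if row[0] > 0.5 else 0 for row in X]

def toy_model(row):
    """Stand-in for a fitted classifier; it ignores feature 1 entirely."""
    return 1 if row[0] > 0.5 else 0

importances = permutation_importance(toy_model, X, y)
print(importances)
```

Shuffling the noise feature leaves the predictions untouched, so its importance is exactly zero, whereas shuffling the informative feature lowers accuracy. scikit-learn provides the same procedure as `sklearn.inspection.permutation_importance`; whether that function was used for Figure 6 is not stated in the text.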
Table 5. Performance of predicting the publishing model of papers with the random forest method.
Metric | OA | CA
---|---|---
Precision | 0.85 | 0.94
Recall | 0.95 | 0.83
F1 score | 0.89 | 0.88
Accuracy (overall) | 0.92 |
Figure 6. Permutation importance of features employed to predict the publishing model of papers with the random forest method for the articles published in hybrid journals.
6. CONCLUSION AND DISCUSSION
This work presents a detailed study of the relationship between author-specific and structural factors (e.g., income level of authors’ affiliation country), OA publishing, and OA citation advantage. First, we investigated the relationship between the income level of countries and OA publishing for articles published by Springer Nature in the years 2017 and 2018. We found that authors from lower middle income countries, who are eligible for APC discounts, have a lower proportion of gold OA publications among all their papers published by this publisher than authors from other countries. This suggests that the discounted APC is still too high for these authors to pay for a gold OA model, and it agrees with the statement of Rouhi et al. (2022), who pointed out that waivers and discounts could not bring author equity in reading and publishing. In contrast, the proportion of gold OA publications for authors from countries with a low income level, who receive APC waivers, is higher than for authors from other countries. This result conflicts with the study by Smith et al. (2021), which found a lower proportion of OA papers published by Elsevier for these countries than for others. The reason could be stricter waiver eligibility conditions at that publisher.
We examined the citation impact of these articles and compared the percentage of highly cited papers among the publishing models and the income levels of the corresponding authors’ countries. For all countries, the OA model, in gold OA or hybrid journals, has the highest percentage of highly cited papers. The results also demonstrate a higher proportion of highly cited articles for countries with higher income levels. Although this suggests a higher citation impact for OA models, it may result from confounding factors such as self-selection and quality biases (Gargouri, Hajjem et al., 2010). Examining the effect of preprints and green OA publishing (where the article has been published in the CA model, but a free version is available in a repository outside the publisher’s website) would also lead to more accurate analyses (Fraser et al., 2020; Wang, Glänzel, & Chen, 2020).
We conducted correlation, regression, and machine learning analyses to find further characteristics (e.g., of the author, journal, and paper) related to OA publishing. The correlation analysis showed the strength and direction of the correlation between the publishing model and every feature defined in Table 2. Using regression analysis, we examined the association of each factor while accounting for the other factors. The results reinforced the correlation outcomes. The only conflict between these two methods was that discount_eligible correlated negatively with OA publishing in the correlation analysis, whereas its association was positive in the regression. In addition, we estimated the publishing model of articles (OA or CA) using an RF-based machine learning approach and examined the impact of each feature on the estimation task. The results show that the country’s income and the authors’ greater experience with OA than with CA publishing are the most influential factors in estimating the publishing model. We found that the tendency toward OA publishing was slightly higher for women, but gender was less important than the other features in estimating the OA model.
7. LIMITATIONS AND FUTURE WORK
One obvious limitation of this study is that we included articles from just one publisher, Springer Nature. Authors’ publishing behavior may differ for other publishers, which limits the generalizability of our results.
We obtained the access status of journals in 2019 from the list published on Springer Nature’s website (and, likewise, the article-level access status from Unpaywall). Some journals may have flipped from CA to OA (Momeni et al., 2021) or vice versa without our detecting it, which may introduce errors in the results. Furthermore, we did not verify the correctness of the external data (Springer Nature and Unpaywall), and the accuracy of these data affects the precision of the results. We identified the gender of 49% of authors and removed 49% of articles without gender status for the corresponding authors from the regression and machine learning analyses. In addition, 2% of the data were removed because of null values in other features (e.g., journals’ APC). Because the gender detection approach does not work well for Asian names, especially Chinese ones, these authors are underrepresented among those with gender status in the data set, which also creates biases in our analyses.
For future work, other publishers could be considered to examine how differing APC policies among publishers affect OA publishing. Controlling for the articles’ language in the analyses is another direction for future studies. Springer Nature is an international publisher that publishes mostly articles in English, so articles in other languages are underrepresented in this study. Considering other publishers with non-English content, together with the articles’ language, may reveal the role of language in international OA publishing and citation advantages.
AUTHOR CONTRIBUTIONS
Fakhri Momeni: Conceptualization, Formal analysis, Investigation, Methodology, Resources, Software, Validation, Visualization, Writing—original draft, Writing—review & editing. Kristin Biesenbender: Conceptualization, Resources, Writing—review & editing. Philipp Mayr: Funding acquisition, Project administration, Writing—review & editing. Stefan Dietze: Methodology, Supervision, Writing—review & editing. Isabella Peters: Funding acquisition, Project administration, Supervision, Writing—review & editing.
COMPETING INTERESTS
The authors have no competing interests.
DATA AVAILABILITY
The data set analyzed during the current study and code are available at https://github.com/momenifi/open_access_springer_nature.git.
FUNDING INFORMATION
This work is financially supported by BMBF project OASE, grant number 01PU17005A. We acknowledge the support of the German Competence Center for Bibliometrics (grant: 01PQ17001) for maintaining the data set used for the analyses.
Notes
The in-house Scopus database maintained by the German Competence Centre for Bibliometrics (Scopus-KB), 2021 version.
Authors from Armenia, Azerbaijan, Georgia, Kazakhstan, Russia, and Turkey, which belong to both Asia and Europe, are not included in this list.
REFERENCES
Author notes
Handling Editor: Ludo Waltman