Can the quality of published academic journal articles be assessed with machine learning?

Abstract Formal assessments of the quality of the research produced by departments and universities are now conducted by many countries to monitor achievements and allocate performance-related funding. These evaluations are hugely time consuming if conducted by postpublication peer review and are simplistic if based on citations or journal impact factors. I investigate whether machine learning could help reduce the burden of peer review by using citations and metadata to learn how to score articles from a sample assessed by peer review. An experiment is used to underpin the discussion, attempting to predict journal citation thirds, as a proxy for article quality scores, for all Scopus narrow fields from 2014 to 2020. The results show that these proxy quality thirds can be predicted with above baseline accuracy in all 326 narrow fields, with Gradient Boosting Classifier, Random Forest Classifier, or Multinomial Naïve Bayes being the most accurate in nearly all cases. Nevertheless, the results partly leverage journal writing styles and topics, which are unwanted for some practical applications and cause substantial shifts in average scores between countries and between institutions within a country. There may be scope for predicting articles’ scores when the predictions have the highest probability.


INTRODUCTION
Higher education has become increasingly monitored and managed by national states over the past half century (Amaral, Meek et al., 2003). As part of this, countries typically have competitive systems in place to fund academic research. In addition to project-based funding, some nations also now directly reward research quality through systematic procedures to assess this at the departmental level (e.g., Poland: Kulczycki, Korzeń, and Korytkowski (2017); but not Germany: Hinze, Butler et al. (2019)). The UK's Research Excellence Framework (REF) uses expert postpublication peer review to evaluate the outputs of academic researchers at approximately the departmental level every 6 or 7 years, combining the results with evaluations of case studies and institutional environments (Wilsdon, Allen et al., 2015a). New Zealand's Performance Based Research Fund (Buckle & Creedy, 2019) and Excellence in Research Australia (Hinze et al. 2019) are similar peer review schemes. Italy also has a performance-based funding scheme that includes evaluating the quality of researchers' outputs, albeit exempting journal articles meeting bibliometric thresholds (Franceschini & Maisano, 2017).
Large-scale post publication peer review is hugely time consuming, employing many researchers (over 1,000 in the United Kingdom) to make postpublication quality assessments on the outputs. It is therefore logical to investigate whether the peer review part of this process could be streamlined in any way, such as by automation. In the UK, academics in 2019 believed a n o p e n a c c e s s j o u r n a l Citation: Thelwall, M. (2022). Can the quality of published academic journal articles be assessed with machine learning? Quantitative Science Studies, 3(1), 208-226. https://doi.org/10.1162 /qss_a_00185 that technology would be used to enhance research assessment in the future (Parks, Rodriguez-Rincon et al., 2019), perhaps thinking of this. In November 2021, the four UK higher education funding bodies published a call for a systematic review of the potential for technology to support research assessment, and particularly the labor-intensive REF (Gov.uk, 2021). Motivated by this, the current article uses a dummy automated peer review exercise to underpin a discussion of the potential for artificial intelligence to replace peer review.
Few previous studies have attempted to automatically score the quality of academic research. This is presumably because only aggregate scores are made public by national evaluation exercises, so there is no output-level data to leverage to build effective algorithms. Nevertheless, some have attempted to find indicators that correlate with overall university quality profiles, such as hyperlinks to university websites (Thelwall, 2002) or citation-based indicators (Traag & Waltman, 2019). One report by the team organizing the UK REF has analyzed the raw output scores, however, finding moderate correlations with citation-based indicators and altmetrics (Wilsdon, Allen et al., 2015b). The task of predicting long-term citations for articles is related because citation counts in some fields are approximate indicators of scientific impact. Investigations of this possibility, often using regression rather than machine learning, have found a range of article metadata factors to associate with higher citation counts. These include the number of authors, the number of countries in the author team, the readability of the abstract, and keyword repetition (Hall, Vogel et al., 2018;Lei & Yan, 2016;Li, Zhao et al., 2019;McCannon, 2019;Sohrabi & Iraj, 2017;Stegehuis, Litvak, & Waltman, 2015). Text mining has rarely been used to predict citations but has been used to detect plagiarism (Foltýnek, Meuschke, & Gipp, 2019) or statistical errors (Nuijten & Polanin, 2020) and to investigate topics in fields (Heo, Kang et al., 2017), to identify research trends (Kim & Delen, 2018;Nie & Sun, 2017), to map science (Chen, 2017), and to predict journal or conference reviewing decisions (Checco, Bracciale et al., 2021;Thelwall, Papas et al., 2020).
A few papers have used machine learning approaches to predict article citations, using methods including Support Vector Machines (Fu & Aliferis, 2010), k-Nearest Neighbors, and Bagging (Wang, Jiao et al., 2020), Stochastic Gradient Descent, Random Forest, XGBoost classifier, LGBoost classifier (Klemiński, Kazienko, & Kajdanowicz, 2021), Decision Trees (Su, 2020), and CART (Yuan, Tang et al., 2018). Different deep learning architectures have been proposed and tested on small full text article sets, apparently convenience samples of collections of papers with full text online. These include long-term physics article sets (Zhao & Feng, 2022), library, information, and documentation articles (Ruan, Zhu et al., 2020), historical Markov chain articles (Xu, Li et al., 2019), two computational linguistics conferences , and five prestigious journals (using annual citations rather than full text: Abrishami & Aliakbary, 2019). A comparison of deep learning with other approaches found Support Vector Machines to be the most accurate for computer science publications (Zhu & Ban, 2018). Almost all previous machine learning studies have focused on a single topic or a small set of fields. The only exception, which took a science-wide approach, sampled 12,374 random Web of Science articles from 2015, ignoring field classifications rather than comparing between fields (Akella, Alhoori et al., 2021). Also, no previous study seems to have used a development set that is separate from the training/testing sets, risking overfitting (the current article also does not use separate training/test sets but reports separate scores for 326 fields and does not customize the methods for any field). The machine learning studies so far have not given comparative information about the relative attractiveness of machine learning for different fields of science, and none have addressed the issue of scientific quality estimation.
The research goal for this article is to investigate whether it is possible to assess the quality of published academic journal articles with machine learning, but the research questions are only indirectly related to this and hence need justification. Because there are no large-scale sources of postpublication quality control scores for academic articles (with the partial exception of biomedical science: Mohammadi & Thelwall, 2013), a proxy source of quality is used. For this article, the citation rate (defined in detail below) of the publishing journal is used as the quality proxy. This is a poor proxy but is used in the absence of a better one. Journal impact factors of various types are widely recognized and are considered to be indicators of scientific quality to some extent by some researchers, varying between countries and fields. This is because in some fields, citations are accepted as (very approximate) indicators of scientific worth and so journals attracting better articles tend to be more cited. UK evidence suggests that journal citation rates correlate moderately positively with the quality of the articles that they publish in the medical and physical sciences and economics, weakly in engineering and social sciences, and negatively or not at all in the arts and humanities (Wilsdon et al., 2015b ,  Table A18). Moreover, in some fields, Journal Impact Factors correlate with researchers' opinions of journals (Haddawy, Hassan et al., 2016;Serenko & Bontis, 2021;Serenko & Dohan, 2011), although not in others (Maier, 2006). In fields where this logic is accepted, there is a tendency for it to become truer over time because there is more competition to be published in journals with higher impact factors. On the other hand, citations are irrelevant in some fields and impact factors reflect journal specialisms to some extent. Thus, in this article, average journal citation rate is used as an approximate indicator of article quality, accepting that in some fields it is irrelevant to quality. Articles are split into three groups by journal citation rate, mirroring the UK REF, where articles are allocated a weighting of 0, 0.25 or 1 for qualityrelated funding (https://re.ukri.org/funding/quality-related-research-funding/). The specific research questions are therefore as follows.
1. How accurately can machine learning identify the journal impact third of published journal articles from other metadata in different fields and years? 2. Which textual features are most powerful at detecting the journal impact third of published journal articles in different fields and years? 3. Do the machine learning results have systematic biases against any genders, countries or institutions?

METHODS
The research design was to gather a reasonably comprehensive sample of academic journal articles, allocate journal impact-based thirds to the articles, and apply machine learning to detect the probable journal third of each article. The same parameters (sample size, machine learning method, feature set size) were applied to each field so that the results could be compared.

Data
Scopus was chosen as the source of journal articles for its wide coverage of academic literature and fine-grained field classification scheme. The Web of Science could have been used but Scopus is slightly larger, giving more data. Only documents of type journal article were included to give consistency. The UK REF excludes review articles, so these were not included.
All Scopus documents of type journal article with publication years between 2014 and 2020 were downloaded from Scopus in January 2021 using its API. The years 2014 to 2020 were chosen to mimic REF2021, and January 2021 citation data is appropriate because it would be available at the start of the original assessment period (although the start was delayed due to the COVID-19 pandemic). The quality of the citation data thus varies between years, with 2014 citation data being relatively mature and 2020 citation data being poor quality due to December 2020 articles having almost no time to attract citations, whereas January 2020 article had about a year. This is taken into account in the discussion but not in the methods.
Journal thirds were identified for each Scopus narrow field with a multistage approach. First, the Normalized Log-transformed Citation Score (NLCS) (Thelwall, 2017) was calculated for each article. This uses log transformation to reduce the skewing of citation count data so that the result is not dominated by a small number of highly cited articles. The score for an article is 1 if its citation rate is the world average, with scores above 1 indicating a greater citation rate than the world average and scores below 1 indicating fewer citations than the world average. These scores are normalized within the Scopus narrow field in which an article is classified (or the average of all fields for articles in multiple fields). Second, the arithmetic mean of the NLCS for all articles in each journal was calculated as its average citation score, called here JMNLCS ( Journal Mean NLCS). This differs from the Journal Impact Factor ( JIF) in that the average is calculated from articles in a single year, includes all citations to date, and uses NLCS instead of raw citation counts. This should be less affected by skewing than JIFs and should be more relevant to articles from the year with the data used for the calculation. Finally, JMNLCS thresholds were calculated to split the articles into approximately equal thirds. In some cases, this was not possible due to single very large journals dominating categories and the split generated approximate halves instead of thirds.

Features Analyzed
The machine learning stage requires a set of data about the articles to predict from. Because journal citation rates are used as a quality proxy for articles, they cannot also be used as inputs for the machine learning process. Also, because the purpose of quality control is to assess individuals or institutions, it is inappropriate to include these as inputs. The following features were included.
• NLCS for each article: This is a citation-based indicator, normalized for fields and year to be comparable between articles. The log transformation reduces skewing, which may make the feature more powerful for learning with linear-based algorithms. • Number of authors: Articles with more authors are likely to be more cited in many fields.
• Number of country affiliations: Articles with authors from more countries are likely to be more cited in many fields. • Word unigrams, bigrams, and trigrams: The quality of an article is presumably encoded in its text and figures. While full-text analysis is impractical and likely to confuse an algorithm with many irrelevant details from a paper, the title, abstract, and keywords may be helpful as a succinct summary. These were therefore extracted and added as features. Individual words and short phrases of two or three words were extracted, with those occurring only once in a field being discarded.
Abstracts were preprocessed to remove standard texts, such as publisher copyright statements and structured abstract headings. A large set of heuristics had been developed to remove these for previous automated text analyses of abstracts (Fairclough & Thelwall, 2022;Thelwall & Nevill, 2021) and these were reused for the current paper. These heuristics vary from generic (e.g., remove the first or last sentence if the first character is a copyright symbol or the first word is Copyright; remove the phrase, All rights reserved.) to publisherspecific (e.g., remove the first abstract sentence if it contains Elsevier and a copyright symbol; remove the first sentence if it contains Maney & Son Ltd starting in the first 20 characters). The heuristics included a list of common structured abstract headings, such as "Results:" and "PAR-TICIPANTS". Articles with abstracts with fewer than 500 characters after this stage were removed. This standardizes the machine learning task by ignoring articles without abstracts or with relatively trivial abstracts.
Each field and year combination formed a separate data set for training and evaluation. There were 330 nonempty fields per year, on average, over 7 years, so this gave 2,310 data sets to analyze. The smaller sets had too few articles to analyze, however, so the final number of field/year combinations analyzed was slightly less. The total number of articles analyzed was 31,273,062, varying between 3,846,106 in 2014 and 5,694,904 in 2020. This counts articles multiple times when they occur in multiple narrow fields but excludes articles with short or no abstracts. Exact numbers for each field and year are in the online supplement (column B of worksheet "Acc aboveAll

Machine Learning
There are many different machine learning algorithms and all have advantages and disadvantages, so there is not an obvious candidate for the machine learning task. Twenty classification or regression algorithms were compared (Table 1), as implemented in the standard scientific machine learning system scikit-learn on Python with their default settings. These include three that are general-purpose and accurate on a wide range of tasks: Support Vector Machines (Linear Support Vector Classification here), Gradient Boosting Classifier, and Random Forest Classifier. Two were discarded for the full testing (see table footnotes). All the classifiers were run a second time as ordinal classifiers by classifying two separate two-class problems: 1 vs (2 and 3) and (1 and 2) vs. 3, giving the result 1 from the first problem, 3 from the second task, and otherwise 2. Any cases classified as both 1 and 3 were instead classed as 2. This procedure takes into account the ordering of the classifications, so should, in theory, be superior to both classification (unordered) and sometimes regression (when it assumes a linear relationship).
Feature reduction (i.e., selecting a subset of the inputs to feed into each algorithm) was performed using the chi-square method, except forcing the citation, author, and country information to be kept. Tests with a range of training set sizes and feature set sizes suggested that the performance of the algorithms increases as either or both increases, so there was not an optimal choice for either one. As a compromise, 1,000 features and 1,000 articles for training were selected as large enough to be close to the optimal accuracy without slowing the algorithms too much. Fields were trained on 90% of the articles or 1,000 articles (whichever was the smaller) and evaluated on the remainder. The algorithms were trained and evaluated on 30 separate random test/train splits (rather than, for example, 30-fold cross-validation, because the training set size needs to be fixed) and the average accuracy reported. The predictions from the first iteration on each field data set were saved for further analysis of the individual predictions.

Analysis
Overall accuracy statistics (precision) were calculated separately for each field/year combination as the average accuracy of the 30 algorithm iterations on the evaluation sets. Recall and F1 measure were not calculated because the small number of classes (three) means that they give little extra information, and they are in any case subsumed in the score-based tests of the influence of the results, discussed below. For ease of comparison between fields, the main statistic reported is the level of accuracy above the baseline (the percentage of articles in the most common class). This is fairer than comparing accuracy between fields, because some fields have substantially higher baseline accuracies than others.
To assess the influence of the machine learning on the overall scores of countries, institutions, and two genders, for each field, the weighted average true score ( JMNLCS thirds) and machine learning predicted scores were compared. Within each country, the results were compared between institutions and male/female first author genders. Any differences suggest a machine learning bias (accidental or systematic) towards or away from the group in question. First author genders were assigned by checking their first name against a list of country-based gendered first names from Gender-API.com, allocating a gender only when the probability of a correct assignment was above 95%. These tests were reported for the main three machine learning methods only, to avoid reporting low-value information.
To identify the types of term with the greatest discriminatory power in the machine learning, chi-square tests were conducted on all terms used for the machine learning stage in each field ** Inaccurate and slow in all tests with 1,000 features and so was not use for the full testing. and the top term selected for all 314 Scopus narrow fields in 2014 with three categories. These terms were informally investigated using the Key Word In Context (KWIC) approach by identifying their most common single context in the field that caused their high chi-square values.

Comparison of Methods and Years
The single most accurate method was gbc 46% of the time, followed by rfc (45%) and mnb (3%). The mnb method was rarely accurate for the early years but was relatively more accurate on the 2020 data. Overall, gbc and rfc had similar levels of accuracy, but all were substantially more accurate than all 30 other methods, on average. The regression classifiers had relatively poor accuracy, and the ordinal versions of classifiers surprisingly tended to be less accurate overall than the standard versions ( Figure 1). Accuracy is generally highest in 2014 and substantially lower in 2020 than in other years, with 2019 also being lower than 2014-2018. For 2014, this confirms that citations are useful in helping to predict journal thirds. Accuracy is presumably lower in 2019 and 2020 because early published articles have a substantial citation advantage over late published articles due to longer citation windows. This tends to confirm that citations are less valuable for newer articles. Another possible explanation is that the journal thirds are less coherent for 2020 because the citation data has had less time to mature.

Fields with Highest and Lowest Relative Accuracy
The Scopus narrow fields in 2014 with the highest accuracy for any machine learning method tended to be small (fewer than 1,000 articles) for "miscellaneous" (mixed) fields, or both ( Table 2). Small fields may be easier to predict because they contain fewer journals, so prediction is a simpler problem. For example, Review and Exam Preparation had three journals: one in the top third (Clinical Teacher), two in the mid third ( Journal for Nurses in Professional Development and Journal of Continuing Education in Nursing), and none in the bottom third.
Miscellaneous fields may be easier to predict because they contain journals from relatively different topics, so making journal-related predictions from text may be easier (because abstracts would contain more distinctive terms for each journal). For example, Veterinary (misc.) includes relatively different titles, such as Brazilian Journal of Veterinary Pathology, Journal of Fish Diseases, and Parasite. The top chi-square words/phrases for this narrow field suggested another cause, however. They were "opinion," "opinion on," "opinion on the," "scientific," "scientific opinion," and "scientific opinion on," which all originated primarily from the titles of articles in the EFSA Journal (e.g., https://doi.org/10.2903/j.efsa.2014.3573). This is a publication of the European Food Standards Agency that published its outputs, which seem to be conclusions of expert scientific committees after deliberation. It is not a peer reviewed academic journal, even though it publishes expert scientific outputs. The standardization of title phrases has made its articles easily identified by machine learning. No similar standard phrases were discovered for the other three miscellaneous categories, although some near- top terms (sixth to ninth highest) from Health Professions (misc.) were from an unusual structured abstract phrase, "Conclusions and implications for practice," in the Psychiatric Rehabilitation Journal that had not been filtered out.
The Scopus narrow fields with the lowest accuracy attainable with the 32 methods include three general fields with "all" in their title, four humanities fields, and two mathematical fields. General "all" fields may contain journals with relatively similar or general scopes that are difficult to detect through word frequency analyses because their abstract texts tend to contain similar words. Humanities fields may contain many small specialist journals with highly diverse topics, complicating the machine learning problem (Table 3). Mathematics journals, in contrast, may be jargon-dense, with little overlap between articles in the terminologies used for the relatively specialist topic addressed in each one. As citations have little relevance to mathematics and the humanities, the citation data may also not be useful in these fields.
The high accuracy set for 2020 has four overlaps with the 2014 set and a similar pattern of mainly small or miscellaneous narrow fields (Table 4). The top chi-square terms for the highest scoring narrow field with at least 1,000 articles, Small Animals, suggest a combination of animal-specificity and incompletely cleaned structured abstracts. The top and third terms are "and relevance" and "relevance", from the nonstandard structured abstract phrase "Conclusions and relevance" in the Journal of Feline Medicine and Surgery. This journal also had other nonstandard abstract headings, such as "Case summary" and "Relevance and novel information." The second and fifth terms were animal-specific, "cats" and "in cats," associating with the two feline journals. The high accuracy was also helped by the presence of a journal specializing in reproduction, Theriogenology, which is associated with a set of relatively unique terminology with high chi-square scores, including "embryo," "sperm," "pregnancy," and "oocytes." The low accuracy set for 2020 has three overlaps with the 2014 set and a similar pattern of two general "all" narrow fields, three mathematics fields, and two humanities fields (Table 5). Nevertheless, some other fields do not fit this pattern. In particular, Energy (misc.) is an anomaly. This Scopus field had its top third dominated by a single general journal, Energies (5,219 articles), and the generalities of the topics in this journal make the task of machine learning difficult from text. A similar issue occurred for Fluid Flow and Transfer Processes, with a different single large general top third journal: Applied Sciences (8,396 articles).

Terms with the Highest Chi-Square Value in Each Field
A manual analysis of the terms with the highest chi-square value for each of the 314 Scopus narrow fields from 2014 with three categories (the remainder had two) revealed three main contexts (Table 6). In 13% of cases, the term most discriminating between journal thirds, at least in terms of the highest chi-square value, originated from journal mandatory text, such as structured abstract headings or (presumably) mandated keywords. While the initial data cleaning was designed to remove all structured abstract headings, many rare structured headings had not been removed. In theory, these could be removed with additional data filtering steps, although this is time-consuming.
Unsurprisingly, in almost half (44%) of all narrow fields checked from 2014, topic-related terms were the most discriminatory between journal thirds. This category includes some methods-related terms (e.g., "p" [-value], "setting participants") that may primarily differentiate between empirical and conceptual papers, but this dichotomy was not explored due to the difficulty in making this distinction. Some of the topic words also specified a geographic location that may be secondary to the main topic of a paper (e.g., "Romania") for categories with nationally focused journals. The commonness of topic terms is unsurprising because almost all journals have topic specializations, although generalist journals might span the entire scope of a Scopus narrow field. Of course, a topic term can be discriminatory if only one journal in a Scopus narrow field has a narrow scope, as its topic terms will associate with its journal third. Thus, it seems likely that some topic terms are discriminatory in all Scopus narrow fields, even though they are the top terms in under a half.
Perhaps more surprisingly, stylistic devices are the top discriminatory terms in 43% of all Scopus narrow fields. These terms may be optional custom and practice followed by authors in some journals. Conversely, some stylistic terms might be mandated by journals, such as the use of the active voice or first-person plural "we" rather that the passive voice when describing methods. Journals might also suggest or give examples of phrases that might be useful for authors to include in their abstract (e.g., "in this study we show") to ensure that key points are not omitted.

Prediction Accuracy by Gender, Country, and Institution
If the predicted scores are compared to the actual scores for each article separately for male and female first authors, it is possible to detect whether the prediction algorithms indirectly favor one of these two genders compared to the other (Figure 2). The results do not show universal patterns. The predictions favor females in the United States, Japan, and Brazil but males in Germany and France. The gender advantage varies between method for the other five of the The term associates with a topic or method. "for nursing management," "early childhood," "brain injury," "fixed point," "librarians," "Romania," "setting participants," "urban," "consumers," "electrochemical," "education," "energy," "p," "painter," "vaccine," "wastewater," "wound" Style 134 (43%) The term is a stylistic device, whether optional or journal mandated.
10 countries with the most gendered first authors. No method seems to systematically favor one of the two genders. The effects are relatively large, however, accounting for a gender shift of up to 4%.
Changing the proxy scores with predictions would have an even more substantial impact on the overall scores of individual institutions (Figure 3). Taking the 10 largest institutional affiliation addresses in the United Kingdom as an example (not merging different affiliations for the same overall university), machine learning could introduce a 7% shift in relative score  between institutions. Because the scores translate into money, this would mean a relative 7% shift in research funding. There is a tendency for the Random Forest Classifier to make prediction gains and the others to make prediction losses for these institutions, but this is not relevant because the funding is shared from a fixed amount for the United Kingdom. The swing is largest for mnb (7%), followed by rfc (5%) and gbc (3%). As the ultimate goal of the REF is to allocate funding to institutions on the basis of the scores, gbc has a substantial advantage over the other two algorithms.
It is relevant to examine international differences in the effect of replacing scores with predictions, in case international organizations, such as the European Union, adopt this approach. The results show that the predictions have an enormous impact on the relative scores of countries, with some increasing substantially and others decreasing (Figure 4). Although based on an artificial experiment, these figures suggest that international comparisons with machine learning would be highly problematic.

DISCUSSION
The results are limited by the coverage of Scopus and its categorization of articles into fields primarily at the journal level, which is not optimal (Klavans & Boyack, 2017). Although a large set of machine learning algorithms have been tested, different results may have been gained from others, including an appropriate deep learning framework. Different training set sizes and feature set sizes may also change the results. Similarly, more accurate predictions could be expected if additional features had been included, such as author-level career achievements and citing-cited document information. The results are also limited by the incomplete removal of journal boilerplate text, although this seemed to influence a minority of fields. The JMNLCS is equal to the NLCS for journals with a single article in a year, giving an unfair advantage, although this did not seem to be common. The lack of a development set to select the model to use for each field is also likely to have resulted in slight overestimation of the accuracy achievable with machine learning models in Tables 2 and 4 reporting the highest accuracy of any method, although it should not affect the scores for the individual methods (e.g., Figure 1). Finally, from an interpretation perspective, recall that associating average citation levels with the quality of the articles in them is inaccurate in all fields. While there may be a moderate statistical association between article quality and journal citation rates in some fields (e.g., Biological Sciences, Clinical Medicine, Economics) there is a weak or even a negative association in others (e.g., the arts and humanities, some social sciences) (Wilsdon et al., 2015b, Table A18).
Compared to previous investigations of machine learning for citation prediction, this study evaluates the most different algorithms and analyzes the most different separate fields. It also has a different target to all previous studies (predicting journal thirds rather than citation counts or citation percentiles), so the results are not directly comparable. Nevertheless, the results confirm the relative accuracy of Support Vector Machines (Zhu & Ban, 2018) and Random Forest and Gradient Boosting Classifiers (Klemiński et al., 2021) for citation-related tasks. In contrast, Multinomial Naïve Bayes is suggested here for the first time as the most accurate algorithm for a minority of narrow fields.
The results show that machine learning can predict the citation-based journal third of articles in all Scopus narrow fields based on its citations and article metadata (excluding journalrelated information). While the prediction accuracy tends to be higher for older papers (2014), the same is also true in the worst case for 2020 articles, with citation information collected at the end of the publication year (i.e., January 2021 for articles published in 2020). One implicit factor that text mining machine learning studies can exploit is the topic of papers (Chen & Zhang, 2015): By learning highly cited topics, they can predict how often a paper is likely to be cited from its topic. This factor may help to explain the above-chance predictions for all Scopus narrow fields in 2020, but the predictions can also leverage natural variations between journals in topics (including methods and contexts) and writing styles. Thus, the predictions may be based on topic rather than citations. The substantially greater accuracy for older years suggests that citation factors are important, however, so the predictions tend to be more successful when they can leverage citation-related factors.

Prediction Accuracy for Individual Documents
The accuracy of the predictions for each narrow field can be increased by focusing on a subset of articles for which the algorithm reports a higher probability of a correct prediction, as follows. Some of the algorithms (including the top three) report a probability that each document falls within each class. The documents for which the probability for one class is much higher than for the other two classes tend to have a higher probability of the prediction being correct than average. If the documents predicted are arranged in descending order of this difference (highest class probability minus second-highest class probability) then a subset of documents can be identified with relatively high prediction accuracy. This would allow an accuracy threshold to be set, accepting the machine learning results for documents falling above the threshold and using an additional round of human reviewer evaluation for the remaining documents. The proportion of articles that can be predicted with a high level of accuracy varies between fields, however. Taking the materials science narrow fields as an example, 40% of the articles in both Materials Science (all) and Ceramics and Composites can have their classes predicted with above 90% accuracy, compared to 5% for Materials Science (misc.) and 2% for Electrical, Optical and Magnetic Materials ( Figure 5).

Prediction Accuracy Without Text
Text factors can be excluded from the machine learning inputs to reduce dependency on article topics, leaving three factors: NLCS, number of authors, and number of countries. This reduces the median accuracy for most algorithms (Figure 6, compared to Figure 1) and years, especially for 2020. This reduced number of inputs especially reduces the accuracy of Multinomial Naïve Bayes and the Random Forest Classifier. The leading partial exception is the Gradient Boosting Classifier in 2019 and 2018, although it is not clear why it performs relatively well in these two years. The Gradient Boosting Classifier and its ordinal variant are still considerably more accurate than the remaining algorithms on this reduced set of inputs.

Implications for Post Peer Review Score Prediction
Returning to the motivating goal of this paper, the results give some insights into the potentials and limitations of estimating the quality of an academic article from citation counts and metadata available at publication time, including title, abstract and keyword text and the number of authors and countries. Author-related factors (e.g., h-index) and journal-related factors (e.g., JIF) were not included because they were potentially inappropriate for this type of exercise, where the focus is on evaluating the quality of individual outputs, irrespective of publishing platform or context. Using publishing journal thirds as approximate proxies for article quality (e.g., an article is assumed to be more likely to be high quality if it is in a journal in the top citation third than if it is in a journal in the other two citation thirds), the results suggest that article quality prediction with machine learning is possible to some extent for all fields. Nevertheless, the results are not convincing because the algorithms partly leverage topic and style. More importantly, the fact that the algorithms leverage both topic and style suggests that both have strong associations with journals. Algorithms directly learning article quality from human reviewer scores (rather than using journal impact as a proxy for article quality, as used here) are therefore likely to indirectly learn which journals predominantly publish from one quality category (high, medium, low). As journals can be learned from article styles or topics, algorithms can predict article quality thirds at least partly from the publishing journal. In the UK REF this is a potential cause for concern because of the explicit instructions to reviewers to ignore publication venues when allocating quality scores. Of course, the human reviewers may consciously or subconsciously leverage similar factors to the machine learning algorithms when making their decisions.
In addition to the journal-related factors discussed above, leveraging topic when predicting article quality (rather than journal third) may be an unwanted characteristic for machine learning for another reason. It would reward weak articles on topics with generally strong results and penalize strong articles in weak research areas (e.g., debunking the weak research or surpassing it by a quantum increase in quality). If topic-related information is excluded then machine learning accuracy is reduced (Figure 6), so this issue needs careful consideration.
In terms of the potential for machine learning to introduce biases, the country-level results strongly suggest that international comparisons based on machine learning are inappropriate due to the potential for substantial shifts in scores caused by predictions. There is also a small Figure 6. Average (from 30 tries) accuracy above baseline (on a scale of 0 = baseline to 1 = 100% accurate) across the 326 Scopus narrow fields for 29 different machine learning methods (excluding three slow and inaccurate methods compared to Figure 1). Each has a training set of 1,000 articles, using three features (NLCS, authors, countries), and evaluated on the remaining articles. gender effect of prediction, but varying in size and direction between countries. This may be a second-order effect of gender differences in research topics and methods (Thelwall, Bailey et al., 2019). Most worryingly for the United Kingdom, the results suggest that replacing some or all human peer review scores with machine learning predictions could result in substantial shifts between institutions in the allocation of the block funding grant based on output scores (Figure 3). If the accurate classifier giving the least variation, gbc, is used, then this amounts to 3% in the worst case for the 10 largest institutions, which represents a substantial amount of money. For institutions with fewer articles, the shift can be larger, such as the 18.5% shift from Kingston University (277 articles in 2014; loss of 9.7% from predictions) to Edge Hill University (113 articles in 2014; gain of 8.8% from predictions). Although these figures are based on an artificial task, they illustrate the potential for machine learning to systematically skew results for or against institutions even in the absence of institutional and author career information.

CONCLUSIONS
The results suggest that journal citation-based thirds can be predicted with above baseline accuracy in all Scopus narrow fields, even at the end of the year of publication. They also show that the Gradient Boosting Classifier or Random Forest Classifier are the most accurate from the set tested in almost all fields, with Multinomial Naïve Bayes being the most accurate in a minority. Deep learning methods were not tested. The results also show that machine learning can leverage topics and writing styles, which can associate with journals. Thus, even if all journal-level information is excluded from article quality prediction and all journal boilerplate text is removed, algorithms can still leverage indirect indicators of the publishing journal. This undermines the goal of generating algorithms that predict the quality of an article on its own merits rather than indirectly through its publishing journal. Organizations wishing to evaluate article quality without journal-level influences must therefore investigate and discuss further to consider whether the influences found are substantial enough to rule out machine learning altogether. Organizations should also carefully consider the influence of topic on the predictions, irrespective of publishing journals. As topic seems much easier to detect than quality, machine learning algorithms may tend to predict quality primarily based on topic, which may generate a perverse incentive to focus on high citation topics.
In terms of practical applications of machine learning for article quality prediction for comparisons between countries or institutions, the results seem to rule out its use for international comparisons due to variability between countries in results. Differences in average scores between institutions are also substantially affected by the machine learning methods for the task here, so this aspect needs to be seriously considered with testing on post peer review scores before any such approach can be implemented. The gender differences in results are perhaps less of a concern because they are smaller than institutional differences but should also be considered as a potential unwanted indirect influence on the scores.