Textual analysis of artificial intelligence manuscripts reveals features associated with peer review outcome

We analysed a dataset of scientific manuscripts that were submitted to various conferences in artificial intelligence. We performed a combination of semantic, lexical and psycholinguistic analyses of the full text of the manuscripts and compared them with the outcome of the peer review process. We found that accepted manuscripts scored lower than rejected manuscripts on two indicators of readability, and that they also used more scientific and artificial intelligence jargon. We also found that accepted manuscripts were written with words that are less frequent, that are acquired at an older age, and that are more abstract than rejected manuscripts. The analysis of references included in the manuscripts revealed that the subset of accepted submissions were more likely to cite the same publications. This finding was echoed by pairwise comparisons of the word content of the manuscripts (i.e. an indicator of semantic similarity), which were more similar in the subset of accepted manuscripts. Finally, we predicted the peer review outcome of manuscripts from their word content: words related to machine learning and neural networks were positively related to acceptance, whereas words related to logic, symbolic processing and knowledge-based systems were negatively related to acceptance.


Introduction
arXiv:1911.02648v2 [cs.DL] 3 Mar 2020
Peer review is a fundamental component of the scientific enterprise and acts as one of the main sources of quality control of the scientific literature [1]. The primary form of peer review occurs before publication [2] and it is often considered as a stamp of approval from the scientific community [3,4]. Peer-reviewed publications have a considerable weight in the attribution of research and academic resources [5,6,7].
One of the main concerns about peer review is its lack of reliability [8,9,10]. Most studies on the topic find that agreement between reviewers is barely greater than chance [11,12,13], which highlights the considerable amount of subjectivity involved in the process. This leaves room for many potential sources of bias, which have been reported in several studies [14,15,16]. A potential silver lining is that the process appears to have some validity. For instance, articles accepted at a general medicine journal [17] and at journals in the domain of ecology [18] were cited more than the rejected articles published elsewhere, and the process appears to improve the quality of manuscripts, although marginally [19,20,21]. It is therefore surprising that a process that has little empirical support for its effectiveness, but a lot of evidence of its downsides [22], has so much importance.
The vast majority of studies on peer review have focused on the relationship between the socio-demographic attributes of the actors involved in the process and its outcome [23]. Comparatively little research has focused on the association between the content of the manuscripts and the peer review process. This isn't surprising given that there are few publicly available datasets of manuscripts annotated as rejected or accepted, and whenever they are made available to researchers it is usually through smaller samples designed to answer specific questions. Another factor contributing to this gap in the literature is that it is more time-consuming to analyse textual data (either the referees' reports or the reviewed manuscript) than papers' metadata. However, the increasing popularity of open access [24,25] allows for greater access to the full text of scientific manuscripts.
By scraping the content of arXiv, one of those repositories, [26] developed a new method to identify manuscripts that were accepted at conferences after the peer review process based on submissions around the time of major NLP, machine learning (ML) and artificial intelligence (AI) conferences. These preprints were then matched with manuscripts that were published at the target venues as a way to determine whether they were accepted or "probably-rejected". In addition, the manuscripts and peer-review outcomes were collected from conferences that agreed to share their data. [26] were able to achieve decent accuracy at predicting the acceptance of the manuscripts in their dataset. Other groups were able to obtain good performance at predicting paper acceptance with different machine learning models based on the text of the manuscripts [27], sentiment analysis of referees' reports [28], or the evaluation score given by the reviewers [29].
In this manuscript, we take advantage of the full text access to those manuscripts and explore linguistic and semantic features that correlate with the peer review outcome. Such features are relevant to two types of biases that could be involved in the peer review process: language and content bias. In the language bias, authors who aren't native English speakers could receive more negative evaluations due to the linguistic level of their manuscripts [30,31,32]. Our understanding of the extent to which such a bias could play a role in research evaluation is still limited, which is worrying given the increasingly globalized scientific system that relies on one language: English [33]. In terms of content bias, innovative and unorthodox methods are less likely to be judged favourably [14]. This type of bias is also quite likely to play a role in fields that are dominated by a few mainstream approaches such as AI [34]. Conservatism in this field could impede the emergence of breakthrough or novel techniques that don't fit with the current trends.
In this manuscript, we address both types of biases by comparing the textual data (title, abstract and introduction) of the manuscripts. We first used two readability metrics (the Flesch Reading Ease (FRE) and the New Dale-Chall Readability (NDC) Formula), as well as some indicators of scientific jargon content, and found that manuscripts that were less readable and used more jargon were more likely to get accepted.
Accepted and rejected manuscripts were compared on their psycholinguistic and lexical attributes and we found that accepted manuscripts used words that were more abstract, less frequent and acquired at a later age compared to rejected manuscripts. We then compared manuscripts on their word content and their referencing patterns through bibliographic coupling, and found that the subset of accepted manuscripts were semantically closer than rejected manuscripts. Finally, we used the word content of the manuscripts to predict their acceptance, and found that specific topics were associated with greater odds of acceptance.

Manuscript data
We used the publicly available PeerRead dataset [26] to analyse the semantic and lexical differences between accepted and rejected submissions to some natural language processing, artificial intelligence and machine learning conferences. We therefore used content from six platforms archived in the PeerRead dataset: three arXiv sub-repositories tagged by subject including submissions from 2007 to 2017 (AI: artificial intelligence, CL: computation and language, LG: machine learning), as well as submissions to three

Semantic similarity
The textual data of each article (title, abstract and introduction) were cleaned by making all words lowercase and eliminating punctuation, single-character words and common stopwords. For all analyses except the readability, scientific jargon and psycholinguistic matching, the stem of each word was extracted using the Porter algorithm [35]. We used the Term Frequency-Inverse Document Frequency (tf-idf) algorithm to create vectorial representations based on the field of interest (title, abstract or introduction). We then used those vectors to compute the cosine similarity between pairs of documents.
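The tf-idf and cosine similarity steps described above can be sketched as follows. This is a minimal illustration and not the authors' code: it assumes the documents are already tokenised, cleaned and stemmed, and it uses the plain tf × log(N/df) weighting, whereas the weighting variant used in the study is not stated in the text. The function names are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse tf-idf vectors from a list of token lists.

    Assumes tokens are already lowercased, stopword-filtered and stemmed.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Plain tf * log(N / df); libraries often use smoothed variants.
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

With this weighting, a term that occurs in every document receives a weight of zero, so two documents are only considered similar when they share terms that are not ubiquitous in the collection.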

Reference matching
In order to obtain each manuscript's bibliographic coupling, we developed a reference matching algorithm because the reference format was not standardized across manuscripts. We used four conditions to group references together: (1) they were published in the same year; (2) they had the same number of authors; (3) they had a similarity score above 0.7 (empirically determined after manual inspection of matching results) with a fuzzy matching procedure (Token Set Ratio function from the FuzzyWuzzy python library, https://github.com/seatgeek/fuzzywuzzy) on the authors' names; and (4) they met the same similarity criterion on the article's title.
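The four matching conditions can be sketched as below. To keep the example self-contained, it swaps in a standard-library similarity (difflib on sorted tokens) in place of FuzzyWuzzy's Token Set Ratio, so its scores will not be identical to those of the study. The reference dictionaries with "year", "authors" and "title" keys are a hypothetical parsed format.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Stand-in similarity in [0, 1]; NOT FuzzyWuzzy's Token Set Ratio."""
    # Sort the lowercased tokens so that word order does not matter.
    norm_a = " ".join(sorted(a.lower().split()))
    norm_b = " ".join(sorted(b.lower().split()))
    return SequenceMatcher(None, norm_a, norm_b).ratio()

def same_reference(ref_a, ref_b, threshold=0.7):
    """Apply the four grouping conditions to two parsed references."""
    return (
        ref_a["year"] == ref_b["year"]                        # condition 1
        and len(ref_a["authors"]) == len(ref_b["authors"])    # condition 2
        and similarity(" ".join(ref_a["authors"]),
                       " ".join(ref_b["authors"])) > threshold  # condition 3
        and similarity(ref_a["title"], ref_b["title"]) > threshold  # condition 4
    )
```

The cheap exact checks (year, author count) come first so that the string comparisons only run on plausible candidates.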

Psycholinguistic and readability variables
For word frequency estimation, we used the SUBTLEX US corpus [36], taking the logarithm of the estimated word frequency + 1. For concreteness, we used the dataset of [37], which provides concreteness ratings for 40,000 commonly known English words. For age of acquisition, we used the ratings of [38] for 30,000 English words.
We used the readability functions as implemented in [39]. We used the Flesch Reading Ease (FRE; [40,41]) and the New Dale-Chall Readability Formula (NDC; [42]). The FRE is calculated based on the number of syllables per word and the number of words per sentence. The NDC is based on the number of words per sentence and the proportion of difficult words, that is, words that are not part of a list of "common words".
We also included two sources of jargon developed by [39]. The first consists of science-specific common words, which are words used by scientists that are not in the NDC's list of common words. The other is the general science jargon, which consists of words that are frequently used in science but aren't specific to science (see [39] for methods). Finally, we compiled a list of AI jargon from three online glossaries (https://developers.google.com/machine-learning/glossary/, http://www.wildml.com/deep-learning-glossary/ and https://en.wikipedia.org/wiki/Glossary_of_artificial_intelligence).
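For reference, the two readability scores can be written as short functions of simple counts. The constants below are those of the commonly published forms of the Flesch Reading Ease and New Dale-Chall formulas; it is an assumption that the implementation of [39] used in this study matches them exactly. Counting syllables, sentences and difficult words is left out, since the text does not specify how it was done.

```python
def flesch_reading_ease(n_words, n_sentences, n_syllables):
    """Flesch Reading Ease: higher score = more readable."""
    words_per_sentence = n_words / n_sentences
    syllables_per_word = n_syllables / n_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word

def new_dale_chall(n_words, n_sentences, n_difficult_words):
    """New Dale-Chall score: higher score = less readable.

    n_difficult_words counts words absent from the list of common words.
    """
    pct_difficult = 100.0 * n_difficult_words / n_words
    score = 0.1579 * pct_difficult + 0.0496 * (n_words / n_sentences)
    # Adjustment applied when more than 5% of the words are difficult.
    if pct_difficult > 5.0:
        score += 3.6365
    return score
```

The opposite signs explain why the two indicators move in opposite directions in the results: longer sentences and longer or rarer words lower the FRE but raise the NDC.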

Data analysis
Because of the exploratory nature of the study and the large size of the datasets, null hypothesis significance testing has many shortcomings [43]. In some cases, we performed statistical analysis of the results and reported the p-value, but those results should be interpreted carefully. Our analyses rely on the effect size, as well as the cross-validated effects on the independent subsets of the PeerRead dataset (manuscripts from different venues and online repositories). All error bars represent the standard error of the mean.

Identification of geographic location
We searched through the email addresses of the authors to distinguish research conducted within the United States (US) from research conducted outside the US. We considered a manuscript as US-based if at least one author had an email address that ended with ".edu".
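The rule above amounts to a one-line check. The sketch below is illustrative only; the whitespace stripping and lowercasing are assumptions added for robustness, and the function name is hypothetical.

```python
def is_us_based(author_emails):
    """Return True if at least one author address ends with '.edu'."""
    return any(
        email.strip().lower().endswith(".edu") for email in author_emails
    )
```

Note that this heuristic treats addresses such as "name@uni.edu.au" as non-US, since they do not end with ".edu".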

Code and data availability
The custom python scripts and the data used to generate the results of this manuscript can be found at https://github.com/lamvin/PeerReviewAI.git.

Readability
The readability of scientific articles has been steadily declining in the last century [39]. One possible explanation for this is that writing more complex sentences and using more scientific jargon increase the likelihood that a manuscript will get accepted at peer review. To investigate this hypothesis, we used two measures of readability on our data: the Flesch Reading Ease (FRE) and the New Dale-Chall Readability Formula (NDC). FRE scores decrease as the number of syllables per word and the number of words per sentence increase. NDC scores increase as the number of words in each sentence and the proportion of difficult words (words that are not present in the NDC list of common words) increase. We also included the proportion of words from the science-specific common words list and the general science jargon list (both constructed by [39]). In order to control for potential demographic confounds, we divided our dataset into two categories: manuscripts from within and from outside the United States (US).
We found that both indicators of readability were correlated with the peer review outcome. FRE (higher score = more readable) was lower for accepted manuscripts, while NDC (higher score = less readable) was higher for accepted manuscripts (Fig. 1). This was the case for both US and non-US manuscripts (with no sizeable differences), for every section of the manuscript (title, abstract and introduction) and the effect was replicated within most platforms, except for the introduction with the FRE indicator.
As [39] reported that the proportion of scientific jargon has increased over the last century, we also wondered if the peer review process would reflect this effect. Using a list of general and specific scientific jargon, we found that manuscripts containing higher ratios of jargon were associated with higher acceptance rates. However, our list of science jargon was biased towards content from the life sciences. To confirm the relevance of these results to our dataset, we generated an AI jargon list (see Methods). Using this new list, we found a robust effect across platforms and document sections, where a larger proportion of AI jargon predicted greater odds of acceptance for the manuscripts.
The replication of our results independently for US and non-US based manuscripts suggests that the effect is not driven by geographic location. Statistical analyses are summarised in Tables 6 and 7.

Lexical correlates of peer review outcome
We then investigated the differences between accepted and rejected submissions based on lexical and psycholinguistic attributes. Given that we haven't found consistent differences between US and non-US based submissions, we pooled all manuscripts together for the rest of the analysis. We used the number of tokens (total number of words in a document) as well as two measures of lexical diversity: the number of types (unique words in a document) and the ratio between types and tokens (Type-Token Ratio, TTR). We also used three psycholinguistic variables: the age of acquisition (AOA), concreteness and frequency (on a logarithmic scale). We computed the average values of those psycholinguistic variables on all types and all tokens.
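The lexical measures and the two ways of averaging a psycholinguistic variable (over tokens and over types) can be illustrated with a short sketch. The function names and the dictionary-based rating lookup are hypothetical, and skipping words that have no rating is an assumption, since the text does not say how unrated words were handled.

```python
def lexical_measures(tokens):
    """Number of tokens, number of types, and Type-Token Ratio."""
    types = set(tokens)
    return {
        "n_tokens": len(tokens),
        "n_types": len(types),
        "ttr": len(types) / len(tokens) if tokens else 0.0,
    }

def mean_rating(words, ratings):
    """Average a rating (e.g. AOA or concreteness) over the rated words.

    Pass the token list for a token-based average, or set(tokens) for a
    type-based average. Words without a rating are skipped (assumption).
    """
    rated = [ratings[w] for w in words if w in ratings]
    return sum(rated) / len(rated) if rated else float("nan")
```

A token-based average weights each word by how often it is used in the document, whereas a type-based average counts each distinct word once.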
We found consistent effects between the psycholinguistic variables and across platforms and sections, with few exceptions. Words used in accepted manuscripts were less frequent, acquired later in life and more abstract than in rejected manuscripts on average (Fig. 2). The effects were consistent across all platforms except ICLR (which is smaller than the other ones).
Interestingly, we found that shorter titles, abstracts and introductions were all associated with higher acceptance rates. Unsurprisingly, this translated into a bias towards manuscripts with fewer total types (for every section). However, when taking the ratio between the two (TTR, an indicator of lexical richness), we found that this variable was positively associated with manuscript acceptance. Results from the statistical analysis are summarised in Table 8.
We then looked at how similar the accepted manuscripts were compared to the rejected ones based on their semantic content. First, we looked at the similarity of their title, abstract or introduction based on a tf-idf representation of their word content. Second, we looked at their degree of bibliographic coupling. We only compared pairs of manuscripts that shared at least one common reference for the next analysis (see
As the two approaches quantify the content similarity of the documents, we wanted to verify whether those two metrics measured different aspects of the document content. It was previously reported that there is a moderate correlation between the two measures in the field of economics [44]. We correlated the semantic distance with the bibliographic coupling of the documents submitted to each platform. We used a semantic distance metric based on the cosine similarity between the tf-idf representations of each document, as well as both the citation intersection (# common references) and the Jaccard similarity coefficient (# references in common / # references in total) as measures of bibliographic coupling. We found a moderate correlation (Pearson r between 0.20 and 0.40) between both measures of bibliographic coupling and semantic distance when pooling all platforms together, with the exact value depending on which section of the manuscript was compared (Fig. 3). This suggests that those two measures aren't redundant features of semantic content, and that they might capture different aspects of it. This also validates our algorithm for citation disambiguation, as comparable correlations between bibliographic coupling and textual similarity were reported in [44].
Figure 3: Correlation between semantic similarity and bibliographic coupling.

Bibliographic coupling and peer review outcome
We then looked at how accepted and rejected manuscripts differed based on the characteristics of their cited references (bibliographic coupling). We compared all pairs of manuscripts on the two indicators of bibliographic coupling (intersection and Jaccard index). Each pair of manuscripts was categorized as one of the following: "accepted": the two submissions were accepted, "rejected": the two submissions were rejected, and "mixed", one document was rejected and the other was accepted.
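The two coupling indicators and the three pair categories can be sketched as follows. The sketch assumes that each manuscript's references have already been matched and reduced to a set of identifiers; the function names are hypothetical.

```python
def coupling(refs_a, refs_b):
    """Return (intersection size, Jaccard index) for two reference sets."""
    common = refs_a & refs_b
    union = refs_a | refs_b
    jaccard = len(common) / len(union) if union else 0.0
    return len(common), jaccard

def pair_type(accepted_a, accepted_b):
    """Categorise a pair of manuscripts by their peer review outcomes."""
    if accepted_a and accepted_b:
        return "accepted"
    if not accepted_a and not accepted_b:
        return "rejected"
    return "mixed"
```

The intersection grows with the length of the reference lists, while the Jaccard index normalises by the total number of distinct references, which is one reason the two indicators can behave differently.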
We found that accepted manuscripts had more references in common (Fig. 4) than the two other categories of manuscripts. The effect was slightly weaker for the Jaccard similarity (intersection over union of citations) and less consistent across platforms than the intersection. However, both metrics account for about 0.2% of the variance (All platforms, Jaccard: 0.228% and intersection: 0.21%).

Figure 4: Bibliographic coupling between accepted and rejected manuscripts. Panels: Intersection (average # common citations) and Similarity (average proportion of common citations).

Semantic similarity and peer review outcome
Having established that semantic similarity and bibliographic coupling capture different aspects of the relationship between documents, we also analysed the semantic similarity of the documents from the four platforms. Thus, for each platform we computed the tf-idf distance between all pairs of documents based on their word stems.
Overall, we found that accepted manuscripts were more similar to each other than rejected manuscripts based on their abstracts and introductions (Fig. 5). We found a stronger effect for the similarity between introductions (R² = 0.01) than between abstracts (R² = 0.006) when all platforms were pooled together. In other words, accepted pairs of manuscripts were more similar to each other compared to the other two pair types.
This analysis of the semantic similarity of documents (for both citations and text) showed some high-level trends based on whether or not the manuscripts were accepted after peer review. We therefore next examined the text content of the manuscripts with a more detailed approach to gain more insight into the patterns uncovered by the analysis of bibliographic coupling and textual similarity.

Figure 5: Semantic similarity between accepted and rejected manuscripts. Panels: Introduction, Abstract and Title (axis: similarity).

Words as a predictor of acceptance
Finally, after having established high-level associations between the rejected and accepted manuscripts, we attempted to predict the peer review outcome with a logistic regression using a bag-of-words approach.
Overall, the model was fairly successful at predicting the peer review outcome on a 10-fold cross-validated dataset (Tables 3,4 & 5, random performance ∼ 0.5 for all three metrics). The model was the most successful when the text of the introduction was used, followed by the text of the abstract and of the title.
After having established that we could predict to some extent the outcome of the peer review process with the word content of the manuscripts, we performed a more detailed analysis to try to get some insight about the key predictors of the outcome. We therefore computed the average tf-idf score of each stem for accepted and rejected manuscripts, and obtained a measure of "importance" based on the difference between the two averages. This approach allowed us to identify the most important keywords predicting the peer review outcome.
Table 5: Introduction based prediction performance (macro averaging).
Although some differences were noticeable across platforms regarding the predictors of acceptance (Tables 9, 10 & 11) and rejection (Tables 12, 13 & 14), some robust patterns emerged. Word stems related to subfields of neural networks and machine learning (e.g., learn, neural, gradient, train) increased the odds that a manuscript would be accepted. In contrast, word stems related to the subfields of logic, symbolic processing and knowledge representation (e.g., use, base, system, logic, fuzzi, knowledg, rule) decreased the odds that a manuscript would be accepted.
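The tf-idf "importance" measure described above (the difference between the average tf-idf score of a stem in accepted and in rejected manuscripts) can be sketched as follows. The direction of the difference (accepted minus rejected) and the treatment of a missing stem as a score of zero are assumptions, and the function name is hypothetical.

```python
from collections import defaultdict

def stem_importance(tfidf_docs, accepted):
    """Mean tf-idf in accepted documents minus mean tf-idf in rejected ones.

    tfidf_docs: list of {stem: tf-idf score} dicts, one per manuscript.
    accepted:   list of booleans of the same length.
    """
    sums = {True: defaultdict(float), False: defaultdict(float)}
    counts = {True: 0, False: 0}
    for vec, label in zip(tfidf_docs, accepted):
        counts[bool(label)] += 1
        for stem, score in vec.items():
            sums[bool(label)][stem] += score
    stems = set(sums[True]) | set(sums[False])
    # A stem absent from a document contributes 0 to that group's mean.
    return {
        s: sums[True].get(s, 0.0) / max(counts[True], 1)
        - sums[False].get(s, 0.0) / max(counts[False], 1)
        for s in stems
    }
```

Under these assumptions, positive values mark stems that are more prominent in accepted manuscripts and negative values mark stems that are more prominent in rejected ones.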
Figure 6: Most important word stems for predicting peer review outcome. Panels: Introduction and Title.

Summary of results
From a linguistic point of view, our results suggest that accepted manuscripts could be written in a more complex, less accessible English. Using two indices of readability, one of which is agnostic to the word content of the manuscript (FRE), we found that the accepted manuscripts obtained lower readability scores.
Strikingly, we found the same effect for almost all our independent datasets. The same pattern was also observed for the title, the abstract and the introduction. Using a different type of readability indicator (the proportion of general, specific scientific and AI jargon words), we found that manuscripts that contained a greater proportion of jargon words were more likely to be accepted. This finding may partly explain why the readability of manuscripts has steadily declined during the last century [39]. In other words, it is possible that part of this decline could be attributed to a selection process taking place during peer review.
When considering the word content of the accepted manuscripts, we found that they had words that were acquired at a later age, that were more abstract and that were less common than the words from the rejected manuscripts. Additionally, these manuscripts were shorter and had increased lexical diversity. Once again, the effect sizes were small given the highly multivariate determination of the peer review outcome. The effects were replicated across multiple independent datasets from different fields in AI, which strengthens the conclusions of our analysis.
From a content point of view, we compared manuscripts based on their referencing patterns and word content. We compared the coupling based on both the raw number of common references (intersection) and the fraction of overlap between the manuscripts' references (Jaccard similarity). We found that accepted pairs had a larger intersection than other pairs, and found a similar, but less reliable effect for the similarity. We used a tf-idf vectorial representation of the text from all manuscripts in the database, compared all possible pairs of manuscripts, and found that pairs of accepted manuscripts had considerably more overlap between their word content. This high-level analysis of the manuscripts' content revealed that some topics might be associated with different odds of acceptance. We performed a correlation between the bibliographic coupling and the semantic similarity to get an idea of how independent the information provided by these two semantic indicators was. As reported previously [44], we found a weak to moderate correlation between the two, which suggests that they provide distinct sources of information about topic similarity in accepted manuscripts.
Finally, we built a logistic regression to predict the peer review outcome, which revealed that using the title, abstract or introduction words led to robust predictions. Our results are compatible with the presence of content bias, where trending topics in AI such as machine learning and neural networks were linked with greater acceptance rates, whereas words related to symbolic processing and knowledge-based reasoning were linked with lower acceptance rates.

Implications
Taken together, our analyses of the linguistic aspects of manuscripts are consistent with linguistic biases during peer review. It has been reported that writers using English as their main language (L1) use words that are more abstract and less frequent than writers with English as their second language (L2) [45]. Additionally, this effect is exacerbated by the L2 proficiency (where larger differences are observed for beginners than advanced L2 speakers) [46]. The complexity of L2 writing was also shown to correlate with proficiency [47,48,49]. Our results are therefore compatible with the hypothesis that L2 writers are less likely to get their manuscript accepted at peer review.
Our results are also compatible with a content bias where manuscripts on the topic of machine learning are favoured in the domain of AI [50,51,52]. Recent successes of deep learning and neural networks might explain their dominance in the field, but a bias against other techniques might impede developments similar to the ones that led to the breakthroughs underlying the deep learning revolution [53]. Following this idea, several researchers have indicated that symbolic processing could hold the answer to shortcomings of deep learning [54,50,55].
Other than biases during peer review, our results have implications for the quality of scientific communication. A recent report on reproducibility in machine learning found a positive correlation between the readability of a manuscript and successful replication of the claims that it makes [56]. Selecting for less readable manuscripts during peer review may therefore increase the proportion of non-reproducible research findings.

Limitations
Although the main objective of our analysis was to investigate the presence of content or linguistic biases in peer review, all of our analyses are correlational, and there are possible confounds that could explain our results. For instance, while we found that some linguistic aspects of the manuscripts (the readability and psycholinguistic attributes) were correlated with the peer review outcome, we cannot infer that those variables directly affected the odds of acceptance.
Similarly, we cannot infer that there is a bias in favour of manuscripts on the topic of machine learning techniques and neural networks. For instance, reviewers favouring high benchmark performance might accept more manuscripts using state-of-the-art techniques. In this scenario, reviewers would not reject a manuscript using a non-mainstream technique because of bias against it, but simply because they value some aspects where it underperforms.
Another limitation to our findings is the methodology of the PeerRead dataset [26]. For most manuscripts included in the dataset, their status is inferred and the true outcome of the peer review process is unknown.
Although [26] validated their method on a subset of their data, the accuracy is not perfect. However, we believe that the large size of the dataset is enough to counteract this source of noise. Only a minority of manuscripts included in their dataset had a true peer review outcome provided by the publishing venue.
This highlights the need for publishers and conferences to open their peer review process in order to further advance our understanding of the strengths and limitations of the peer review process.
In sum, our results are compatible with the presence of a linguistic and a content bias in the peer review process of major conferences in AI. Although we were able to replicate our results across different datasets, similar studies have to be conducted both in the field of AI and in other disciplines to validate the conclusions of our study.