ABSTRACT
Nowadays, natural language processing (NLP) is one of the most popular areas of, broadly understood, artificial intelligence. Therefore, every day, new research contributions are posted, for instance, to the arXiv repository. Hence, it is rather difficult to capture the current “state of the field” and, thus, to enter it. This brought the idea of applying state-of-the-art NLP techniques to analyse the NLP-focused literature. As a result, (1) meta-level knowledge, concerning the current state of NLP, has been captured, and (2) a guide to the use of basic NLP tools is provided. It should be noted that all the tools and the dataset described in this contribution are publicly available. Furthermore, the originality of this review lies in its full automation. This allows easy reproducibility, continuation and updating of this research in the future, as new research emerges in the field of NLP.
1. INTRODUCTION
Natural language processing (NLP) is rapidly growing in popularity in a variety of domains, from closely related ones, like semantics [1, 2], linguistics [3, 4] (e.g. inflection [5], phonetics and onomastics [6], automatic text correction [7]) and named entity recognition [8, 9], to distant ones, like bibliometrics [10], cybersecurity [11], quantum mechanics [12, 13], gender studies [14, 15], chemistry [16] or orthodontia [17]. This, among others, brings an opportunity, for early-stage researchers, to enter the area. Since NLP can be applied to many domains and languages, and involves the use of many techniques and approaches, it is important to realize where to start.
This contribution attempts to address this issue, by applying NLP techniques to the analysis of NLP-focused literature. As a result, with a fully automated, systematic, visualization-driven literature analysis, a guide to the state-of-the-art of natural language processing is presented. In this way, two goals are achieved: (1) providing an introduction to NLP for scientists entering the field, and (2) supporting a possible knowledge update for experienced researchers. The main research questions (RQs) considered in this work are:
RQ1: What datasets are considered to be most useful?
RQ2: Which languages, other than English, appear in NLP research?
RQ3: What are the most popular fields and topics in current NLP research?
RQ4: What particular tasks and problems are most often studied?
RQ5: Is the field “homogenous”, or are there easily identifiable “subgroups”?
RQ6: How difficult is it to comprehend the NLP literature?
Taking into account that the proposed approach is, itself, anchored in NLP, this work is also an illustration of how selected standard NLP techniques can be used in practice, and which of them should be used for which purpose. However, it should be made clear that the considerations presented in what follows should be treated as “illustrative examples”, not “strict guidelines”. Moreover, it should be stressed that none of the applied techniques has been optimized to the task (e.g. no hyperparameter tuning has been applied). This is a deliberate choice, as the goal is to provide an overview and “general ideas”, rather than overwhelm the reader with technical details of individual NLP approaches. For technical details, concerning optimization of the mentioned approaches, the reader should consult the referenced literature.
The whole analysis has been performed in Python—a programming language that has been ubiquitous in data science research and projects for years [18, 19, 20, 21, 22, 23]. Python was also chosen for the following reasons:
It provides a heterogeneous environment
It allows use of Jupyter Notebooks①, which allow quick and easy prototyping, testing and code sharing
There exists an abundance of data science libraries②, which allow everything from acquiring the dataset, to visualizing the result
It offers readability and speed in development [24]
The presented analysis follows the order of the research questions. To make the text more readable, readers are introduced to pertinent NLP methods in the context of answering individual questions.
2. DATA AND PREPROCESSING
At the beginning of NLP research, there is always data. This section introduces the dataset consisting of research papers used in this work, and describes how it was preprocessed.
2.1 Data Used in the Research
To adequately represent the domain, and to apply NLP techniques, it is necessary to select an abundant, and well-documented, repository of related texts (stored in a digital format). Moreover, to automate the conducted analysis, and to allow easy reproduction, it is crucial to choose a set of papers that can be easily accessed, e.g. a database with a functional Application Programming Interface (API). Finally, for obvious reasons, open access datasets are the natural targets for NLP-oriented work.
In the context of this work, while there are multiple repositories, which contain NLP-related literature, the best choice turned out to be arXiv (for the papers themselves, and for the metadata it provided), combined with the Semantic Scholar (for the “citation network” and other important metadata; see Section 3.3.1).
Note that other datasets have been considered, but were not selected. Reasons for this decision have been summarized in Table 1.
Database | Reason for inapplicability in this research task
---|---
Google Scholar | Google Scholar does not contain actual data (text, PDF, etc.) of any work—there are only links to other databases. Moreover, performed tests determined that the API (Python “scholarly” library) works well with small queries, but fetching information about thousands of papers results in download rate limits, and temporary IP address blocking. Finally, Google Scholar is criticized, among others, for excessive secrecy [25], biased search algorithms [26], and incorrect citation counts [27].
PubMed | PubMed is mainly focused on medical and biological papers. Therefore, the number of works related to NLP is somewhat limited, and difficult to identify using straightforward approaches.
ResearchGate | There are two main problems with ResearchGate, as seen from the perspective of this work: lack of an easily accessible API, and restrictions on some articles’ availability (a large number of papers have to be requested from authors—and such requests may not be fulfilled, or wait time may be excessive).
Scopus | The Scopus API is not fully open-access, and has restrictions on the number of requests that can be issued within a specific time.
JSTOR | Even though the JSTOR website③ declares that an API exists, the link does not provide any information about it (404 not found).
Microsoft Academic | The Microsoft Academic API is very well documented, but it does not provide true open access (it requires a subscription key). Moreover, it does not contain the actual text of works; mostly metadata.
2.1.1 Dataset Downloading and Filtering
The papers were fetched from arXiv on 26 August 2021. The resulting dataset includes all articles extracted as a result of issuing the query “natural language processing”④. As a result, 4712 articles were retrieved. Two articles were discarded because their PDFs were too complicated for the tools used for the text extraction (1710.10229v1—problems with a chart on page 15; 1803.07136v1—problems with a chart on page 6; see, also, Section 2.2). Even though the query was not bounded by the “time when the article was uploaded to arXiv” parameter, it turned out that a solid majority of the articles had submission dates from the last decade. Specifically, the distribution was as follows:
192 records uploaded before 2010-01-01
243 records from between (including) 2010-01-01 and 2014-12-31
697 records from between (including) 2015-01-01 and 2017-12-31
3580 records uploaded after 2018-01-01
On the basis of this distribution, it was decided that there is no reason to impose time constraints, because the “old” works should not be able to “overshadow” the “newest” literature. Moreover, it was decided that it is worth keeping all available publications, as they might result in additional findings (e.g., concerning the most original works, described in Section 3.7.4).
Finally, all articles not written in English were discarded, reducing the total count to 4576 texts. This decision, while somewhat controversial, was made so that the authors of this contribution could understand the results, and to avoid complex issues related to text translation. However, it is easy to observe that the number of texts not written in English (and stored in arXiv) was relatively small (< 5%). Nevertheless, this leaves open a question: what is the relationship between NLP-related work written in English and that written in other languages? However, addressing this topic is out of scope of this contribution.
2.2 Text Preprocessing
Obviously, the key information about a research contribution is contained in its text. Therefore, the subsequent analysis applied NLP techniques to the texts of the downloaded papers. To do this, the following preprocessing has been applied. The PDFs have been converted to plain text, using pdfminer.six (a Python library⑤). Here, notice that there are several other libraries that can also be used to convert PDF to text. Specifically, the following libraries have been tried: pdfminer⑥, pdftotree⑦, BeautifulSoup⑧. On the basis of the performed tests, pdfminer.six was selected, because it provided the simplest API, produced results that did not have to be further converted (as opposed to, e.g., BeautifulSoup), and performed the fastest conversion.
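For illustration, a minimal conversion with pdfminer.six may be sketched as follows (the file path is hypothetical):

```python
from pdfminer.high_level import extract_text

# convert one downloaded paper (hypothetical path) to plain text
text = extract_text("papers/1810.04805v2.pdf")
print(text[:500])
```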
Use of different text analysis methods may require different preprocessing. Some methods, like keyphrase search, work best when the text is “thoroughly cleaned”; i.e. almost reduced to a “bag of words” [28]. This means that, for instance, words are lemmatized, there is no punctuation, etc. However, some more recent techniques (like text embeddings [29]) can (and should) be trained on a “dirty” text, like Wikipedia [30] dumps⑨ or Common Crawl⑩. Hence, it is necessary to distinguish between (at least) two levels of text cleaning: (A) “delicately cleaned” text (in what follows, called “Stage 1” cleaning), where only parts insignificant to the NLP analysis are removed, and (B) a “very strictly cleaned” text (called “Stage 2” cleaning). Specifically, “Stage 1” cleaning includes removal of:
charts and diagrams improperly converted to text,
arXiv “watermarks”,
references section (which were not needed, since metadata from Semantic Scholar was used),
links, formulas, misconverted characters (e.g. “ff”).
Stage 2 cleaning is applied to the results of Stage 1 cleaning, and consists of the following operations:
All punctuation, numbers and other non-letter characters were removed, leaving only letters.
Tokens tagged as adposition, adverb, conjunction, coordinating conjunction, determiner, interjection, numeral, particle, pronoun, punctuation, subordinating conjunction, symbol, end of line, or space were removed. The parts of speech left after filtering were: verbs, nouns, auxiliaries and “other”. The “other” category is usually assigned to meaningless text, e.g. “asdfgh”. However, such tokens were not deleted, in case the algorithm had detected something that was, in fact, important, e.g. domain-specific shortcuts and abbreviations like CNN, RNN, etc.
Words have been lemmatized.
Note that while individual NLP techniques may require more specific data cleaning, the two (Stage 1 and Stage 2) workflows are generic enough to be successfully applied in the majority of typical NLP applications.
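As an illustration, the core of Stage 2 cleaning may be sketched as follows (using spaCy; whether the original pipeline relied on exactly this library for the part-of-speech filtering is an assumption):

```python
import spacy

# a sketch of Stage 2 cleaning: keep verbs, nouns, auxiliaries and "other"
# (universal POS tags VERB, NOUN, AUX, X), drop everything else, lemmatize
nlp = spacy.load("en_core_web_lg")
KEPT_POS = {"VERB", "NOUN", "AUX", "X"}

def stage2_clean(text: str) -> str:
    doc = nlp(text)
    # is_alpha drops punctuation, numbers and other non-letter tokens
    return " ".join(t.lemma_ for t in doc if t.pos_ in KEPT_POS and t.is_alpha)

print(stage2_clean("The CNN models were trained on 3 large corpora!"))
```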
3. PERFORMED EXPERIMENTS, APPLIED METHODS AND ANALYSIS OF RESULTS
This section traverses research questions RQ1 to RQ6 and summarizes the findings for each one of them. Furthermore, it introduces specific NLP methods used to address each question. Interested readers are invited to study referenced literature to find additional details.
3.1 RQ1: Finding Most Popular Datasets Used in NLP
As noted, a fundamental aspect of all data science projects is the data. Hence, this section summarizes the most popular (open) datasets used in NLP research. Here, the information about these datasets (their names) was extracted from the analyzed texts, using Named Entity Recognition and keyphrase search. Let us briefly summarize these two methods.
3.1.1 Named Entity Recognition (NER)
Named Entity Recognition (NER) can be seen as finding an answer to “the problem of locating and categorizing important nouns, and proper nouns, in a text” [31]. Here, automatic methods should facilitate extraction of, among others, named topics, issues, problems, and other “things” mentioned in texts (e.g. in articles). Hence, the spaCy [32] NER model “en_core_web_lg”⑪ has been used to extract named entities. These entities have been linked by co-occurrence, and visualized as networks (further described in Section 3.4).
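As a minimal sketch (the input sentence is illustrative), entity extraction with this model may look as follows:

```python
import spacy

# load the spaCy model named above and extract entities from a sample sentence
nlp = spacy.load("en_core_web_lg")
doc = nlp("BERT was pretrained on Wikipedia and evaluated on SQuAD.")
print([(ent.text, ent.label_) for ent in doc.ents])
```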
3.1.2 Keyphrase Search
Another simple and effective way of extracting information from text is keyword and/or keyphrase search [34, 35]. This technique can be used not only in preliminary exploratory data analysis (EDA), but also to extract actual and useful findings. Furthermore, keyphrase search is complementary to, and extends, the results of Named Entity Recognition (NER) (Section 3.1.1).
To apply keyphrase search, first, texts were cleaned with Stage 2 cleaning (see Section 2.2). Second, they were converted to phrases (n-grams) of lengths 1-4. Next, two exhaustive lists were created, based on all phrases (n-grams): (a) allowed phrases (609 terms), and (b) banned phrases (1235 terms). The allowed phrases contained words and phrases that were meaningful for natural language processing, or were specific enough to be considered separately, e.g. TF-IDF, accuracy, annotation, NER, taxonomy. The banned phrases contained words and phrases that, on their own, carried no significant meaning for this research, e.g. bad, big, bit, long, power, index, default, as well as some incoherent phrases that slipped through the previous cleaning phases. These lists were used to filter the phrases found in the texts. The obtained results were converted to networks of phrase co-occurrence, to visualize phrase importance, and relations between phrases.
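The core of this step can be sketched as follows (the allowed/banned lists here are toy examples; the real ones contained 609 and 1235 terms, respectively):

```python
def ngrams(tokens, n):
    # contiguous n-grams over a token list
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def find_keyphrases(tokens, allowed, banned):
    found = set()
    for n in range(1, 5):  # phrase lengths 1-4
        for phrase in ngrams(tokens, n):
            if phrase in allowed and phrase not in banned:
                found.add(phrase)
    return found

allowed = {"tf-idf", "ner", "named entity recognition"}
banned = {"big", "long"}
tokens = "we evaluate ner and tf-idf on long texts".split()
print(find_keyphrases(tokens, allowed, banned))  # {'ner', 'tf-idf'}
```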
3.1.3 Approaches to finding names of most popular NLP datasets
Keyword search was used to extract names of NLP datasets used in collected papers. To properly factor out dataset names and omit noise words, two approaches were applied: unsupervised and list-based.
The unsupervised approach included extracting words (proper nouns detected with the Python spaCy⑬ library) in the near neighborhood (at most 3 words before or after) of the words “data”, “dataset” and similar.
3.1.4 Findings Related to RQ1: What are the Most Popular NLP Datasets
This section presents the findings that answer RQ1, i.e. which datasets are most often used in NLP research. To best show which datasets are popular, and to outline which are used together, a heatmap has been created. It is presented in Figure 1. In general, a heatmap provides not only a general ranking of features (looking only at the diagonal), but also information about the correlation of features, or lack thereof.
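Such a co-occurrence heatmap can be built along the following lines (a sketch with toy input; the real input comes from the NER and keyphrase extraction described above):

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# toy input: the set of dataset names mentioned in each paper
papers = [{"Wikipedia", "Twitter"}, {"Twitter", "Facebook"}, {"Wikipedia"}]

names = sorted({d for p in papers for d in p})
idx = {n: i for i, n in enumerate(names)}
co = np.zeros((len(names), len(names)), dtype=int)
for p in papers:
    for a in p:
        for b in p:
            co[idx[a], idx[b]] += 1  # diagonal holds per-dataset paper counts

sns.heatmap(co, xticklabels=names, yticklabels=names, annot=True)
plt.show()
```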
It can be easily seen that the most popular dataset used in NLP is Wikipedia. Among the top 4 most popular datasets, one can also find: Twitter, Facebook, and WordNet. There is a high correlation between the use of datasets extracted from Twitter and Facebook, which are very frequently used together. This is both intuitive and observable in articles dedicated to social network analysis [36], social text sentiment analysis [37], social media mining [38] and other social science related texts [39]. Manual checking also determined that Twitter is extremely popular in sentiment analysis and other emotion-related explorations [40].
3.2 Findings Related to RQ2: What Languages are Studied in NLP Research
The second research question concerned languages that were analyzed in reported research (not the language the paper was written in). This information was mined using the same two methods, i.e. keyphrase search and NER. The results were represented in two ways. The basic method was a co-occurrence heatmap presented in Figure 2.
For clarity, the following is the ranking of top 20 most popular languages, by number of papers in which they have been considered:
English: 2215
Chinese: 809
German: 682
French: 533
Spanish: 416
Arabic: 306
Japanese: 299
Italian: 257
Russian: 239
Czech: 221
Dutch: 209
Latin: 171
Hindi: 166
Portuguese: 154
Turkish: 144
Greek: 133
Korean: 130
Finnish: 125
Swedish: 125
Polish: 98
As visible in Figure 2, the most popular language is English, but this may be caused by the bias of analyzing only English-language-written papers. Next, there is no particular positive, or negative, correlation between languages. However, there are slight negative correlations between Basque and Bengali, Irish and Thai, and Thai and Urdu, which means that these languages are very rarely researched together. There are two observations regarding these languages. (1) All of them are niche and do not have large populations of speakers. (2) All pairs have very distant geographical origins, so there may be low demand for studying them together.
3.3 Findings Related to RQ3: What are the Popular Fields, and Topics, of Research
Let us now discuss the findings related to the most popular fields and topics of the reported research. In order to ascertain them, in addition to keyphrase search and NER, metadata mining and text summarization have been applied. Let us now introduce these methods in some detail.
3.3.1 Metadata Mining
In addition to the information available within the text of a publication, further information can be found in its metadata. For instance, the date of publishing, overall categorization, hierarchical topic assignment and more, as discussed in the next paragraphs.
Therefore, metadata has been fetched both from the original source (the arXiv API) and from the Semantic Scholar⑰ (a minimal fetch sketch follows the list below). As a result, for each retrieved paper, the following information became available for further analysis:
data: title, abstract and PDF,
metadata: authors, arXiv category and publishing date,
citations/references,
topics.
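The sketch below illustrates how such metadata can be fetched for a single paper; it uses the publicly documented arXiv Atom API and the Semantic Scholar Graph API, and the selected fields are assumptions rather than the exact ones used in this study:

```python
import feedparser
import requests

ARXIV_ID = "1810.04805"  # example identifier

# arXiv Atom API: title, submission date, categories
entry = feedparser.parse(
    f"http://export.arxiv.org/api/query?id_list={ARXIV_ID}").entries[0]
print(entry.title, entry.published, [t["term"] for t in entry.tags])

# Semantic Scholar Graph API: citation/reference counts and more
meta = requests.get(
    f"https://api.semanticscholar.org/graph/v1/paper/arXiv:{ARXIV_ID}",
    params={"fields": "title,citationCount,referenceCount"},
    timeout=30).json()
print(meta["citationCount"], meta["referenceCount"])
```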
Note that the Semantic Scholar topics are different from the arXiv categories. The arXiv categories follow a set taxonomy⑱, which is used by the person who uploads the text. On the other hand, the Semantic Scholar “uses machine language techniques to analyze publications and extract topic keywords that balance diversity, relevance, and coverage relative to our corpus.”⑲
The metadata from both sources was complete for all articles (there were no missing fields for any of the papers). Obviously, one cannot guarantee that the information itself was correct. This had to be (and was) assumed, to use this data in further analysis.
3.3.2 Matching Literature to Research Topics
In a literature review, one may analyze all available information. However, it is much faster to initially check if a particular paper's topic is related to one's planned/ongoing research. Both Semantic Scholar and arXiv provide this information in the metadata. Semantic Scholar provides “topics”, while arXiv provides “categories”.
Figure 3 shows (1) which topics are the most popular (see the first column from the left), and (2) the correlation of topics. The measure used in the heatmap (correlation matrix) is the count of articles tagged with both topics (a logarithmic scale has been used).
Obviously, the most popular field of research is “Natural Language Processing”. It is also worth mentioning that Artificial intelligence, Machine Learning and Deep Learning also score high in the article count. This is intuitive, as current applications of NLP are pursued using approaches from, broadly understood, artificial intelligence.
Moreover, the correlation, and high score, between “Deep Learning” and “Artificial Neural Networks” mirrors the influence of BERT and similar models. On the other hand, there are topics, which very rarely coincide. These are, for instance, Parsing and Computer Vision, Convolutional Neural Networks and Machine Translation, Speech Recognition and Sentiment analysis.
There is also one topic worth pointing out: Baseline (configuration management). According to Semantic Scholar, it is defined as “an agreed description of the attributes of a product, at a point in time, which serves as a basis for defining change”⑳. This topic does not fit NLP particularly well, as it is too vague, and it could have been incorrectly assigned by the machine learning algorithm on the backend of Semantic Scholar.
Yet another interesting aspect is the evolution of topics in time, which gives a wider perspective on which topics are rising in, or falling from, popularity. Figure 4 shows the most popular categories in time. The category cs.CL (“Computation and Language”) is dominant in all periods, because it is the main subcategory of NLP. However, multiple interesting observations can be made. First, the categories that are particularly popular nowadays are: cs.LG (Machine Learning), cs.AI (Artificial Intelligence), and cs.CV (Computer Vision and Pattern Recognition). Second, there are categories that experience a drop in interest. These are: stat.ML (Machine Learning) and cs.NE (Neural and Evolutionary Computing).
Moving to the arXiv “categories”, it is important to elaborate on the difference between them and the “topics”. As mentioned, arXiv follows a taxonomy with two levels: the primary category (always a single one) and secondary categories (there may be many).
To best show this relation, as well as categories’ popularity, a treemap chart has been created, which is most suitable for “nested” category visualization. It is shown in Figure 5.
Similarly to the Semantic Scholar “topics”, the largest primary category is cs.CL (Computation and Language), which is the counterpart, in the arXiv nomenclature, of the NLP topic. Its top secondary categories are cs.LG/stat.ML (both categories of Machine Learning) and cs.AI (Artificial Intelligence). This is, again, consistent with previous findings and shows how these domains overlap. It is also worth noting the presence of cs.CV (Computer Vision and Pattern Recognition), which, although to a lesser degree, is also important in the NLP literature. Manual verification shows that, in this context, computer vision refers mostly to image description with text [41], visual question answering [42], using transformer neural networks for image recognition [43, 44], and other image pattern recognition, vaguely related to NLP.
Similarly as for categories, a trend analysis has been performed for topics. It is presented in Figure 6. The most popular topic over time is NLP, followed by Artificial neural network, Experiment, Deep learning, and Machine learning. Here, no particular evolution is noticeable, except for a rise in interest in the Language model topic.
3.3.3 Citations
Another interesting piece of metainformation is the citation count [45, 46]. Hence, this statistic was used to determine key works, which were then used to establish key research topics in NLP (addressing also RQ1-3).
It is well known that, in most cases, the distribution of node degree in a citation network is exponential [47]. Specifically, there are many works with 0-1 citations, and very few with more than 10 citations. In this context, the citation network of the top 10% of most highly cited papers is depicted in Figure 7. The most cited papers are 1810.04805v2 [48] (5760 citations), 1603.04467v2 [49] (2653 citations) and 1606.05250v3 [50] (1789 citations). The first one is the introduction of the BERT model. Here, it is easy to notice that this paper absolutely dominates the network in terms of degree. It is the network's focal point. This means that the whole domain revolves not only around one particular topic, but also around a single paper.
The second paper concerns TensorFlow, the state-of-the-art library for neural network construction and management. The third introduces SQuAD—a text dataset with over 100,000 questions, used for machine learning. It is important to note that these are the top 3 papers not only when considering works published after 2015, but also when the “all time most cited works” are searched for.
How can two papers cite each other? An interesting observation has been made during the citation analysis. Typically, the relation where one paper quotes another should be one-way. In other words, when paper A cites paper B, paper B is a reference for paper A. So the sets of citations and references should be disjoint. This is true for over 95% of works. However, 363 papers have an intersection between citations and references, with the largest intersection containing as many as 10 common positions. Further manual analysis determined that this “anomaly” happens due to the existence of preprints, and all other cases where a paper appeared publicly (e.g. as a Technical Report) and was then revised and cited a different paper. This may happen, for instance, when a paper is criticised and is reprinted (an updated version is created) to address the critique.
3.4 RQ3 Related Findings Based on Application of Keyphrase and Entity Networks
As discussed, NER has been used to determine the NLP datasets and languages analyzed in papers. It can also be used when looking for techniques used in research. However, to better visualize the topics of interest, it can be combined with network analysis. Specifically, work reported in the literature involves many-to-many relations, which provide information about which techniques, methods, problems, languages, etc., are used alone, in tandem or, perhaps, in groups. To properly explore the area, networks with four visual dimensions (see Figures 8 and 9) have been constructed: nodes (entities), node size (scaled by an attribute), edges (relations), and edge width (scaled by an attribute). Moreover, since all networks are exponential and have very high edge density, only the top percentile of entities has been graphically represented (to allow readability). The networks have been built using the networkx [51] and igraph [52] Python libraries.
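A co-occurrence network of this kind can be assembled along the following lines (a sketch with toy input, using networkx):

```python
import itertools
import networkx as nx

# toy input: the set of entities extracted from each paper
entity_sets = [{"BERT", "Wikipedia"}, {"BERT", "Twitter"}, {"BERT", "Wikipedia"}]

G = nx.Graph()
for ents in entity_sets:
    for e in ents:
        # node weight = number of papers mentioning the entity (node size)
        G.add_node(e, weight=G.nodes[e]["weight"] + 1 if e in G else 1)
    for a, b in itertools.combinations(sorted(ents), 2):
        # edge weight = number of co-occurrences (edge width)
        w = G.edges[a, b]["weight"] + 1 if G.has_edge(a, b) else 1
        G.add_edge(a, b, weight=w)

print(G.nodes(data=True), G.edges(data=True))
```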
As shown in Figure 8, the majority of entities are related to models, such as BERT, and neural network architectures (e.g. RNN, CNN). However, the findings show not only NLP-related topics, but all entities. Here, an important warning, regarding the used NER models, should be stated. In most cases, when NER is applied directly, and without additional techniques, the entities are not disambiguated, or properly unified. For instance, surnames, like Kim, Wang, Liang, Liu, Chen, etc., are not properly recognized as names of different persons and are “bagged together”. Therefore, further interpretation of NER results may require manual checking.
Moreover, corroborating an earlier noted result, Wikipedia and Twitter can be observed as the most popular data sources for NLP.
Finally, among the important entities, the Association for Computational Linguistics (also shown as “the Association for Computational Linguistics” and “ACL”㉑) has been found. This society organizes conferences and events, and also runs a journal about natural language processing.
Figure 9 shows very popular named entities, but skips the most often found ones. This has been done to allow other frequent terms to become visible. Specifically, the networks were trimmed by node weight, i.e. the number of papers including the named entity. Figure 9 contains terms between the 99.5 and 99.9 percentiles by node weight. In addition to some previously made observations, new entities appeared, which show what else is of considerable interest in the NLP literature. These are:
GPUs (Graphics Processing Units), which are often used to accelerate neural network training (and use) [53]
WordNet—a semantic network “connecting words” with regard to their meaning [54]㉒, and ImageNet—an image database using the WordNet hierarchy to propose a network of images [55]㉓
SemEval—a popular contest in NLP, occurring annually and challenging scientists with different NLP tasks㉔
and other particular methods, like (citations contain example papers): Bayesian methods [56], CBOW (Continuous Bag of Words) [57], Markov processes [58]
As described in Section 3.1.2, keyphrase search was used to extract those terms and findings that might have been skipped in the NER results. For example, the word “accuracy” denotes a widely used metric in NLP and many other domains. However, it is not a named entity, because it is also an “ordinary” English word, and is not detected as such by the NER models. The applied analysis produced a network of keyphrase co-occurrence. Hence, network visualization was, again, applied (Figure 10). This allowed formulation of hypotheses, which underwent further (positive) manual verification, specifically:
BERT models are most commonly used in their pretrained “version”/“state”. BERT is already a pretrained model, but it is possible to continue its training (to get a better representation of a particular language, topic or domain). The second approach is using BERT, or its pretrained variant, and training it on a target task, called a downstream task (this technique is also called “fine-tuning”).
Transformers are strongly connected with attention. This is because the transformer (a neural network architecture) is characterized by the presence of the attention mechanism. This is the distinguishing factor of this architecture [59].
“Music” is connected with “lyrics”. This shows that the intersection between NLP research and music domain is via lyrics analysis. The lack of correlation between music and other terms shows that audio analysis, sentiment analysis, etc. are not that popular in this context.
“Precision” is connected with “recall”. These two extremely popular evaluation metrics for classification are often used together. Their main point is to handle imbalanced datasets, where performance is not evaluated correctly by the “accuracy” [60] measure.
“Synset” is connected with “WordNet”. As shown, WordNet is most commonly used with Synset (a programmer-friendly interface available in the NLTK framework㉕).
Quantum computing begins to emerge in NLP. The oldest works in the field of quantum computing (in the set under study) date back to 2013 [61], but most (> 90%) of the recent works date to 2019-2021. These address problems such as: applying NLP algorithms on “nearly quantum” computers [62], sentence meaning inference with quantum circuit model(s) and encoding-decoding [63], quantum machine learning [64] or, even, ready-to-use Python libraries for quantum NLP [65]. There are still very few works joining the worlds of NLP and quantum computing, but their number has been growing significantly since 2019.
Graphs are very common in research related to semantic analysis. One of the domains that NLP overlaps with/includes is semantics. The entity network illustrates how important the concept of a graph is in semantics research (e.g. knowledge graphs). Some works touch upon these topics in tandem with text embedding [66], text summarization [67], knowledge extraction/inference/infusion [67] or question answering [68].
3.4.1 Text Summarization
Another approach to extracting key information (including the field of research) is to reduce the original text to a brief and simple “conclusion”. This can be done with extractive and abstractive summarization methods. Both aim at allowing the user to comprehend the main message of the text. Moreover, depending on which sentences are chosen by the extractive summarization methods, one may find which abstracts (and papers) are most “summaritive”.
Extractive summarization. First, the extractive methods have been used to summarize the text of all abstracts. Specifically, the following methods have been applied: LSA, LexRank and TextRank, as implemented in the pysummarization and sumy libraries (see Listings 1-4).
Here, note that, due to formatting errors in the original texts, the pysummarization㉖ library had trouble splitting “sentences with periods” (e.g. “3.5% by the two models, respectively.” is only a part of a full sentence, but it contains a period character).
Abstractive summarization. Previous research found that abstractive summarization methods can “understand the sense” of the text, and build its summary [73]. It was also found that their overall performance is better than that of extractive methods [74]. However, most state-of-the-art solutions have limitations related to the maximum number of tokens, i.e. BERT-like models (e.g. the distilbart-cnn-12-6 model [75], bart-large-cnn [75], bert-extractive-summarizer [76]) support a maximum of 512 tokens, while the largest Pegasus model supports 1024 [77].
Nevertheless, very recent work proposes a transformer model for long text summarization, the “Longformer” [78], which is designed to summarize texts of 4000 tokens and more. However, this capability comes with a high RAM requirement. So, in order to test abstractive methods, Longformer was applied only to the titles of the most influential texts (top 5% by citation count).
The final note about text summarization is that the most recent research proposes innovative ways to overcome the length issue (see [79]). There is thus a possibility to apply text summarization, for instance, to abstracts combined with the introductions and conclusions of research papers. Testing this possibility may be a good starting point for research, but it is out of scope of this contribution.
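To give a flavour of how such abstractive models are typically invoked, the sketch below uses the Hugging Face pipeline API with the distilbart-cnn-12-6 checkpoint mentioned above (the exact checkpoint identifier is an assumption; the input must fit within the model's token limit):

```python
from transformers import pipeline

# abstractive summarization of a single (illustrative) abstract
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
abstract = ("Natural language processing, as a data analytics related "
            "technology, is used widely in many research areas such as "
            "artificial intelligence, human language processing, and "
            "translation.")
print(summarizer(abstract, max_length=40, min_length=10)[0]["summary_text"])
```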
3.4.2 Summarization Findings
Listings 1, 2, 3, 4, show summaries of all abstracts and Listing 5 shows summary of all titles (as described in Section 3.4.1).
The common part for all summaries addresses (in a hierarchical order, starting from most popular features):
natural language processing and artificial intelligence,
translation and image processing,
neural networks,
deep neural network architectures, e.g. CNN, RNN, encoder-decoder, transformers, and
deep neural network models, e.g. BERT, ELMO.
Moreover, the main “ideas” that appear in the summaries are: effectiveness, “state-of-the-art” solutions, and solutions “better than others”. This shows the “competitive” and “progress-focused” nature of the domain. Authors find it necessary to highlight how “good” or “better than” their solution is. It may also mean that there is not much space for “exploratory” and “non-results-oriented” research (at least this is the message that permeates the top cited articles). Similarly, research indicating which approaches do not work in a given domain is not appreciated.
Summary with LSA (512.9 sec)
Natural language processing, as a data analytics related technology, is used widely in many research areas such as artificial intelligence, human language processing, and translation. [paper id: 1608.04434v1]
At present, due to explosive growth of data, there are many challenges for natural language processing. [paper id: 1608.04434v1]
Hadoop is one of the platforms that can process the large amount of data required for natural language processing. [paper id: 1608.04434v1]
KOSHIK is one of the natural language processing architectures, and utilizes Hadoop and contains language processing components such as Stanford CoreNLP and OpenNLP. [paper id: 1608.04434v1]
This study describes how to build a KOSHIK platform with the relevant tools, and provides the steps to analyze wiki data. [paper id: 1608.04434v1]
Summary with sumy-LSA (512.9 sec)
Natural language processing, as a data analytics related technology, is used widely in many research areas such as artificial intelligence, human language processing, and translation. [paper id: 1608.04434v1]
At present, due to explosive growth of data, there are many challenges for natural language processing. [paper id: 1608.04434v1]
Hadoop is one of the platforms that can process the large amount of data required for natural language processing. [paper id: 1608.04434v1]
KOSHIK is one of the natural language processing architectures, and utilizes Hadoop and contains language processing components such as Stanford CoreNLP and OpenNLP. [paper id: 1608.04434v1]
This study describes how to build a KOSHIK platform with the relevant tools, and provides the steps to analyze wiki data. [paper id: 1608.04434v1]
Summary with LexRank (11323.26 sec)
Many natural language processing applications use language models to generate text. [paper id: 1511.06732v7]
However, there is no known natural language processing (NLP) work on this language. [paper id: 1912.03444v1]
However, few have been presented in the natural language process domain. [paper id: 2107.07114v1]
Here, we show their effectiveness in natural language processing. [paper id: 2109.04712v1]
The other two methods however, are not as useful. [paper id: 2109.01411v1]
Summary with sumy-TextRank (497.67 sec)
Recently, neural models pretrained on a language modeling task, such as ELMo (Peters et al., 2017), OpenAI GPT (Radford et al., 2018), and BERT (Devlin et al., 2018), have achieved impressive results on various natural language processing tasks such as question-answering and natural language inference. [paper id: 1901.04085v5]
In chapter 1, we give a brief introduction of the history and the current landscape of collaborative filtering and ranking; chapter 2 we first talk about pointwise collaborative filtering problem with graph information, and how our proposed new method can encode very deep graph information which helps four existing graph collaborative filtering algorithms; chapter 3 is on the pairwise approach for collaborative ranking and how we speed up the algorithm to near-linear time complexity; chapter 4 is on the new listwise approach for collaborative ranking and how the listwise approach is a better choice of loss for both explicit and implicit feedback over pointwise and pairwise loss; chapter 5 is about the new regularization technique Stochastic Shared Embeddings (SSE) we proposed for embedding layers and how it is both theoretically sound and empirically effectively for 6 different tasks across recommendation and natural language processing; chapter 6 is how we introduce personalization for the state-of-the-art sequential recommendation model with the help of SSE, which plays an important role in preventing our personalized model from overfitting to the training data; chapter 7, we summarize what we have achieved so far and predict what the future directions can be; chapter 8 is the appendix to all the chapters. [paper id: 2002.12312v1]
We explore how well the model performs on several languages across several tasks: a diagnostic classification probing the embeddings for a particular syntactic property, a cloze task testing the language modelling ability to fill in gaps in a sentence, and a natural language generation task testing for the ability to produce coherent text fitting a given context. [paper id: 1910.03806v1]
Neural Architecture Search (NAS) methods, which automatically learn entire neural model or individual neural cell architectures, have recently achieved competitive or state-of-the-art (SOTA) performance on variety of natural language processing and computer vision tasks, including language modeling, natural language inference, and image classification. [paper id: 2010.04249v1]
Transfer learning in natural language processing (NLP), as realized using models like BERT (Bi-directional Encoder Representation from Transformer), has significantly improved language representation with models that can tackle challenging language problems. [paper id: 2104.08335v1]
Summary of all titles with Longformer
‘The Natural Language Processing (NLT) is a new tool that can teach people about the world. The tool is based on the data collected by CNN and RNN. A survey of the Usages of Deep Learning was carried out by the 2015 MSCOCO Image Search. It was created by a survey of people in the UK and the US. An image is worth 16x16 words, and a survey reveals how many people are interested in the language.’
3.5 RQ1, RQ2, RQ3: Relations between NLP Datasets, Languages, and Topics of Research
In addition to the separate results for RQ1, RQ2 and RQ3, there are situations when the important information is the coincidence of these three aspects: NLP datasets, languages, and research topics. The dataset-language-problem triplet is usually fixed on two positions. For example, research may be focused on machine translation (problem) into English (language), but be missing a corpus (dataset); or a group of Chinese researchers (language) may have access to a rich Twitter API (dataset), but be considering which type of analysis (problem) is most prominent. This sparks the question of which datasets are used, with which languages, and for what problems. The presented results of correlations between these 3 aspects are divided into two groups, for the 2 most popular languages: English and Chinese. They are shown in Figure 11. The remaining results, for selected languages among the most popular ones, can be found in Figures 12 and 13.
For the English and Chinese languages (being the subject of NLP research) the distribution of problems is very similar. The top problems are: machine translation, question answering, sentiment analysis and summarization. The most popular dataset used for all of these problems is Wikipedia. Additionally, for sentiment analysis, there is a significant number of contributions that also use Twitter. All of these observations are consistent with previous results (reported in Sections 3.1, 3.2 and 3.6).
Before going into languages other than English and Chinese, it is crucial to recall that this analysis focused only on articles written in English. Hence, the reported results may be biased in the case of research devoted to other language(s). Nevertheless, there exists a large body of work about NLP applied to non-English languages, which is written in English. For instance, among all papers analyzed for this contribution, 41% were devoted to NLP in the context of neither English (non-English papers are 46% of the dataset) nor Chinese (non-Chinese papers are 80% of the dataset).
The most important observation is that the distribution of problems for languages other than English and Chinese is, overall, similar (Machine Translation, Question-Answering, sentiment and summarization are the most popular ones). However, there are also some distinguishable differences:
For German and French, summarization, language modelling and natural language inference, and named entity recognition are the key research areas.
In Arabic, Italian, Japanese, Polish, Estonian, Swedish and Finnish, there is a visible trend of interest in named entity recognition.
Dependency parsing is more pronounced in research on languages such as German, French, Czech, Japanese, Spanish, Slovene, Swahili and Russian.
In Basque, Ukrainian and Bulgarian, the domain does not have a particularly homogeneous subdomain distribution. The problems of interest are: co-reference resolution, dependency parsing, dialogue-focused research, language modeling, machine translation, multitask learning, named entity recognition, natural language inference, part-of-speech tagging, and question answering.
In Bengali, a special area of interest is part-of-speech tagging.
Research focused on Catalan has a particular interest in dialogue-related texts.
Research regarding Indonesian has a very high percentage of sentiment analysis studies, even higher than the most popular topic of machine translation.
Studies on the Norwegian language are strongly focused on sentiment analysis, which exceeds the most common domain of most of the languages—machine translation.
Research focusing on Russian puts special effort into analyzing dialogues and dependency parsing.
There are only minimal differences between the datasets used for English and Chinese, and those used for other languages. The key ones are:
Facebook is present as one of the main sources in many languages, being a particularly popular data source for Bengali and Spanish
Twitter is a key data source in research on languages: Arabic, Dutch, French, German, Hindi, Italian, Korean, Spanish, Tamil
WordNet is very often used in research involving: Moldovan and Romanian
Tibetan language research nearly never uses Twitter as the dataset.
3.6 Findings Concerning RQ4: Most Popular Specific Tasks and Problems
At the heart of research is yet another key aspect—the specific problem that is being tackled, or the task that is being solved. This may seem similar to the domain, or to the general direction of the research. However, some general problems contain specific problems (e.g. machine translation and English-Chinese machine translation, or named entity recognition and named entity linking). On the other hand, some specific problems have more complicated relations, e.g. machine translation, which in NLP can be solved using neural networks, while neural networks are also an independent domain of their own, which is also a superdomain (or a subdomain) of, for instance, image recognition. These complicated relations point to the need for a standardized NLP taxonomy. This, however, is also out of scope of this contribution.
Let us come back to the methods of analyzing specific tasks. To extract the most popular specific tasks and particular problems, the methods described above, such as NER, keyphrase search, metadata mining, text summarization, and network visualization, were used. Before presenting specific results, an important aspect of keyphrase search needs to be mentioned. An unsupervised search for particular specific topics of research cannot be reasonably performed: all approaches to unsupervised keyphrase search that were tried (in an exploratory fashion) produced thousands of potential results. Therefore, supervised keyphrase search has been applied, with the NLP problems determined based on an exhaustive (multilingual) list aggregating the most popular NLP tasks㉗.
The list has been extracted from the website and pruned of any additional markdown㉘, to obtain a clean text format. Next, all keywords and keyphrases from the text of each paper have been compared with the NLP task list. Finally, each paper has been assigned the list of problems found in its text. Figure 14 shows the popularity (by count) of problems addressed in the NLP literature.
Again, there is a dominating problem: machine translation. This is very intuitive, if one takes into account recent studies [80, 81, 82, 83, 84] showing that the lack of high-fidelity machine translation remains the key barrier to world-wide communication. This problem seems very persistent, because it was indicated also in older research (e.g. in a text from 1968 [85]). Here, it is important to recall that this contribution is likely to be biased towards translation involving the English language, because only English-written literature was analyzed.
The remaining problems in the top 3 are question answering [86] and sentiment analysis [87]. In both of these domains, there are already state-of-the-art models ready to be used㉙. Interestingly, for both question answering and sentiment analysis, most of the models are based either on BERT or its variation, DistilBERT [88].
3.7 RQ5: Seeking Outliers in the NLP Domain
Some scientific research areas are homogeneous, and all publications revolve around a similar topic (or group of topics). On the other hand, some can be very diverse, with individual papers touching very different subfields. Finally, there are also domains where, from a more or less homogeneous set, a separate, distinguishable subset can be pointed to. To verify the structure of the field of NLP, two methods have been used. One is, previously introduced, metadata mining. The second one is text embedding and clustering. Let us briefly introduce the second one.
3.7.1 Text Embeddings
Among the ubiquitous methods in text processing are word, sentence and document embeddings. Text embeddings, which “convert texts to numbers”, have been used to determine the key differences/similarities between the analyzed texts.
Embeddings can be divided into: contextualized and context-less [89]. Scientific papers often use words that strongly depend on the context. The prime example is the word “BERT” [48], which, on the one hand, is a character from a TV show, but in the NLP world is the name of one of the state-of-the-art embedding models. In this context, envision application of BERT, the NLP method, to the analysis of dialogues in children's TV, where one of the dialogues would include Bert, the character. A similar situation concerns words like network (either neural network, graph network, social network, or computer network), “spark” [90] (either a small fiery particle, or the name of a popular Big Data library), lemma (either a proven proposition in logic, or a morphological form of a word), etc. Hence, in this study, using contextualized text embeddings is more appropriate. This being the case, the very popular static text embeddings, like GloVe [91] and Word2Vec [92, 93], have not been used.
There are many libraries and models available for contextualized text embedding, e.g. libraries: transformers [33], flair [94], gensim [95]; and models: BERT [48] (and its variations, like RoBERTa [96] and DistilBERT [88]), GPT-2 [97], T5 [98], ELMo [99] and others. However, most of them require specific, high-end hardware to operate reasonably fast (i.e. GPU acceleration [100]). Here, the decision was to proceed with FastText [101]. FastText is designed to produce time-efficient results, which can be recreated on standard hardware. Moreover, it is designed for “text representations and text classifiers”㉚, which is exactly what is needed in this work.
3.7.2 Embedding and Clustering
It is important to highlight that, since FastText, like most embeddings, has been trained on rather noisy data [101], the input text of the articles was preprocessed only with Stage 1 cleaning (see Section 2.2). Next, a grid search [102] was performed to tune hyperparameters. While, as noted earlier, hyperparameter tuning has not been applied elsewhere in this work, the use of grid search reported here illustrates that there exist ready-to-use libraries that can be applied when hyperparameter tuning is required. Overall, the best embeddings were produced by a model with the following hyperparameters㉛:
dimension: 20
minimum subword size: 3
maximum subword size: 6
number of epochs: 5
learning rate: 0.00005
Finally, the FastText model was further trained in an unsupervised mode (which is standard in the majority of cases for general language modelling), on the texts of the papers, to better fit the representation.
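A minimal training sketch, with the hyperparameters listed above, may look as follows (using the Python fasttext package; the input file name is hypothetical and would contain the Stage 1-cleaned paper texts):

```python
import fasttext

# unsupervised skip-gram training with the hyperparameters listed above
model = fasttext.train_unsupervised(
    "stage1_cleaned_papers.txt", model="skipgram",
    dim=20, minn=3, maxn=6, epoch=5, lr=0.00005)

# one 20-dimensional vector per document/sentence
vec = model.get_sentence_vector("natural language processing")
print(vec.shape)  # (20,)
```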
After the embeddings had been calculated, their vector representations were clustered. Since there was no response variable, an unsupervised algorithm was applied. Again (as in Section 3.7.1), the main goal was simplicity and time efficiency.
Out of all the tested algorithms (K-means [103], OPTICS [104, 105], DBSCAN [106, 107], HDBSCAN [108] and Birch [109]), the best time efficiency, combined with relative simplicity of use, was achieved with K-means (see, also, [110, 111]). Moreover, in the surveyed research, K-means clustering showed the best results when applied to FastText embeddings (see [112]).
The evaluation of the clustering has been performed using three clustering metrics: the Silhouette score [113], the Davies-Bouldin score [114], and the Caliński-Harabasz score [115]. These metrics were chosen because they allow evaluation of unsupervised clustering. To visualize the results on a 2D plane, the multidimensional FastText vectors were converted with the t-distributed stochastic neighbor embedding (t-SNE) method [116, 117]. t-SNE has been suggested by text embedding visualizations reported in earlier work [118, 119].
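The whole clustering-evaluation-projection step can be sketched with scikit-learn as follows (the random matrix stands in for the actual 20-dimensional document embeddings):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)

X = np.random.rand(200, 20)  # placeholder for the FastText document vectors

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels),
      davies_bouldin_score(X, labels),
      calinski_harabasz_score(X, labels))

X2d = TSNE(n_components=2, random_state=0).fit_transform(X)  # for plotting
```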
3.7.3 RQ5: Outliers Found in the NLP Research
Visualizations of embeddings are shown in Figure 15.
Note that Figure 15 is mainly aesthetic, as actual relations are rarely visible when dimension reduction is applied. The number of clusters has been evaluated according to the 3 clustering metrics (the Silhouette score [113], the Davies-Bouldin score [114], and the Caliński-Harabasz score [115]), and the best clustering score has been achieved for 2 clusters. Hence, further analysis considers the separation of the embeddings into 2 clusters. To further explore why these particular embeddings appear in the same group, various tests were performed. First, wordclouds of texts (titles and paper texts) in the clusters have been built. The texts for the wordclouds were processed with Stage 2 cleaning. Title wordclouds are shown in Figure 2, while text wordclouds are shown in Figure 3.
Further, a citation count comparison (Figures 16 and 17) and the authors were checked for texts in both clusters. Last, the differences in topics from Semantic Scholar (Figures 18 and 19) and categories from arXiv (Figures 20 and 21) have been checked.
Based on the content of Figures 2, 3, 16, 17, 18, 19, 20, 21, and the author-per-cluster distribution analysis, the following conclusions have been drawn:
There is one specific outlier: the cluster of works related to text embeddings.
The content of the texts shows a strong topical shift towards deep neural networks.
The categories and topics of the clusters are not particularly far from each other, as their distributions are similar. There is a higher representation of the computer vision and information retrieval areas in the smaller cluster (cluster 0).
There are no distinguishable authors who are responsible for texts in both clusters.
The distribution of citation counts is similar in both clusters.
Furthermore, manual verification showed that deep neural networks are actually the biggest subdomain of NLP, and touch upon issues that do not appear in other works. These issues are strictly related to neural networks (e.g. the attention mechanism, network architectures, transfer learning, etc.). They are universal, and their applications play an important role in NLP, but also in other domains (image processing [120], signal processing [121], anomaly detection [122], clinical medicine [123] and many others [124]).
3.7.4 “Most Original Papers”
In addition to unsupervised clustering, an additional approach to outlier detection has been applied. Specifically, the metadata representing citation/reference information was further analyzed. On one end of the “citation spectrum” are the most influential works (as shown in Section 3.3.3). On the other end, there are papers that either are new and have not been cited yet, or do not have high influence.
However, the truly “original” works are papers that have many citations (they are in the top 2 percentiles), but very few references (the bottom 2 percentiles). Based on the performed analysis, it was found that such papers are:
“Natural Language Processing (almost) from Scratch” [125]—a neural network approach to learning internal representations of text, based on unlabeled training data. A similar idea was used in future publications, especially, the most cited paper about BERT model [48].
“Experimental Support for a Categorical Compositional Distributional Model of Meaning” [126]—a paper about “modelling compositional meaning for sentences using empirical distributional methods”.
“Gaussian error linear units (gelus)” [127]—the paper introducing GELU, a new activation function for neural networks, which was extensively tested in subsequent research [128].
Each of these papers introduced novel, very innovative ideas that inspired further research directions. They can thus be treated as belonging to a unique (separate) subset of contributions.
3.8 RQ6: Text Comprehension
Finally, an additional aspect of the texts belonging to the dataset was measured: text comprehensibility. This is a very complicated problem, which is still being explored. Taking into account that one of the considered audiences is researchers interested in starting work in NLP, text difficulty was evaluated using existing text complexity metrics. An important note is that these metrics are known to have problems, such as not accounting for complicated mathematical formulas, and skipping charts, pictures and other visuals. Keeping this in mind, let us proceed further.
3.8.1 Text Complexity
The most common comprehensibility measures map text to a school grade in the American education system [129]. In this way, it is established what the expected level of a reader who should be able to understand the text is. Fifteen such measures were used.
All measures return results on an equal scale (school grade), and they were all consistent in terms of paper scores. Since they refer to the same scale, to provide the least biased results, the numerical values (Section 3.8.2) have been averaged into a single, straightforward measure of text complexity. Here, it should be noted that this was done also because delving into a discussion of the ultimate validity of individual comprehensibility measurements, and the pros/cons of each of them, is out of scope of the current contribution. Rather, the combined measure was calculated to obtain a general idea as to the “readability” of the literature in question.
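For illustration, grade-level metrics of this kind are available, e.g., in the Python textstat package; the sketch below averages a few of them (which exact 15 metrics were combined in this study is not restated here, so this particular selection is only an assumption):

```python
import textstat

text = "Natural language processing is rapidly growing in popularity."

# a few common school-grade metrics; the averaged value is the combined score
scores = [
    textstat.flesch_kincaid_grade(text),
    textstat.coleman_liau_index(text),
    textstat.gunning_fog(text),
    textstat.smog_index(text),
]
print(sum(scores) / len(scores))
```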
3.8.2 RQ6: Establishing Complexity Level of NLP Literature
Results of the text complexity (RQ6) are rather intuitive.
As shown in Figure 22, the averaged score of the 15 comprehensibility metrics suggests that the majority of papers in the NLP domain can be understood by a person after the “15th grade”. This roughly matches a person who finished the “1st stage” of college education (engineering studies, a bachelor's degree, and similar). Obviously, this result shows that the use of such metrics on “scientific texts” has limited applicability, as they are based mostly on syntactic features of the text, while the semantics makes some texts difficult to follow even for specialists. This particularly applies to texts containing mathematical equations, which were removed during text preprocessing.
3.9 Summary of Key Results
Let us now summarize the key findings, in the form of a question and answer for each of the RQs postulated in Section 1.
RQ1: What datasets are considered to be most useful?
The datasets used most commonly for NLP research are: Wikipedia, Twitter, Facebook, WordNet, arXiv, Academic, SST (The Stanford Sentiment Treebank), SQuAD (The Stanford Question Answering Dataset), NLI and SNLI (Stanford Natural Language Inference Corpus), COCO (Common Objects in Context), Reddit.
RQ2: Which languages, other than English, appear as a topic of NLP research?
Languages analyzed most commonly in NLP research, apart from English and Chinese, are: German, French and Spanish.
RQ3: What are the most popular fields and topics in current NLP research?
The most popular fields studied in NLP literature are: Natural Language Processing/Language Computing, artificial intelligence, machine learning, neural networks and deep learning and text embedding.
RQ4: What particular tasks and problems are most often studied?
Particular tasks and problems, which appear in the literature, are: text embedding with BERT and transformers, machine translation between English and other languages (especially English-Chinese), sentiment analysis (most popular with Twitter and Wikipedia datasets), question answering models (with Wikipedia and SQuAD datasets), named entity recognition, and text summarization.
RQ5: Is the field “homogenous”, or are there easily identifiable “subgroups”?
According to the text embedding analysis, there is not enough evidence to identify strongly distinguishable clusters. Hence, there are no outstanding subgroups in the NLP literature.
RQ6: How difficult is it to comprehend the NLP literature?
According to the averaged standard comprehensibility measures, scientific texts related to NLP can be digested by a 15th grader, which maps to the 3rd year of higher education (e.g. college, bachelor's degree studies, etc.).
4. CONCLUDING REMARKS
This analysis used Natural Language Processing methods to analyze the scientific literature related to NLP itself. The goal was to answer 6 research questions (RQ1-RQ6). A total of 4712 scientific papers in the field of NLP, from arXiv, were analyzed. The work used, and at the same time illustrated, the following NLP methods: text extraction, text cleaning, text preprocessing, keyword and keyphrase search, text embeddings, abstractive and extractive text summarization, and text complexity evaluation, as well as other methods, such as: clustering, metadata analysis, citation/reference analysis, and network visualization. This analysis focused only on Natural Language Processing and its subdomains, topics, etc. Since the procedures for obtaining the results reported here were fully automated, the same or a similar analysis could easily be repeated for literature in different languages, and even different fields. Hence, all the tools used for the analysis are available in a designated repository㉜ for future applications.
Specifically, the query had the form http://export.arxiv.org/api/query?search_query=all:%22natural%20language%20processing%22&start=0&max_results=10000. Since such a query may take a long time to load, one can reduce the loading time by changing the value of the max_results parameter to a smaller number, e.g. 5.