Wikipedia Citations: A comprehensive dataset of citations with identifiers extracted from English Wikipedia

Wikipedia's contents are based on reliable and published sources. To this date, little is known about what sources Wikipedia relies on, in part because extracting citations and identifying cited sources is challenging. To close this gap, we release Wikipedia Citations, a comprehensive dataset of citations extracted from Wikipedia. A total of 29.3M citations were extracted from 6.1M English Wikipedia articles as of May 2020, and classified as being to books, journal articles or Web contents. We were thus able to extract 4.0M citations to scholarly publications with known identifiers -- including DOI, PMC, PMID, and ISBN -- and further labeled an extra 261K citations with DOIs from Crossref. As a result, we find that 6.7% of Wikipedia articles cite at least one journal article with an associated DOI. Scientific articles cited from Wikipedia correspond to 3.5% of all articles with a DOI currently indexed in the Web of Science. We release all our code to allow the community to extend upon our work and update the dataset in the future.


Introduction
"Citations have several important purposes: to uphold intellectual honesty (or avoiding plagiarism), to attribute prior or unoriginal work and ideas to the correct sources, to allow the reader to determine independently whether the referenced material supports the author's argument in the claimed way, and to help the reader gauge the strength and validity of the material the author has used."
Wikipedia plays a fundamental role as a source of factual information on the Web: it is widely used by individual users as well as third-party services, such as search engines and knowledge bases [21,25]. Most importantly, Wikipedia is often perceived and relied upon as a source of "neutral" information [26]. The confidence that users and services place in Wikipedia has been found to be usually justified: Wikipedia's content is generally of high quality and up-to-date [33,16,11,19,32,7].
To reach this goal, Wikipedia's verifiability policy mandates that "people using the encyclopedia can check that the information comes from a reliable source." A reliable source is defined, in turn, as a secondary and published, ideally scholarly one. Despite the community's best efforts to add all the needed citations, the majority of articles in Wikipedia might still contain unverified claims, in particular lower-quality ones [22]. The citation practices of editors are also at times not systematic [6,10]. As a consequence, efforts to expand and improve Wikipedia's verifiability through citations to sources are increasing [9,35].
A crucial question to ask in order to improve Wikipedia's verifiability standards, as well as to better understand its dominant role as a source of information, is the following: what sources are cited in Wikipedia?
A large proportion of citations to sources in Wikipedia refer to scientific or scholarly literature [28], as Wikipedia is instrumental in providing access to scientific information and in fostering the public understanding of science [20,13,22,37,24,46,23,41]. Citations in Wikipedia are also useful for users browsing low-quality or underdeveloped articles, as they allow them to look outside of the platform [30]. The literature cited in Wikipedia has been found to correlate positively with journal popularity, journal impact factor and open access availability [27,43,1]. Being cited in Wikipedia can also be considered as an 'altmetric' indicator of impact in itself [42,18]. A clear influence of Wikipedia on scientific research has in turn been found [44], despite the general lack of acknowledgement of Wikipedia in the scientific literature [15,45]. Nevertheless, the evidence on what scientific and scholarly literature is cited in Wikipedia is quite slim. Early studies point to a relatively low coverage, indicating that between 1% and 5% of all published journal articles are cited in Wikipedia [34,39,49]. However, these studies either use proprietary databases with limited coverage, or only consider specific publishers (PLoS) and academic communities (computer science).
Answering the question of what exactly is cited in Wikipedia is challenging for a variety of reasons. First of all, editorial practices are not uniform: citations are often given using citation templates somewhat liberally, making it difficult to detect citations to the same source. Secondly, while some citations contain stable identifiers (e.g., DOIs), others do not. A recent study found that 4.42% of Wikipedia articles contain at least one citation with a DOI [24]: a low number which might indicate that we are missing a non-negligible fraction of citations without identifiers. This is a significant limitation since existing databases, such as Altmetrics, provide Wikipedia citation metrics relying exclusively on citations with identifiers. This in turn limits the scope of results relying on these data.
Our goal here is to overcome these two challenges and expand upon previous work [12], by providing a dataset of all citations from the English Wikipedia, equipped with identifiers and including the code to replicate and improve upon our work. The dataset is available on Zenodo [40] and the accompanying repository contains all code and further documentation to replicate our results.
This article is organized as follows. We start by describing our pipeline, focusing on its three main steps: 1) citation template harmonization, to structure every citation in Wikipedia using the same schema; 2) citation classification, to find citations to books and journal articles; and 3) citation identifier look-up, to find identifiers such as DOIs. We subsequently evaluate our results, provide a description of the published dataset, and conclude by highlighting some possible uses of the dataset as well as ideas to improve it further.

Methodology
We start by briefly introducing Wikipedia-specific terminology:
• Wikicode: The markup language used to write Wikipedia pages; also known as Wikitext or Wiki markup.
• Template: A page that is embedded into other pages to allow for the repetition of information, following a certain Wikicode format. Citation templates are specifically defined to embed citations.
• Citation: A citation is an abbreviated alphanumeric expression, embedded in Wikicode following a citation template, as shown in Figure 1; it usually denotes an entry in the References section of a Wikipedia page, but can be used anywhere on a page too (e.g., Notes, Further work).

Overview
Our process can be broken down into the following steps, as illustrated in Figure 2:
1. Citation data extraction: A Wikipedia dump is used to extract citations from all pages, considering various citation templates. The extracted citations are then mapped to a uniform set of key-value pairs.
2. Citation data classification: A classifier is trained to distinguish between citations to journal articles, books, or other Web content. The classifier is trained using a subset of citations already equipped with known identifiers or URLs, which allows us to label them beforehand. All the remaining citations are then classified.
3. Citation data lookup: All newly found citations to journal articles are labeled with identifiers (DOIs) using the Crossref API.

Citation data extraction
The citation data extraction pipeline is in turn divided into two steps, which are repeated for every Wikipedia article: 1) extraction of all sentences which contain text in Wikicode format, and filtering of sentences using the citation template Wikicode; 2) mapping of extracted citations to the uniform template and creation of a tabular dataset. An example of Wikicode citations, extracted during step 1, is given in Table 1. The same citations after mapping to a uniform template are given in Table 2.

Extraction and filtering
We used the English Wikipedia XML dump from May 2020 and parsed it to get the content of each article/page. The number of unique pages is 6,069,685 after removing redirects, since they do not have any citations of their own. When multiple citations to the same source are given in a page, we only consider the first one. The number of extracted citations is 29,276,667.
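As a minimal sketch of this extraction and filtering step (not the code in the accompanying repository), the snippet below pulls top-level `{{...}}` templates out of a Wikicode string, keeps those whose name looks like a citation template, and retains only the first occurrence of each. The function names, the `CITATION_PREFIXES` filter, and the use of normalized template text to detect "the same source" are all assumptions made for illustration.

```python
import re

# Assumption: citation templates have names such as "cite web" or "Citation".
CITATION_PREFIXES = ("cite ", "citation")


def extract_templates(wikicode):
    """Return every top-level {{...}} template found in a Wikicode string."""
    templates, depth, start = [], 0, None
    i = 0
    while i < len(wikicode) - 1:
        pair = wikicode[i:i + 2]
        if pair == "{{":
            if depth == 0:
                start = i
            depth += 1
            i += 2
        elif pair == "}}" and depth > 0:
            depth -= 1
            if depth == 0:
                templates.append(wikicode[start:i + 2])
            i += 2
        else:
            i += 1
    return templates


def is_citation(template):
    """Check the template name (the text before the first '|')."""
    name = template[2:-2].split("|", 1)[0].strip().lower()
    return name.startswith(CITATION_PREFIXES)


def extract_citations(wikicode):
    """Keep citation templates, dropping repeats of the same template text."""
    seen, citations = set(), []
    for template in extract_templates(wikicode):
        # Simplification: two citations are "the same source" if their
        # whitespace-normalized, lowercased text is identical.
        key = re.sub(r"\s+", " ", template).strip().lower()
        if is_citation(template) and key not in seen:
            seen.add(key)
            citations.append(template)
    return citations
```

A production pipeline would more likely rely on a dedicated Wikicode parser, since brace matching alone does not handle every edge case of the markup.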

Mapping
Citation templates can vary, and different templates can be used to refer to the same source in different pages. Therefore, we mapped all citations to the same uniform template. For this step, we used the wikiciteparser parser. This parser is written in Lua and it can be imported into Python using the lupa library. The uniform template we use comprises 29 different keys. Initially, the wikiciteparser parser only supported 17 citation templates; we therefore added support for an additional 18 of the most frequently used templates. More details on the uniform template keys and the extra templates we implemented can be found in the accompanying repository.
The resulting uniform key-value dataset can easily be transformed into tabular form for further processing. In particular, this first step allowed us to construct a dataset of citations with identifiers containing approximately 3.928 million citations. These identifiers (DOI, PMC, PMID and ISBN) allowed us to use such citations as training data for the classifier.
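The idea of mapping heterogeneous templates onto one schema can be illustrated with a small, self-contained sketch. It does not use wikiciteparser; the `KEY_ALIASES` table and the uniform key names below are hypothetical stand-ins for the 29-key schema documented in the repository.

```python
# Hypothetical alias table: template parameter -> uniform key.
# The real uniform template has 29 keys; these entries are illustrative only.
KEY_ALIASES = {
    "title": "title",
    "last": "author_last", "last1": "author_last",
    "first": "author_first", "first1": "author_first",
    "journal": "periodical", "work": "periodical",
    "newspaper": "periodical", "website": "periodical",
    "date": "date", "year": "date",
    "doi": "doi", "isbn": "isbn", "pmid": "pmid", "pmc": "pmc",
    "url": "url",
}


def parse_template(template):
    """Split '{{name|k=v|...}}' into (name, params). Ignores nested templates."""
    inner = template.strip()[2:-2]
    parts = inner.split("|")
    name = parts[0].strip().lower()
    params = {}
    for part in parts[1:]:
        if "=" in part:
            key, value = part.split("=", 1)
            params[key.strip().lower()] = value.strip()
    return name, params


def to_uniform(template):
    """Map one citation template to a record with uniform keys."""
    name, params = parse_template(template)
    record = {"template": name}
    for key, value in params.items():
        uniform_key = KEY_ALIASES.get(key)
        if uniform_key and value:
            # Keep the first value seen when several aliases collide.
            record.setdefault(uniform_key, value)
    return record
```

With such a mapping, `{{cite journal|journal=Nature|...}}` and `{{Citation|work=Nature|...}}` yield records with the same keys, which is what makes the later tabular processing straightforward.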

Citation data classification
After having extracted all citations and mapped them to a uniform template, we proceed to train a classifier to distinguish among three categories of cited sources: journal articles, books and Web content. Our primary focus is journal articles, as those cover most citations to scientific sources. We describe here our approach to label a golden dataset to use for training, the features we use for the classifier, and the classification model.

Labeling
We labeled available citations as follows:
• Every citation with a non-null PMC or PMID was labeled as a journal article.
• Every citation with a non-null PMC, PMID or DOI and using the citation template for journals and conferences was labeled as a journal article.
• Every citation which had a non-null ISBN was labeled as a book.
• All citations whose URL domain name was one of the following: nytimes, bbc, washingtonpost, cnn, theguardian, huffingtonpost, indiatimes, were labeled as Web content.
• All citations whose URL domain name was one of the following: youtube, rollingstone, billboard, mtv, metacritic, discogs, allmusic, were labeled as Web content.
After labelling, we removed all identifiers and the type of citation template as features, since they were used to label the dataset. We also removed the fields: URL, work, newspaper, website, for the same reason. The final number of data points used for training and testing the classifier is given in Table 3, and was partially sampled in order to have a comparable number of journal articles, books and Web content.
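The labeling rules above can be sketched as a single function over uniform-template records. The record keys, the `JOURNAL_TEMPLATES` names and the order in which the rules are tried are assumptions; the text does not state how conflicts (e.g., a citation with both a PMID and an ISBN) were resolved.

```python
from urllib.parse import urlparse

WEB_DOMAINS = {
    "nytimes", "bbc", "washingtonpost", "cnn", "theguardian",
    "huffingtonpost", "indiatimes",
    "youtube", "rollingstone", "billboard", "mtv", "metacritic",
    "discogs", "allmusic",
}
# Assumption: names of the journal and conference citation templates.
JOURNAL_TEMPLATES = {"cite journal", "cite conference"}
# Fields used for labeling, removed again before training.
LABEL_FIELDS = {"doi", "pmc", "pmid", "isbn", "template",
                "url", "work", "newspaper", "website"}


def label_citation(citation):
    """Return 'journal', 'book', 'web', or None if no rule applies."""
    if citation.get("pmc") or citation.get("pmid"):
        return "journal"
    if citation.get("doi") and citation.get("template") in JOURNAL_TEMPLATES:
        return "journal"
    if citation.get("isbn"):
        return "book"
    host = urlparse(citation.get("url") or "").netloc.lower()
    if any(part in WEB_DOMAINS for part in host.split(".")):
        return "web"
    return None


def strip_label_fields(citation):
    """Drop the fields that were used to derive the label."""
    return {k: v for k, v in citation.items() if k not in LABEL_FIELDS}
```

Citations for which `label_citation` returns `None` are the ones left for the classifier to categorize.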

Features
We next describe the features we used for the classification model:
• Citation text: The text of the citation, in Wikicode syntax.
• Citation statement: The text preceding a citation in a Wikipedia article, as it is known that certain statements are more likely to contain citations [35]. We have used the 40 words preceding the first time a source is cited in an article.
• Part of Speech (POS) tags: POS tags for citation statements could also be correlated to citations [35]. These were generated using the NLTK library.
• Citation section: The article section a citation occurs in.
• Order of the citation within the article, and total number of words of the article.
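To make the citation statement feature concrete, the sketch below returns the words preceding a citation, given the character offset at which the citation starts. Whitespace tokenization and the function name are assumptions; the released code may tokenize differently.

```python
def citation_statement(article_text, citation_offset, n_words=40):
    """Return up to n_words whitespace-separated words before the citation."""
    preceding = article_text[:citation_offset].split()
    return " ".join(preceding[-n_words:])
```

For the first citation of a source, the offset could be obtained with `article_text.find(citation_text)`, which returns the position of the first occurrence.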

Classification model
The model which we constructed is a hybrid deep learning pipeline illustrated in Figure 3. The features were represented as follows:
• Citation text: The citation text in Wikicode syntax was fed to a character-level bidirectional LSTM [36] on the dummy task of predicting whether the citation refers to a book/journal article or to other Web content. The train-test split was done using a 90-10 ratio, yielding a 98.56% test accuracy. We used this dummy task in order to avoid the effects of vocabulary sparsity due to Wikicode syntax, which is also why we used character-level embeddings. The embeddings are of dimension 300; we aggregated them for every citation text via summation and normalized the resulting vector to sum to one. The citation text embeddings were trained on the dummy task and frozen afterwards.
• Citation statement: The vocabulary for citation statements contains approximately 443,000 unique tokens, after the removal of tokens which appear fewer than 5 times in the corpus. We used fastText to generate word-level embeddings for citation statements, using subword information [3]. FastText allowed us to deal with out-of-vocabulary words. We used the fastText model pre-trained on English Wikipedia.
• POS tags: The POS tags of citation statements were represented with a bag-of-words count vector. We considered the top 35 tags by frequency.
• Citation section: We used a one-hot encoding for the 150 most common sections within Wikipedia articles.
• The order of the citation within the article and the total number of words of the article were represented as scalars.
Once the features had been generated, citation statements and their POS tags were further fed to an LSTM of 64 dimensions to create a single representation. All the resulting feature representations were concatenated and fed into a fully connected neural network with four hidden layers, as shown in Figure 3. A final Softmax activation function was applied to the output generated by the fully connected layers, to map the output to one of the three categories of interest. We trained the model for five epochs using a train and test split of 90% and 10% respectively. For training, we used the Adam optimizer [17] and a binary cross-entropy loss. The model's initial learning rate was set to 0.001, and reduced down to a minimum of 0.00001 once the accuracy metric stopped improving.
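The non-neural feature representations described above are simple enough to sketch in plain Python. The function names are hypothetical, and the handling of a zero sum in the normalization is an assumption added to keep the example well defined.

```python
from collections import Counter


def aggregate_char_embeddings(vectors):
    """Sum per-character vectors, then scale so the components sum to one."""
    total = [sum(column) for column in zip(*vectors)]
    norm = sum(total)
    # Assumption: leave the vector unscaled if its components sum to zero.
    return [x / norm for x in total] if norm else total


def one_hot(value, vocabulary):
    """One-hot encode value over a fixed vocabulary (all zeros if unknown)."""
    return [1 if value == entry else 0 for entry in vocabulary]


def pos_count_vector(tags, top_tags):
    """Bag-of-words counts restricted to the most frequent POS tags."""
    counts = Counter(tags)
    return [counts[tag] for tag in top_tags]
```

In the setting described in the text, `vocabulary` would hold the 150 most common section names and `top_tags` the 35 most frequent POS tags; the resulting vectors are what gets concatenated with the learned representations.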

Citation data lookup
The lookup task entails finding a permanent identifier for every citation missing one. We focused on journal articles for this final step, since they make up the bulk of citations to scientific literature found in Wikipedia for which a stable identifier can be retrieved. We used the Crossref API to get DOIs. Crossref allows its API to be queried 50 times per second; we used the aiohttp and asyncio libraries to process requests asynchronously. For each citation query, we get a list of possible matches in descending order of Crossref confidence score. We kept the top three results from each query response.
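The following is a sketch of how such a rate-limited asynchronous lookup could be organized; it is not the released implementation. The HTTP call is abstracted behind a caller-supplied `fetch_json` coroutine (which could wrap an aiohttp session), and the choice of the `query.bibliographic` and `query.author` fields is an assumption about how the title and first author were submitted to Crossref.

```python
import asyncio
from urllib.parse import urlencode

CROSSREF_WORKS = "https://api.crossref.org/works"


def build_query_url(title, first_author=None, rows=3):
    """Build a Crossref works query for a title and optional first author."""
    params = {"query.bibliographic": title, "rows": rows}
    if first_author:
        params["query.author"] = first_author
    return f"{CROSSREF_WORKS}?{urlencode(params)}"


def top_matches(response, k=3):
    """Return (DOI, score) for the k best-scoring items of a response."""
    items = response.get("message", {}).get("items", [])
    ranked = sorted(items, key=lambda item: item.get("score", 0.0), reverse=True)
    return [(item.get("DOI"), item.get("score")) for item in ranked[:k]]


async def lookup_all(citations, fetch_json, max_per_second=50):
    """Query every citation, starting at most max_per_second requests a second.

    fetch_json is a hypothetical coroutine: it takes a URL and returns the
    decoded JSON response as a dict.
    """
    interval = 1.0 / max_per_second

    async def lookup_one(index, citation):
        # Stagger request start times to respect the rate limit.
        await asyncio.sleep(index * interval)
        url = build_query_url(citation["title"], citation.get("author"))
        return top_matches(await fetch_json(url))

    tasks = (lookup_one(i, c) for i, c in enumerate(citations))
    return await asyncio.gather(*tasks)
```

Staggering the start of each request is the simplest way to stay under a per-second limit; a token bucket or semaphore would be a reasonable alternative for long-running jobs.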

Evaluation
In this section we discuss the evaluation of the citation classification and lookup steps.

Classification Evaluation
After training the model for five epochs, we attained an accuracy of 98.32% on the test set. The confusion matrix for each of the labels is given in Table 4. The model is able to distinguish among the three classes very well. The model was then used to classify all the remaining citations from the 29.276 million dataset, that is to say approximately 22.282 million citations. Some examples of results from the classification step are given in Table 6. The resulting total number of citations per class is given in Table 5.

Crossref Evaluation
For the lookup, we evaluated the response of the Crossref API in order to assess how to pick results from it. We tested the API using 10,000 random citations with DOI identifiers, containing 9764 unique title-author pairs. We split this subset 80-20, tried out different heuristics on 80% of the data points and tested the best one on the remaining 20%. Table 7 shows the results for different heuristics, which confirm that the simple heuristic of picking the first result from Crossref works well.
This still leaves open the question of what Crossref confidence score to use. We set the threshold for the confidence score to 34.997, which balances the two metrics in the evaluation with a precision of 70% and a recall of 67.55% (Figure 4).
We finally tested the threshold using the 1953 held-out examples. For 1297 of them the first result carried the correct identifier, and 1246 of these also scored above the threshold; 646 examples gave a different result, of which 521 scored above the threshold; and only 10 requests were invalid for the API. Hence, the first metadata result is the best result from the Crossref API.
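Choosing a confidence threshold amounts to sweeping candidate values and computing precision and recall at each. The sketch below shows one way to do this; the definitions used (precision over accepted results, recall over all evaluated queries, each of which has a known DOI) are assumptions, since the text does not spell them out.

```python
def precision_recall(results, threshold):
    """Compute precision and recall of accepting first hits above a threshold.

    results: list of (confidence_score, is_correct) pairs, one per query,
    describing the first Crossref result for a citation with a known DOI.
    """
    accepted = [is_correct for score, is_correct in results if score >= threshold]
    true_positives = sum(accepted)
    precision = true_positives / len(accepted) if accepted else 0.0
    # Assumption: every query has a true DOI, so recall is over all queries.
    recall = true_positives / len(results) if results else 0.0
    return precision, recall


def sweep(results, thresholds):
    """Return (threshold, precision, recall) for each candidate threshold."""
    return [(t, *precision_recall(results, t)) for t in thresholds]
```

Plotting the output of `sweep` over a range of thresholds gives the kind of precision-recall trade-off curve from which a balanced operating point can be read off.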
The lookup process was performed by extracting the title and the first author (if available) for all the potential journal articles, and querying them against the Crossref API to get the metadata. The top three results from the metadata were taken into account if they existed, and their DOIs and confidence scores were extracted. In total, 260,752 citations were equipped with DOIs using the lookup step, and 320,887 unique DOIs were found relating to these citations.

Table 7: Results for each heuristic tested on 80% of the subset.

Heuristic     Matched   Not matched   Invalid request
1st result    5258      2510          43
2nd result    345       7407          59
3rd result    96        7647          67

Dataset
The resulting Wikipedia Citations dataset is composed of 3 parts:
1. The main dataset of 29.276 million citations from 35 different citation templates, out of which 3.928 million citations already contained identifiers (Table 8), and 260,752 out of 947,233 newly-classified citations to journal articles were equipped with DOIs from Crossref.
2. An example subset with the features for the classifier.
3. Citations classified as journal and their corresponding metadata/identifier extracted from Crossref to make the dataset more complete.

Descriptive analysis
We start by comparing our dataset with previous work, which focused on citations with identifiers [12]. The total number of citations per identifier type is found to be similar (Table 9). Minor discrepancies are likely due to the fact that we do not consider here all the edit history of every Wikipedia page, therefore missing changes between revisions, and that we consider a more recent dump. The total number of distinct identifiers across all Wikipedia, both previously known and newly found, is given in Table 10. Considering that in the Web of Science there are 38,829,128 articles with a DOI (version of March 2020), Wikipedia is citing approximately 3.5% of them (1,347,893). We show in Figure 5 the number of citations to books and journal articles published over the time period 2000 to 2020. This figure highlights how books appear to take longer to get cited in Wikipedia after publication. A similar plot, but considering a much wider publication time span (1500-2020), is given in Figure 6. Most published material cited in Wikipedia dates from 1800 onward. We note that a total of 89,098 journal article citations and 193,336 book citations do not contain a publication year.
Out of the 28 template keys, including the citation itself, most are incomplete. For example, identifiers are present in only 13.42% of citations, whereas URLs are present in 85.25% of citations. This implies that many citations refer to Web content.
Out of 6,069,685 pages on Wikipedia, 407,777 have at least one citation with a DOI, that is about 6.7%; the proportion goes up to 12.84% for pages with at least one ISBN. This higher percentage of pages with DOIs, when compared to previously reported values [24], is in large part due to our newly found identifiers from Crossref, which allowed us to equip with DOIs citations coming from Wikipedia pages with no previous presence of DOIs. Finally, we considered the distribution of distinct DOIs per Wikipedia page and found that most pages have few citations with DOI identifiers, as shown in Figure 7. The top journals are listed in Table 11, and contain well-known mega journals (Nature, Science, PNAS) or other reputed venues (Cell, JBC).

Table 9: Number of citations equipped with identifiers (not including the identifiers through lookup), per type and compared with [12]. Note: a citation might be associated with two or more identifier types.
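The two headline proportions reported in this section can be checked directly from the counts given in the text. The variable names below are ours; the figures are those stated above.

```python
# Counts as reported in the text.
pages_total = 6_069_685
pages_with_doi = 407_777
wos_articles_with_doi = 38_829_128
wos_articles_cited = 1_347_893


def percent(part, whole):
    """Express part as a percentage of whole."""
    return 100.0 * part / whole


share_pages_with_doi = percent(pages_with_doi, pages_total)
share_wos_cited = percent(wos_articles_cited, wos_articles_with_doi)

print(f"Pages with at least one DOI: {share_pages_with_doi:.2f}%")
print(f"Web of Science DOIs cited:   {share_wos_cited:.2f}%")
```

Rounded to one decimal place, these agree with the approximate 6.7% and 3.5% quoted in the text.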

Research opportunities
The Wikipedia Citations dataset can be useful for research and applications in a variety of contexts. We suggest a few here.

Map of Wikipedia sources
What seems to us a low-hanging fruit is a map of Wikipedia sources, following the well-known science mapping and visualization methodologies [38,4,5]. Such work would allow us to comprehensively answer the question of what is cited from Wikipedia, from which Wikipedia articles, and how knowledge is reported and negotiated in Wikipedia. Answering these questions is critical to inform the community work on improving Wikipedia by finding and filling knowledge gaps and biases, while guaranteeing the quality and diversity of the sources Wikipedia relies upon [26,31,14,32,47].

Citation recommendation
Link prediction in general, and citation recommendation in particular, have been explored for Wikipedia for some time [9,29,48]. Recent work has also focused on finding Wikipedia statements where a citation to a source might be needed [35]. Our dataset can further inform these efforts, in particular easing and fostering work on the recommendation of scientific literature to Wikipedia editors.

Citations as features
Citations from Wikipedia can be used as 'features' in a variety of contexts. They have already been considered as altmetrics for research impact [42], while they can also be used as features for machine learning applications such as those focused on improving knowledge graphs, starting with Wikidata [8]. It is our hope that more detail and novel use cases will also lead to a gradual improvement of the first version of the dataset which we release here.

Conclusion
We publish the Wikipedia Citations dataset, consisting of a total of 29.276M citations extracted from 6.069M articles from English Wikipedia. Citations are equipped with persistent identifiers such as DOIs and ISBNs whenever possible. Specifically, we extracted 3.928M citations with identifiers (including DOI, PMC, PMID, and ISBN) from Wikipedia itself, and further equipped an extra 260,752 citations with DOIs from Crossref. In so doing, we were able to raise the number of Wikipedia articles with at least one DOI from less than 5% to more than 6.7% (which is an additional 105,018 pages with a DOI) and found that Wikipedia is citing approximately 3.5% of the journal articles indexed in the Web of Science. We also release all our code to extend upon our work and update the dataset in the future.
A number of limitations are worth highlighting, which also constitute possible directions for future work. First of all, the focus on English Wikipedia can and should be rapidly overcome to include all languages in Wikipedia. Our approach can easily be adapted to other languages, provided that external resources (e.g., language models and lookup APIs) allow for them. Secondly, the dataset currently does not account for the edit history of every citation from Wikipedia. Adding such 'citation versioning' would allow researchers to study knowledge production and negotiation over time. Thirdly, citations are used for a purpose, in a context; our choice to focus on the citation network means that an extension of the dataset could include all the citation statements as well, in order to allow researchers to study the fine-grained purpose of citations. Lastly, the querying and accessibility of the dataset are limited by its size; more work is needed in order to make Wikipedia contents better structured and easier to query [2].
We highlighted a set of possible uses of our dataset, from mapping the sources Wikipedia relies on, to recommending citations and using citation data as features. It is our hope that this release will start a collaborative effort by the community to study, use, maintain and expand work on citations from Wikipedia.

Data availability
The dataset is made available on Zenodo [40] and the accompanying repository contains all code and further documentation to replicate our results: https://github.com/Harshdeep1996/cite-classifications-wiki/releases/tag/0.2.