Abstract
Wikipedia is one of the most visited websites in the world and is also a frequent subject of scientific research. However, the analytical possibilities of Wikipedia information have not yet been analyzed by considering a large volume of pages and a large set of attributes at the same time. The main objective of this work is to offer a methodological framework and an open knowledge graph for the large-scale informetric study of Wikipedia. Features of Wikipedia pages are compared with those of scientific publications to highlight the (dis)similarities between the two types of documents. Based on this comparison, different analytical possibilities that Wikipedia and its various data sources offer are explored, ultimately offering a set of metrics meant to study Wikipedia from different analytical dimensions. In parallel, a complete dedicated data set of the English Wikipedia was built (and shared) following a relational model. Finally, a descriptive case study is carried out on the English Wikipedia data set to illustrate the analytical potential of the knowledge graph and its metrics.
1. INTRODUCTION
On January 15, 2001, Wikipedia was born under the umbrella of Nupedia, an encyclopedia project that was based on a peer review system. Due to Nupedia's lack of agility in publishing articles, Wikipedia was created as a feeder project, as its objective was to make the creation of new articles easier before they were reviewed (History of Wikipedia, 2021). Wikipedia combined in a single project different elements that were new on the web and that made a universal encyclopedia possible for the first time (Reagle, 2009). It grew steadily and was successful enough to make Nupedia disappear in 2 years. Since then, Wikipedia has become one of the most visited websites in the world (https://www.semrush.com/website/top/, accessed August 4, 2022), having 328 different editions, 285 of them having more than 1,000 articles (https://meta.wikimedia.org/wiki/List_of_Wikipedias, accessed August 4, 2022). Although this is the most successful project of the Wikimedia Foundation, there are also other well-known knowledge projects using wikis as a basis (e.g., the Wiktionary dictionary or the Wikidata knowledge base).
Wikipedia has been a disruptive innovation, finding in its open nature and decentralized knowledge development one of its key elements (Olleros, 2008). Not only can everyone access its contents free of charge, but they can also participate in its construction, in a fully transparent process. This social construction of the knowledge can be seen in the differences found among language editions of the same Wikipedia pages (Hara & Doney, 2015). Wikipedia contents are also the result of consensus among editors or Wikipedians. This consensus is built in open discussions in the Wikipedia talk pages (Maki, Yoder et al., 2017; Yasseri, Sumi et al., 2012), open to anyone and capturing transnational debates around Wikipedia contents (Kopf, 2020). Some of these talks and debates have sometimes transcended Wikipedia itself (O’Neil, 2017).
As an online encyclopedia, Wikipedia is not exempt from problems. The reliability of its content has been much debated, as it is based on contributions from anonymous individuals (Olleros, 2008). The quality of Wikipedia pages’ content has been studied numerous times from different perspectives, especially with regard to medical content pages, pointing out limitations, such as occasional incomplete or imprecise information (Adams, Montgomery et al., 2020; Candelario, Vazquez et al., 2017; Weiner, Horbacewicz et al., 2019). The importance of integrating Wikipedia into academia, both in its use and in its development, has been highlighted (Jemielniak, 2019). Social and cultural inequalities have also been pointed out, such as racial and gender gaps in its biographies (Adams, Brückner, & Naslund, 2019; Tripodi, 2021).
Wikipedia is not free of bots and vandalism, although they do not constitute a serious threat to its contents and reliability and Wikipedia’s policy does not allow detrimental use of the activity of bots or automated accounts. Most of the bots on Wikipedia are publicly identified (https://en.wikipedia.org/wiki/Special:ListUsers/bot), and they contribute to improving the content and structure of Wikipedia articles (Arroyo-Machado, Torres-Salinas et al., 2020; Zheng, Albano et al., 2019). Bots also help to control and reduce problems of vandalism and trolls, as they eliminate their harmful edits of articles in advance of human editors. There is also no shortage of proposals for methods based on machine learning to prevent this type of harmful activity (Martinez-Rico, Martinez-Romo, & Araujo, 2019).
In spite of all of these issues, the general idea is that Wikipedia is a transparent and reliable source of encyclopedic information (Lageard & Paternotte, 2021), and valuable in its own right as a subject of scientific research.
1.1. Wikipedia as Source for Informetric Research
Wikipedia has been researched from different scientific perspectives. One of them is informetrics, quantitatively studying the contents and activity generated on Wikipedia. Thus, Wikipedia has been studied from the points of view of scientometrics, bibliometrics, and webometrics, which are discussed in detail below.
Bibliographic references made in Wikipedia have been studied, particularly since the emergence of the notion of “altmetrics” (Priem, Taraborelli et al., 2010), which considered citations on Wikipedia to scientific literature as part of its realm. Wikipedia citations are one of the most popular sources covered in altmetric aggregators (Ortega, 2020; Zahedi & Costas, 2018) such as Altmetric.com, PlumX, or Crossref Event Data. In addition to altmetric data providers, there are also several other open data sources providing extensive metadata on Wikipedia citations (Singh, West, & Colavizza, 2020; Zagorova, Ulloa et al., 2022). Moreover, other proposals, such as Scholia, enable the exploration of bibliographic data at different levels through Wikidata (Nielsen, Mietchen, & Willighagen, 2017). Table 1 presents a summary of previous studies on Wikipedia bibliographic references.
Reference | Type | Application | Data | Methodological approach | Language edition | Topic analyzed
---|---|---|---|---|---|---
Mühlhauser and Oser (2008) | Qualitative | Content and quality analysis | – | Check list | German | Health care
Candelario et al. (2017) | Qualitative | Content and quality analysis | 33 pages | Scoring system | English | Medication
Kaffee and Elsahar (2021) | Qualitative | Analyze the editors’ citation process | – | Survey and interviews | Multilingual | Multidisciplinary
Nielsen (2007) | Quantitative | Analyze citation patterns | 30,368 citations | Descriptive statistics | English | Multidisciplinary
Kousha and Thelwall (2017) | Quantitative | Evaluate the impact of references | 36,191 citations | Descriptive statistics | Multilingual | Multidisciplinary
Lewoniewski, Węcel, and Abramowicz (2017) | Quantitative | References coverage across languages | 6.8 million pages; 41 million citations | Descriptive statistics | Multilingual | Multidisciplinary
Maggio, Willinsky et al. (2017) | Quantitative | Analyze citation patterns | 229,857 pages; 1,049,025 citations | Descriptive statistics | English | Medicine
Pooladian and Borrego (2017) | Quantitative | Evaluate the impact of references | 982 citations | Descriptive analysis | Multilingual | Multidisciplinary
Jemielniak, Masukume, and Wilamowski (2019) | Quantitative | Rank journals by citations | 11,325 pages; 137,889 citations | Citation analysis | English | Medicine
Torres-Salinas, Romero-Frías, and Arroyo-Machado (2019) | Quantitative | Mapping of knowledge structure | 25,555 pages; 41,655 citations | Cocitation analysis | English | Arts & Humanities
Arroyo-Machado et al. (2020) | Quantitative | Mapping of knowledge structure | 193,802 pages; 847,512 citations | Cocitation analysis | English | Multidisciplinary
Colavizza (2020) | Quantitative | Publications coverage | 3,083 ref. pub. | Topic modeling and regression analysis | English | COVID-19
Nicholson, Uppala et al. (2021) | Quantitative | Reviewing citation quality | 1,923,575 pages; 824,298 ref. pub. | Classification modeling | English | Multidisciplinary
Singh et al. (2020) | Quantitative | Data set creation | 4 million citations | Text mining | English | Multidisciplinary
Zagorova et al. (2022) | Quantitative | Data set creation | 6,073,708 pages; 55 million citations | Text mining | English | Multidisciplinary
Kaffee and Elsahar (2021) explored the flow that Wikipedians follow to include references in Wikipedia articles. Kousha and Thelwall (2017), and Pooladian and Borrego (2017) described the problems of Wikipedia citations in performance evaluation. Nicholson et al. (2021) studied the quality of cited references in Wikipedia. Lewoniewski et al. (2017) showed that the different language editions of the same Wikipedia page tended to cite common sources, with the largest overlap between English and German and some differences depending on the topics. Colavizza (2020) studied the coverage of the scientific literature on COVID-19 on Wikipedia, showing that although only a small percentage of the scientific literature on COVID-19 was cited in Wikipedia, it was sufficiently representative of its various topics. Arroyo-Machado et al. (2020) and Torres-Salinas et al. (2019) mapped Wikipedia cocitation patterns, showing fundamental differences in the use of scientific literature in Wikipedia compared to the academic realm. Bould, Hladkowicz et al. (2014), Li, Thelwall, and Mohammadi (2021), and Tomaszewski and MacDonald (2016) studied academic citations in scientific publications to Wikipedia articles, showing that scientific publications also cite Wikipedia content, as well as other digital encyclopedias, especially in areas such as chemistry, physics, or mathematics.
Wikipedia has also been the subject of webometric studies. For example, “Wikiometrics” were proposed as a rating system to rank universities or journals based on the features of their Wikipedia pages, also finding positive correlations with existing academic rankings (Katz & Rokach, 2017). The estimation of the importance of Wikipedia pages based on the PageRank algorithm was also studied, correlating positively with other page-view-based rankings (Thalhammer & Rettinger, 2016). Miquel-Ribé and Laniado (2018) showed that the different language editions of Wikipedia pages reflect cultural differences, as the contents cover local topics corresponding to different linguistic regions. Other studies focused on metrics about the attention generated around Wikipedia articles (e.g., likes or page view counts), showing how they reflect current topics of interest at a particular time/region (Dzogang, Lansdall-Welfare, & Cristianini, 2016; Mittermeier, Roll et al., 2019; Mittermeier, Correia et al., 2021; Roll, Mittermeier et al., 2016; Vilain, Larrieu et al., 2017), and even demonstrating the potential of Wikipedia pages to monitor the spread of diseases (Generous, Fairchild et al., 2014).
There are also numerous studies around Wikipedia’s informetric features. Wilkinson and Huberman (2007) found a correlation between the quality of Wikipedia articles and their number of edits. The relationship between the length of Wikipedia articles and their quality has been highlighted by Blumenstock (2008). Beyond quality, relationships between Wikipedia metrics have also been explored. Previous studies found positive correlations between views and the number of edits and editors (Mittermeier et al., 2021), and weak correlations between the length of Wikipedia pages and the length of their talk pages (Yasseri et al., 2012). Zhang, Ren, and Kraut (2018) suggested the value of using metrics at specific moments of an article’s life cycle; for example, the first 3 months of an article’s life were not the period in which its number of editors was most strongly related to its future quality.
Although, as shown above, there is abundant scientific literature on Wikipedia and its informetric applications, most previous studies tended to focus on either limited sets of metrics (e.g., Nicholson et al. (2021), who focused on the level of quality of scientific publications referenced in Wikipedia articles), or limited data sets (e.g., Mittermeier et al. (2021), who studied a large set of features in a data set of Wikipedia pages of 10,099 bird species across 251 language editions). Thus, a large-scale study of Wikipedia that covers both a large volume of pages and a large set of attributes is still missing in the literature. Arguably, a potential reason for this gap is the lack of a conceptual framework that highlights both the large-scale data available from Wikipedia and the multiple informetric metrics that Wikipedia offers. Such absence has hindered the development of broader research perspectives, especially regarding the relationship of Wikipedia with science, where a contextualization of the relationships between the two is still needed.
In this study, we propose such a framework by means of developing an informetric-inspired knowledge graph, with the aim of enabling similar analytical approaches to those developed in scientometric research. Such a knowledge graph could work as a complement of other Wikipedia knowledge graphs such as Wikidata (https://www.wikidata.org/) or DBpedia (https://www.dbpedia.org/). Wikidata and DBpedia provide exhaustive Wikipedia knowledge graphs but they are more focused on content and semantic relationships, transforming Wikipedia pages into entities (e.g., people, places, music bands) and establishing different computer-understandable relationships between them. Our proposed knowledge graph aims at characterizing the attention and usage of Wikipedia pages using a relational model and incorporating activity metadata that are not present in the semantic graphs of Wikidata and DBpedia, capturing the attention and social engagement, such as views or edits, as well as the presence of scientific literature in Wikipedia pages.
The paper is structured as follows: First, we describe our main objectives and our alignment with recent developments in the field of altmetrics. Second, we describe the informetric features of Wikipedia pages and their similarities with scientific publications, together with the existing data sources for data collection. Several informetric-inspired metrics (Wikinformetrics) are proposed for Wikipedia. Third, a Wikipedia knowledge graph, based on the combination of different Wikipedia data sources, is constructed and presented. Fourth, the data set is explored in a descriptive way to show the analytical possibilities of the knowledge graph and the proposed metrics. Finally, we conclude by discussing our findings and proposing future research avenues.
1.2. Objectives
The main objective of this work is to explore the research value of Wikipedia from an informetric perspective, ultimately providing a complete Wikipedia knowledge graph. More specifically, three different objectives are targeted:
Theoretical objective: To establish a framework for Wikipedia analytics, by exploring the informetric features of Wikipedia pages (composition, categories, sources, data gathering, etc.) and proposing a set of informetric-inspired metrics (Wikinformetrics) for their quantitative study. This objective will help us to map the analytical possibilities of Wikipedia as a scientific object.
Instrumental objective: To create a large open Wikipedia knowledge graph. Once we are familiar with the main features of Wikipedia, we will construct a dedicated knowledge graph focused on the English-language edition of Wikipedia with the main information and data relationships coming from combining different data sources.
Applied objective: To conduct a descriptive quantitative study of Wikipedia metrics based on the knowledge graph data set, and to explore the proposed metrics and the different types of attention they capture.
This work and its objectives align with novel developments on social media metrics (Díaz-Faes, Bowman, & Costas, 2019; Wouters, Zahedi, & Costas, 2019), contributing to the exploration of different science-society interactions that can be captured on Wikipedia (Costas, de Rijcke, & Marres, 2020). Our ambition is to frame Wikipedia as a data source with multiple informetric research possibilities. Furthermore, a dedicated data set of the English edition of Wikipedia is constructed for informetric purposes and is freely available at Zenodo (https://doi.org/10.5281/zenodo.6346899). R and Python were used together for its elaboration, with the scripts available on GitHub (https://doi.org/10.5281/zenodo.6959428). Many of the results presented here are novel, as to the best of our knowledge there is no previous literature that has explored the same large set of Wikipedia features and with the same large-scale perspective as in this study. This work is intended to be useful for a wide range of researchers, such as librarians, informetricians, sociologists, and data scientists.
2. WIKIPEDIA FROM AN INFORMETRIC PERSPECTIVE
2.1. Analogy Between Wikipedia Pages and Scientific Publications
In Wikipedia, the key components are the individual pages. Wikipedia pages are not only used for the publication of encyclopedia articles but also other numerous typologies of pages, such as categories, users, and talk pages, as well as relationships among them. The different types of pages are given by a pre-established namespace (a type of page with special features identifiable through a prefix included in the title). Wikipedia currently has 12 namespaces in use (article, user, Wikipedia, file, mediawiki, template, help, category, portal, draft, timedtext, and module), each with an associated “talk namespace” (or “talk page”) in which discussions are held around the contents and edits of the page, and two virtual namespaces (special and media).
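The namespace structure described above can be illustrated with a minimal Python sketch that assigns a page title to a namespace from its prefix. The numeric identifiers are the ones commonly used by the English Wikipedia, where each talk namespace is the subject namespace number plus one; they are stated here as an assumption and should be checked against the site configuration. The function name `classify_title` and the label "Article" for the unprefixed main namespace are illustrative choices, not part of any Wikipedia API.

```python
# Subject namespaces of the English Wikipedia (assumed numbering; verify
# against the live site configuration before relying on it).
SUBJECT_NAMESPACES = {
    0: "Article", 2: "User", 4: "Wikipedia", 6: "File",
    8: "MediaWiki", 10: "Template", 12: "Help", 14: "Category",
    100: "Portal", 118: "Draft", 710: "TimedText", 828: "Module",
}
# Virtual namespaces have no associated talk namespace.
VIRTUAL_NAMESPACES = {-1: "Special", -2: "Media"}


def build_prefix_index():
    """Map lower-cased title prefixes to namespace numbers."""
    index = {}
    for ns_id, name in SUBJECT_NAMESPACES.items():
        if ns_id != 0:  # the article namespace has no prefix
            index[name.lower()] = ns_id
        talk_name = "Talk" if ns_id == 0 else f"{name} talk"
        index[talk_name.lower()] = ns_id + 1  # talk namespace = subject + 1
    for ns_id, name in VIRTUAL_NAMESPACES.items():
        index[name.lower()] = ns_id
    return index


PREFIX_INDEX = build_prefix_index()


def classify_title(title):
    """Return (namespace_number, page_name) for a full page title.

    Titles whose text before the first colon is not a known prefix
    are treated as articles (namespace 0).
    """
    prefix, sep, rest = title.partition(":")
    key = prefix.replace("_", " ").strip().lower()
    if sep and key in PREFIX_INDEX:
        return PREFIX_INDEX[key], rest.strip()
    return 0, title.strip()


if __name__ == "__main__":
    for t in ["Granada", "Talk:Granada", "Category:Bibliometrics",
              "User_talk:Example", "2001: A Space Odyssey"]:
        print(t, "->", classify_title(t))
```

Titles that merely contain a colon, such as the last one in the demonstration, fall back to the article namespace because their prefix is not a known namespace name.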
There are several features of Wikipedia pages, in particular namespace article pages, for which it is possible to establish an equivalence with those of a scientific publication. First, they have a title and an associated page identifier (Wikipedia page ID). They may have one or more authors, and it is possible to identify who created the page and when, as well as those who have made the largest contributions or whose edits have been reverted. The contents may include multimedia files, links to external resources, and bibliographic references, among others. There are also internal links that enable Wikipedia pages to connect to each other, just like citations among scientific publications. Finally, Wikipedia pages can be classified with categories according to their contents, providing a thematic classification similar to the keywords and classifications applied to scientific publications. Most of these elements can be seen as metadata to be treated in the study of Wikipedia pages. However, there are several differences between Wikipedia pages and scientific publications that cannot be ignored (Table 2). The most important is that Wikipedia pages are a living resource and not static documents. The access and editing of the contents also differ between Wikipedia pages and scientific publications because Wikipedia pages do not target a specific audience (whereas scientific publications mostly target academic audiences), and anyone can take an active part in editing them. It should also be noted that some pages may be temporarily limited or protected for editing (Hill & Shaw, 2015).
Wikipedia element | Description | Wikipedia page | Scientific publication
---|---|---|---
State | Document state condition | Living | Static |
ID | Document identification number | Page ID | DOI, ISBN, URI … |
Name | Title of the document | Title | Title |
Type | Document typologies | Namespace (12 + 12 types) | Paper, proceeding, letter … |
Creation | Date from which it is available | First edition date | Publication date |
Authorship | Responsible for the work | Wikipedians | Authors |
Content | Type of content | Structured text | Structured text |
Language | Language of the resource | Edition dependent | Document dependent |
Discussion | Comments on the contents | Talk | Peer review |
Description | Work summary | Short description | Abstract |
Tags | Terms describing the content | Categories | Keywords |
Media | Audiovisual resources includable | Images, audios, and videos | Images, audios, and videos |
Internal links | Links to the related resources | Internal links | Citations |
Format | Standardized structure and content | Manual of style* | Format guidelines |
Bibliography | References of cited resources | References | References |
Access | Access model | Open | Closed/Open |
Audience | Document target audience | General | Specialized |
* The English Wikipedia has its own manual of style (https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style).
The living nature of Wikipedia pages puts them at the center of a complex system (Ladyman, Lambert, & Wiesner, 2013), whose main elements are represented in Figure 1. Many of the elements of the pages are static or unalterable, such as the creation date or page ID, while others are in constant evolution, especially the contents themselves. This makes it difficult to study certain elements in Wikipedia (Détienne, Baker et al., 2016), as Wikipedia content is volatile and authorship and contribution roles can be diluted in contrast to the higher stability of scientific publications. In addition, the same page, especially encyclopedic articles, may have parallel versions in different language editions of Wikipedia, which may vary in content. This scenario becomes even more complex when taking into account that not only human users are involved in the development of Wikipedia pages but also bots, thus making the interactions that can occur more complex to analyze (Tsvetkova, García-Gavilanes et al., 2017).
2.2. Categorization
Wikipedia pages are not thematically organized according to a controlled language-based classification, such as Britannica’s subject organization system. Instead, Wikipedia pages have a category system that works like a folksonomy (Minguillón, Lerga et al., 2017). Wikipedians are free to tag each page under one or more existing categories or to create new ones. Numerous studies have examined these categories, for example by studying their semantic domains (Aghaebrahimian, Stauder, & Ustaszewski, 2020; Heist & Paulheim, 2019). However, the main problem of this folksonomy is the large number of individual categories and their unstructured (i.e., without a clear hierarchical system) relations at different levels, introducing a lot of noise and making it difficult to have a general thematic view of Wikipedia (Boldi & Monti, 2016; Kittur, Chi, & Suh, 2009). In addition, there are also hidden categories, related to the maintenance or management of the page.
Besides the categories, Wikipedia has other options for accessing and browsing its contents by topics (https://en.wikipedia.org/wiki/Wikipedia:Contents). On the one hand, it offers different curated content lists (e.g., the “list of articles every Wikipedia should have” or the list of “vital articles”). There are other lists that offer collections of articles on the same topic, and even “lists of lists.” Similarly, there are “portals,” which imitate the classic web portals and are organized in sections that group the main contents of a topic, not only the articles (e.g., the “Science” portal or the “History of science” subportal). WikiProjects, communities of Wikipedians aimed at improving Wikipedia content on a specific topic and which have their own page from which they coordinate their activities, can also work as a classification approach due to their thematic orientation (e.g., “Anthropology” or “The Beatles”). There are also third-party classification systems, such as the “Library of Congress Classification” or the “Universal Decimal Classification.” Finally, external to Wikipedia, but within the Wikimedia ecosystem, there are other types of classification solutions, such as Wikidata taxonomies (https://www.wikidata.org/wiki/Wikidata:WikiProject_Taxonomy) or ORES (https://www.mediawiki.org/wiki/ORES), that can be used to identify Wikipedia topics using machine learning techniques. The main limitation with all of the above is that there is no central classification system that covers all Wikipedia pages and that is, at the same time, concise and easy to manage, particularly in terms of the number of subjects and the hierarchical relationships among them. The lack of such a central classification in Wikipedia is a major hindrance for the large-scale epistemic study of Wikipedia.
2.3. Content Control
Each Wikipedia page has an associated discussion space, its “talk page,” where Wikipedians discuss its contents with one another. Talk pages aim at improving the quality and reliability of the articles. Discussions in talk pages are public (Ferschke, Gurevych, & Chebotar, 2012), resembling the model of open peer review of scientific publications (Black, 2008), and representing a form of public review in contrast to the traditional academic blind peer review system (Cummings, 2020). Wikipedia also includes formal peer review approaches in which Wikipedians request assistance from experts on given topics (https://en.wikipedia.org/wiki/Wikipedia:Peer_review). Despite discrepancies and differences about what open peer review means and the different models proposed (Ross-Hellauer, 2017), the three basic principles (open identities, reports, and participation) are clearly recognizable in Wikipedia (Table S2 in the Supplementary material). Wikipedians are both authors and reviewers of content and their reports are available as comments on the talk pages, all of which are always open and identifiable. Interestingly, Wikipedia-inspired reviewing approaches have even been proposed for scholarly publishing, such as the postpublication correction system and readers’ comments (Xiao & Askin, 2014).
Wikipedia also includes a quality control system of the content of the different articles that comes from WikiProjects. It is grounded on an evaluation system to classify pages in higher or lower levels of content quality, with standard grades that are listed on the respective talk page. Although there is a general scheme (Table 3), it is possible that some WikiProjects do not include all grades or that there may be differences in their application. Similarly, the pages are also classified according to their importance within the topic (Top, High, Mid, and Low). Wikipedians can set any level of quality and importance on a given page, as well as modifying them. When there are disagreements among Wikipedians about the quality level of a page, this leads to a discussion and a search for consensus around the quality level of the page. However, at the highest levels of quality (Featured Articles and Good Articles) this assignment requires a stricter review process, including the presentation of a candidacy and an evaluation by independent Wikipedians according to pre-established criteria. These two levels also have their own badges on the article page.
Class | Description | Assignment | Badge
---|---|---|---|
Featured article | The best possible content on Wikipedia, no need for improvement | Review | Yes |
Featured list | The best possible list on Wikipedia, no need for improvement | Review | Yes |
A | Fully addresses the subject and requires only minor improvements | Review | No |
Good article | It satisfies Wikipedia’s main criteria and is close to a professional article | Review | Yes |
B | The content is almost complete and has no major problems | Free | No |
C | The content is considerable, but has significant problems | Free | No |
Start | It includes significant content, but is still in development | Free | No |
Stub | The content is very short and requires substantial work | Free | No |
List | Content displayed in a list linking to Wikipedia articles on a specific topic | Free | No |
2.4. Sources
A fundamental aspect of Wikipedia lies in the system of links that connects its pages to one another, making Wikipedia unique in this sense with regard to other encyclopedic systems (Reagle & Koerner, 2020). These internal links have been studied, showing both the semantic relationships they can establish and other potential utilities (Consonni, Laniado, & Montresor, 2019; Presutti, Consoli et al., 2014), as well as the possibility of calculating network indicators such as PageRank based on them (Thalhammer & Rettinger, 2016). There are, however, important issues to consider when working with links between Wikipedia pages:
The links may be redirects; that is, old page versions that automatically redirect to the new versions when accessing them.
There are lists of links to other Wikipedia pages. Most of the lists include pages that are conceptually related to each other and share a clear subject matter. However, there are specific lists such as disambiguation pages, which are aimed at reducing the ambiguity of some terms (e.g., “citation” or “granada”), and therefore the links in these lists are not necessarily thematically related.
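The two issues above can be illustrated with a minimal sketch that counts outgoing and incoming links after resolving redirects and skipping disambiguation pages. The function names, the input shapes, and the toy titles are assumptions made for illustration; they are not taken from the data set or from any Wikimedia tool.

```python
from collections import defaultdict


def resolve(title, redirects):
    """Follow a redirect chain to its final target, guarding against cycles."""
    seen = set()
    while title in redirects and title not in seen:
        seen.add(title)
        title = redirects[title]
    return title


def link_metrics(links, redirects, excluded=frozenset()):
    """Return (outgoing, incoming) counts of distinct linked pages per page.

    links: iterable of (source, target) title pairs.
    redirects: dict mapping a redirect title to its target title.
    excluded: titles (e.g., disambiguation pages) ignored as source or target.
    """
    out_links = defaultdict(set)
    in_links = defaultdict(set)
    for source, target in links:
        if source in redirects:  # a redirect page is not an article itself
            continue
        target = resolve(target, redirects)
        if source in excluded or target in excluded or source == target:
            continue
        out_links[source].add(target)
        in_links[target].add(source)
    outgoing = {page: len(targets) for page, targets in out_links.items()}
    incoming = {page: len(sources) for page, sources in in_links.items()}
    return outgoing, incoming


# Toy data (hypothetical): two redirects and one disambiguation page.
redirects = {"Citations": "Citation", "Cite": "Citations"}
links = [
    ("Bibliometrics", "Citations"),
    ("Bibliometrics", "Cite"),
    ("Bibliometrics", "Granada (disambiguation)"),
    ("Impact factor", "Citation"),
    ("Citations", "Citation"),
]
outgoing, incoming = link_metrics(
    links, redirects, excluded={"Granada (disambiguation)"}
)
```

Collapsing redirects before counting avoids treating two titles of the same page as two distinct link targets. Whether disambiguation pages should be excluded depends on the research question.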
Another fundamental source for Wikipedia is its bibliographic references. Wikipedia recommends the use of bibliographic references to support its contents and it is an essential requirement for a page to achieve the best quality status (Featured article). These references are the same as those made in scientific publications, in both cases serving as a support for an idea. However, it is necessary to consider that citations in Wikipedia and citations in scientific publications are governed by different norms and dynamics. In Figure 2 the main differences between scientific publications references and Wikipedia references are schematized.
Other relevant particularities of Wikipedia references include:
Unlike scientific publications in which the identity of the citers (i.e., those including the references in the scientific publication) is clear and invariable, in Wikipedia this is more complex (given the live nature of Wikipedia articles) and not always possible. However, there are some methodological proposals for this purpose (Zagorova et al., 2022).
Wikipedia citation counts can be distorted by the translations of articles into different languages, because it is possible to easily transfer the references across the different language versions of the same article, thus distorting the meaning and value of Wikipedia citation counts. This limitation does not occur in scientific publications, as only one language version of a given publication is usually considered in the counting of citations.
There are certain Wikipedia pages that function as large bibliographic indexes, bringing together the most relevant literature on a specific topic (e.g., research annuals or bibliographies).
There are also templates (special Wikipedia pages that are embedded within other pages to facilitate the repetition of information), which are sometimes used to generate pre-established lists of references that are quickly inserted and replicated into numerous Wikipedia pages that are strongly related. This happened, for example, with the listing of lunar crater references (https://en.wikipedia.org/wiki/Wikipedia:Templates_for_discussion/Log/2014_June_8#Template:Lunar_crater_references).
2.5. Data Gathering
There are numerous data sources, and the choice of one or the other depends mostly on the type and volume of data required. In some cases, there are even multiple ways of accessing the same data. These have been summarized in Table 4, but can be found in detail in Section S3 in the Supplementary material. In fact, Wikimedia has a Research community (https://meta.wikimedia.org/wiki/Research) that gathers different resources to help and guide all those people who want to access the data of the Wikimedia projects and that lists the different projects related to it.
Data source | Content | Access | Format | Update frequency | Data quantity* | Type** | Main challenge*** |
---|---|---|---|---|---|---|---|
Wikimedia Dumps | Metadata, page content, and relationships | Offline | XML, SQL | Once/twice a month | Big data | General | Data processing |
MediaWiki and Wikimedia APIs | Metadata, page content, relationships, and statistics | Online | JSON, WDDX, XML, YAML, PHP | Real time | Small data | General | Data recovery |
Wiki Replicas | Metadata, page content, and relationships | Online | SQL | Near-real time | Small data | General | Data recovery |
Event Streams | Real-time logs | Online | SSE, JSON | Real time | – | Specific | Data recovery |
Analytics dumps | Statistics on page views and activity | Offline | TSV | Monthly | Big data | Specific | Data processing |
WikiStats | Statistics on page views, content, and activity | Online | JSON/CSV | Monthly | Small data | Specific | Data recovery |
DBpedia | Contents and semantic relationships | Both | RDF/XML, Turtle, N-Triples, SPARQL endpoint | Live/monthly | – | General | Data recovery |
XTools | Statistics on page views, content, and activity | Online | JSON | Real time | Small data | Specific | Data recovery |
Repositories | Dedicated Wikipedia data sets | Offline | – | – | – | – | – |
Altmetric aggregators | Wikipedia References to publications | Online | CSV/JSON | Daily | – | Specific | Data processing |
* Volume of data to be retrieved and processed.
** Whether the Wikipedia data offered can address different problems (General) or are of a specific nature (Specific).
*** Task that will require more effort when using the data source.
The two main sources are dumps and APIs. One of the main problems when working with Wikipedia data dumps is their size, especially when dealing with the more complete editions (e.g., the revision metadata of the English Wikipedia pages as of June 2022 consists of 27 files of more than 2 GB each), so accessing a subset of data requires a lot of time and effort. When using Wikipedia APIs, metadata can be accessed on demand, but the retrieval process is very laborious, especially when large volumes of data are required. Other sources offer already preprocessed data, such as the total number of edits or page views, which can be consulted from XTools.
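Because of their size, dump files are usually read as a stream instead of being loaded whole. The following is a minimal sketch of that idea using Python's standard `xml.etree.ElementTree.iterparse`; the helper names and the tiny in-memory sample are assumptions for illustration, and this is not the pipeline used to build the data set.

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the XML namespace that MediaWiki exports attach to every tag."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, namespace, is_redirect) for one page at a time."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != "page":
            continue
        fields = {local(child.tag): child for child in elem}
        yield fields["title"].text, int(fields["ns"].text), "redirect" in fields
        elem.clear()  # drop the finished page's content to limit memory use


# Tiny stand-in for a dump file (hypothetical content, real element names).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Citation</title><ns>0</ns><id>1</id></page>
  <page><title>Citations</title><ns>0</ns><id>2</id><redirect title="Citation" /></page>
  <page><title>Talk:Citation</title><ns>1</ns><id>3</id></page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
# Keep only main-namespace pages (ns 0) that are not redirects.
articles = [title for title, ns, is_redirect in pages if ns == 0 and not is_redirect]
```

For a real compressed dump, the stream could be a file object opened with the standard `bz2.open(path, "rb")`, so the file never needs to be decompressed to disk.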
In this paper, we extracted and developed a full Wikipedia knowledge graph with the ambition of facilitating future studies of the English Wikipedia, reducing the time and effort that researchers may need to collect and connect all the different data sources.
2.6. Wikinformetrics
Finally, there are multiple metrics that can be extracted from the sources presented before and that enable the informetric study of Wikipedia pages. Based on previous studies and the above exploration of the informetric characteristics of Wikipedia, several metrics have been selected (Table 5). Each of them is of interest for measuring a particular dimension of the pages. For example, the number of views can be seen as a measure of the impact and outreach of a particular page, and although the numbers of edits and editors reflect the volume of activity, the numbers of talks and talkers are representative of the discussions that take place around these pages. These are not the only metrics that can be obtained from Wikipedia, but they can be considered to capture some of the most important analytical aspects of Wikipedia pages (e.g., contributions, content development, links and interactions, and impact), being also easy to interpret in an informetric framework.
Metric | Analytical dimension | Description |
---|---|---|
Editors | Activity | Number of unique editors that have edited a Wikipedia article |
Edits | Activity | Total number of edits made to a Wikipedia article |
Linked | Connectivity | Number of Wikipedia articles that link to the article |
Links | Connectivity | Number of internal links from a Wikipedia article to other articles |
Age | Description | Years that have passed since the creation of the page to the date of data collection |
Length | Description | Length in bytes of the page |
Talkers | Discussion | Number of unique editors that have edited a Wikipedia article’s talk page |
Talks | Discussion | Total number of edits made to the talk page of a Wikipedia article |
Views | Outreach | Number of daily views of a Wikipedia page |
References | Support | Number of elements listed in the references |
Pub. referenced | Support | Number of publications referenced |
URLs | Support | Number of external links included in a Wikipedia article |
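As an illustration of how the activity and discussion metrics relate to the underlying edit history, the sketch below derives Edits, Editors, Talks, and Talkers from a list of revision records. The record layout `(page_title, is_talk_page, editor)` and the function name are assumptions for this example and do not reflect the schema of the shared data set.

```python
from collections import defaultdict


def activity_metrics(revisions):
    """Aggregate revision records into per-page activity and discussion metrics.

    revisions: iterable of (page_title, is_talk_page, editor) tuples, where
    talk-page revisions are attributed to the article they discuss.
    """
    counts = defaultdict(lambda: {"edits": 0, "talks": 0})
    people = defaultdict(lambda: {"editors": set(), "talkers": set()})
    for title, is_talk, editor in revisions:
        count_key, people_key = ("talks", "talkers") if is_talk else ("edits", "editors")
        counts[title][count_key] += 1          # every revision counts
        people[title][people_key].add(editor)  # each editor counts once
    return {
        title: {
            "edits": counts[title]["edits"],
            "editors": len(people[title]["editors"]),
            "talks": counts[title]["talks"],
            "talkers": len(people[title]["talkers"]),
        }
        for title in counts
    }


# Toy revision history (hypothetical editors).
revisions = [
    ("Homeopathy", False, "Ana"),
    ("Homeopathy", False, "Ana"),
    ("Homeopathy", False, "Ben"),
    ("Homeopathy", True, "Ben"),
    ("Homeopathy", True, "Cy"),
    ("Homeopathy", True, "Ben"),
]
metrics = activity_metrics(revisions)
```

The distinction between total counts (Edits, Talks) and unique contributors (Editors, Talkers) is what separates the paired metrics in each dimension.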
3. WIKIPEDIA KNOWLEDGE GRAPH
Using the different data sources described above, a knowledge graph of the English edition of Wikipedia has been constructed for informetric purposes and freely shared on Zenodo (https://doi.org/10.5281/zenodo.6346899). The English edition of Wikipedia has been chosen because it is the largest one and has an international scope. For its construction, data from Wikimedia and analytic dumps were used, as well as data shared in repositories, specifically the data set of Singh et al. (2020) in which they share references made in Wikipedia articles. The data included in this data set covers all English Wikipedia activity until July 2021, except page views, which are from April 1, 2021 to June 30, 2021, and bibliographic reference data, until May 2020. R and Python have been used together, with the scripts available on GitHub (https://doi.org/10.5281/zenodo.6959428). The construction of this data set is described in Section S1 in the Supplementary material. The resulting data set consists of nine files connected to each other by a relational structure summarized in Figure 3.
This knowledge graph offers numerous possibilities for the informetric study of Wikipedia, making it possible to study new relationships (and interactions) between science and this social medium (e.g., the attention on Wikipedia to academic topics, the presence of scientific literature on popular Wikipedia pages, or the use of scientific literature in Wikipedia pages with large discussions in their Talk pages). This is the case of the work of Arroyo-Machado, Díaz-Faes, and Costas (2022), who found a positive relationship between the research performance of universities and their social attention on Wikipedia, using data from this data set.
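A relational layout makes questions of this kind a matter of joining tables. The sketch below uses an in-memory SQLite database to relate page views to the number of referenced publications; the table names, column names, and toy rows (including the placeholder DOIs) are purely hypothetical and do not correspond to the actual files of the data set.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical schema: one row per page, one row per cited publication.
    CREATE TABLE page (page_id INTEGER PRIMARY KEY, title TEXT, views INTEGER);
    CREATE TABLE page_reference (
        page_id INTEGER REFERENCES page(page_id),
        doi TEXT
    );
    """
)
conn.executemany(
    "INSERT INTO page VALUES (?, ?, ?)",
    [(1, "Page A", 900), (2, "Page B", 500), (3, "Page C", 50)],
)
conn.executemany(
    "INSERT INTO page_reference VALUES (?, ?)",
    [(1, "10.1000/a"), (1, "10.1000/b"), (3, "10.1000/a")],
)

# Most viewed pages together with how many publications each one references.
# The LEFT JOIN keeps pages that reference no publication at all.
rows = conn.execute(
    """
    SELECT p.title, p.views, COUNT(r.doi) AS publications
    FROM page AS p
    LEFT JOIN page_reference AS r ON r.page_id = p.page_id
    GROUP BY p.page_id, p.title, p.views
    ORDER BY p.views DESC
    """
).fetchall()
```

The same join pattern would apply to other relations mentioned above, such as pages and their talk-page activity, under whatever keys the actual files share.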
Although the generation of new versions of the knowledge graph cannot be guaranteed by the authors of this paper, the way in which its creation is detailed and the shared scripts ensure that new versions can be generated. This is also of importance for the generation of new knowledge graphs in other language editions of Wikipedia, as the data used as a basis are also available in other languages. The only limitation in this respect is in the reference data, as they come from a specific data set (Singh et al., 2020). However, those responsible have also shared the tools used to obtain the references and there are other alternatives such as Zagorova et al. (2022) or altmetric data aggregators.
4. CASE STUDY: INFORMETRIC ANALYSIS OF THE ENGLISH WIKIPEDIA
As a case study, the knowledge graph of the English Wikipedia is used to calculate and study the proposed metrics in a broad manner. The analysis was performed in Python and the code is available at GitHub (https://doi.org/10.5281/zenodo.6958972).
4.1. Wikipedia Metrics and Articles’ Content
There are 53,710,529 pages in the English Wikipedia, considering all namespaces as well as pages that are redirects; however, this number is reduced to 6,328,134 pages when the focus is on articles that are not redirects. These represent just 11.78% of the overall English Wikipedia. The metrics proposed in Table 5 have been obtained for all of them.
Figure 4 shows the descriptive statistics of the main variables, differentiating between total Wikipedia articles and those classified based on their quality; 5,522,676 articles (87.27% of the total) are associated with a WikiProject and with some quality level. Articles assigned more than one quality level have been counted in each of the corresponding classes. It is noticeable that in all metrics, Featured articles have the highest values. Class B articles are also noteworthy: they are more numerous than Good and A-Class articles, show few differences with respect to them, and in aspects such as views are positioned above them.
There are important differences in the number of referenced publications, going from an average of 14.27 publications in Featured articles to 8.52 in A and 5.84 in Good articles, while the Start and Stub articles cite on average less than one publication. This reflects compliance with English Wikipedia’s criteria for establishing the quality level of articles. The general criteria do not make explicit the need for a greater number of references to increase the level of quality, among others, but they do require an increase in “reliable sources,” so that citations to publications can serve as a proxy for this. Likewise, it also corroborates previous findings of a relationship between the level of quality and the number of edits (Wilkinson & Huberman, 2007), and the length of articles (Blumenstock, 2008).
Most Wikipedia pages are not of recent creation (Figure 5A), with a median of 11 years. In some of the metrics, such as edits and talks, extreme outliers are found. This can be seen in the fact that their average values are 102 and 9.19, respectively, above the median and third quartile values. This situation is much more pronounced in the case of views, with an average of 3,346.59. Furthermore, the number of referenced elements has a median of 1 and an average of 4.6. When comparing the links with the linked ones, we find that Wikipedia pages link more than they are linked, because the median for the former is 36 and for the latter 15.
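The effect of extreme outliers on the average can be reproduced with a few lines of standard-library Python. The values below are invented toy numbers, not figures from the data set; the point is only that a single very large value can push the mean above the third quartile while the median barely moves.

```python
from statistics import mean, median, quantiles


def summarize(values):
    """Return the mean, median, and third quartile of a list of counts."""
    _q1, _q2, q3 = quantiles(values, n=4, method="inclusive")
    return {"mean": mean(values), "median": median(values), "q3": q3}


# Toy edit counts (hypothetical): nine modest pages and one heavily edited page.
edits = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]
summary = summarize(edits)

# A mean above the third quartile signals a heavily right-skewed distribution.
is_right_skewed = summary["mean"] > summary["q3"]
```

For distributions like these, medians and quartiles describe the typical page better than averages do.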
The correlations between these variables are all positive (Figure 5B). The strongest correlation is between talkers and talks (rs = 0.97), followed by the analogous relationship between editors and edits (rs = 0.94). When considering pairs of metrics of different nature, the strongest correlation is between edits and views (rs = 0.74), followed by that of editors and views (rs = 0.72), which suggests a relationship between the popularity of Wikipedia pages in terms of visits and their number of edits. Interestingly, a lower correlation was found between views and both talks and talkers (rs = 0.48), suggesting that discussions around Wikipedia pages are not necessarily related to higher numbers of views. Another moderate correlation can be found between the length of an article and its views (rs = 0.6), which may indicate that the longer the article, the more attention it receives, or that the more attention it receives, the more it grows in length. There are other moderate correlations, such as between the length and the number of references (rs = 0.56) and URLs (rs = 0.65), but these are to be expected, as the two elements directly affect each other. The number of referenced publications is the most weakly correlated metric, showing, for example, a weak correlation with views (rs = 0.24) and talks (rs = 0.2). Our results confirm the same type of relationships reported in previous research (Mittermeier et al., 2021), albeit this time considering the entire population of English language Wikipedia articles.
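Assuming that rs denotes Spearman's rank correlation coefficient, it can be computed as the Pearson correlation of the ranks of the two variables, with tied values sharing their average rank. The sketch below is a dependency-free illustration of that definition on toy inputs; it is not the code used in the case study.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy example: views grow monotonically (not linearly) with edits.
rs_monotone = spearman([1, 2, 3, 4], [10, 100, 1000, 10000])
```

Because it works on ranks, the coefficient is insensitive to the extreme outliers that metrics such as views and edits contain, which makes it a natural choice for skewed data of this kind.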
4.2. Different Types of Attention Captured on Wikipedia
The results of this analysis can also be accessed interactively and in greater detail via the R Shiny app: https://wenceslao-arroyo-machado.shinyapps.io/wikinformetrics/.
A review of Wikipedia’s main pages based on different metrics reveals its potential to capture content that responds to different types of attention (Table S4 in the Supplementary material). The page views make it possible to identify those topics that capture the most attention of society in a given period—page views are limited to a period of 3 months in our data set. Thus, in our data set the pages of Prince Philip, Duke of Edinburgh (10,860,553 views) and Elizabeth II (9,900,275), or Mare of Easttown (5,995,513) rank among the most visited in the English-language Wikipedia. Also, five of the 20 most viewed pages are series or movies released in the period analyzed, which also highlights that content related to entertainment occupies a relevant position in Wikipedia. Sports also receive many views and reflect current events, as evidenced by the UEFA Euro 2020 page (12,100,455 views), the second most viewed, just after the Main Page (554,030,839). There is a clear presence of articles that respond to general interests, such as the Bible (11,048,609) or Cleopatra (9,516,340) pages. This may indicate that some topics raise general interest and may not be time related.
The number of talks of Wikipedia articles is often used in conjunction with other variables in the construction of models for controversy detection (Jang, Foley et al., 2016). This suggests that this metric may be useful for detecting such controversial content in a simple way. Among the 20 pages with the highest number of talks, those of political figures, religious topics, and scientific controversies stand out. The intense discussion that takes place in some of them, as in Donald Trump (62,944), and the vandalism and presence of trolls, as in Gamergate controversy (27,185), have caused the editing of these pages to be restricted. In fact, there are some articles clearly related to controversial or sensitive issues, such as Climate change (40,837) and Homeopathy (25,898). In this regard, Wikipedia itself offers a page with a curated list of controversial articles (https://en.wikipedia.org/wiki/Wikipedia:List_of_controversial_issues), in which 13 of these 20 pages were listed as of 4 July 2021.
Finally, based on the volume of referenced publications, that is, all materials with an associated identifier (DOI, ISBN, arXiv ID, etc.), it is also possible to identify the Wikipedia pages that cite more scientific publications. However, in this case there are many research annuals and bibliographic pages present among the 20 articles, for example 2018 in paleontology with 569 referenced publications. These lists have been eliminated to select the top 20 articles with encyclopedic content. In these articles there is a clear presence of scientific content, especially in medicine, such as Feminizing hormone therapy (329) and Alzheimer’s disease (277). However, there are also articles related to history, such as History of Lisbon (313) or World War II (264). This may suggest that the metric of the number of publications cited can be used as a proxy to identify Wikipedia articles that are more scholarly oriented.
5. DISCUSSION
In this study we describe how Wikipedia is a complex system, involving numerous actors and elements, and whose rules and governance depend on the community itself (Jemielniak, 2012). It is not only one of the first and clearest examples of Web 2.0 but also one of the few that remains among the most visited websites and has not deviated from its initial objective. Far from that, over the years it has gained the acceptance and trust of many of those who initially looked at it with skepticism.
We describe many similarities between scientific publications and Wikipedia pages. Both have different typologies of documents, structured content, evaluation of content, and use of links and bibliographic references. There are also notable differences. While scientific publications may have limited access and a more specialized audience, Wikipedia’s content and scope are more open and targeted to more general audiences. The live nature of Wikipedia is probably its main distinctive feature when compared to scientific publications. This must be considered when conducting informetric research on Wikipedia. To help in this endeavor, we propose an informetric-inspired conceptual framework with different metrics that pay attention to the different analytical dimensions of Wikipedia, such as article characteristics, outreach, or citations to scientific publications. Some of these metrics have been already explored in the literature, such as page views (Mittermeier et al., 2019, 2021), but never in a comprehensive conceptual framework. The informetric-inspired conceptual framework presented here is expected to be useful for any Wikipedia study involving informetric, scientometric, bibliometric, or webometric perspectives. Similarly, different Wikipedia data sources have been identified and described; their differences in coverage, volume, access, and data processing are crucial aspects when selecting among them.
Alongside the conceptual analytical framework proposed, a knowledge graph of the English edition of Wikipedia has been built and shared openly (https://doi.org/10.5281/zenodo.6346899). The data are gathered under a comprehensive data set that follows a relational model and can be used by anyone interested in the study of this encyclopedia from an informetric point of view. It combines different data sources that allow users on the one hand to characterize any Wikipedia page, while also allowing them to establish relationships between each other (e.g., between two articles, an article and a category or an article and a linked website or a scientific publication referenced in it). Together with the metadata and relations of Wikipedia pages, the data of their bibliographic references are also incorporated, which come from the data set shared by Singh et al. (2020). It is precisely in Wikipedia’s bibliographic reference data where greater efforts are needed so that they can be efficiently accessed through its official sources, such as dumps or the API.
The case study provides a descriptive overview of Wikipedia articles in its English edition, suggesting valuable analytical possibilities and highlighting the relationships and usefulness of the metrics described. The low correlations among most of the metrics suggest that the analytical dimensions measured through them are rather distinct. The potential analytical usefulness of some of the metrics has been highlighted. For example, the number of Wikipedia page views can be seen as a metric of social attention; the number of talks of Wikipedia pages can be seen as a proxy of controversial topics; and the number of scientific references in Wikipedia pages can help identify scholarly-related content. The use of the quality levels derived from WikiProjects has proved to be useful, showing clear differences between the different levels while also providing an overview of the Wikipedia articles.
Finally, it is important to also mention some of the limitations of this work. First, not all possible Wikipedia metrics and their relationships have been explored (e.g., the relationship between pages and users, or the number of users who follow the pages (the so-called watchers), or the number of editions in other languages of a given article). The use of large amounts of data and some specific sources leads to a loss of consistency. For example, the Wikipedia dump process takes several days without blocking the edits during that time, so they are not really a snapshot. This loss of consistency also occurs when using different sources, especially when combining 2021 Wikipedia data with references from a third-party data set published in 2020. The knowledge graph and the case study are based on the English Wikipedia; however, future research should study whether the same relationships found in this study also hold for other languages as well as the existing relationships between language editions.
ACKNOWLEDGMENTS
We thank Mercedes and María for their intellectual advice in the early stages.
AUTHOR CONTRIBUTIONS
Wenceslao Arroyo-Machado: Data curation, Formal analysis, Investigation, Software, Visualization, Writing—original draft. Daniel Torres-Salinas: Funding acquisition, Resources, Validation, Writing—review & editing. Rodrigo Costas: Conceptualization, Methodology, Project administration, Supervision, Writing—review & editing.
COMPETING INTERESTS
The authors have no competing interests.
FUNDING INFORMATION
This work was funded by the Spanish Ministry of Science and Innovation with grant number PID2019-109127RB-I00/SRA/10.13039/501100011033. Wenceslao Arroyo-Machado received an FPU Grant (FPU18/05835) from the Spanish Ministry of Universities. Daniel Torres-Salinas received support under the Reincorporation Programme for Young Researchers of the University of Granada. Rodrigo Costas is partially funded by the South African DSI-NRF Centre of Excellence in Scientometrics and Science, Technology and Innovation Policy (SciSTIP).
DATA AVAILABILITY
The Wikipedia knowledge graph data set is available in Zenodo (Arroyo-Machado et al., 2022).
The source code for constructing the Wikipedia knowledge graph data set is available in Zenodo (Arroyo-Machado, 2022a).
The case study code is available in Zenodo (Arroyo-Machado, 2022b).
Note
Wikipedia references had already been studied for years before the birth of altmetrics, such as in the citation analysis by Nielsen (2007) or, in a more qualitative way, that of Mühlhauser and Oser (2008).
REFERENCES
Author notes
Handling Editor: Vincent Larivière