Abstract
Overlay maps of science are global base maps over which subsets of publications can be projected. Such maps can be used to monitor, explore, and study research through its publication output. Most maps of science, including overlay maps, are flat in the sense that they visualize research fields at one single level. Such maps generally fail to provide both overview and detail about the research being analyzed. The aim of this study is to improve overlay maps of science to provide both features in a single visualization. I created a map based on a hierarchical classification of publications, including broad disciplines for overview and more granular levels to incorporate detailed information. The classification was obtained by clustering articles in a citation network of about 17 million publication records in PubMed from 1995 onwards. The map emphasizes the hierarchical structure of the classification by visualizing both disciplines and the underlying specialties. To show how the visualization methodology can help to obtain both an overview of research and detailed information about its topical structure, I studied two cases: coronavirus/Covid-19 research and the university alliance called Stockholm Trio.
1. INTRODUCTION
To be able to support and manage research activities, there is a need to monitor and study research; for example, to coordinate research, for follow-up investment, or to strengthen collaboration in targeted areas. It is relatively easy to keep track of the research activities of small research units. In contrast, large research units, such as whole universities, may have hundreds or thousands of employees, producing thousands of research publications each year within a broad spectrum of topics. Keeping track of the competences, research areas, and collaborations is a challenging task for such organizations.
Research publications are one important output of research activity. Publications can be studied to monitor research activity and gain insight into aspects such as collaboration patterns, specialization, strong research areas, trends, and development. Overlay maps of science have been proposed to “offer an intuitive way of visualizing the position of organizations or topics in a fixed map” (Rafols, Porter, & Leydesdorff, 2010). Overlay maps are base maps over which subsets of publications or filters can be projected, for example to study the position of organizations or topics, or to highlight properties such as citation impact, open access publishing or clinical research (Kay, Newman et al., 2014).
To this point, most overlay maps have been flat in the sense that they visualize research fields at one single level, commonly the levels of research disciplines or specialties. Such maps generally fail to provide both overview and detail about the research being studied. The aim of this study is to improve overlay maps of science to provide these two features in one single, interactive map, with a focus on the biomedical sciences. The maps created enable users to explore multiple levels of a hierarchical classification in a single interactive visualization.
2. BACKGROUND
Visualizations of science have been around for a long time (for overviews, see Börner, Chen, & Boyack, 2005; Petrovich, 2020; van Eck & Waltman, 2014; Zitt, Lelu et al., 2019). Early work focused on maps restricted to one or a few research areas. Different aspects of areas have been visualized and studied using a variety of entities and relations: for example, copublishing between researchers or organizations, co-occurrence of keywords, and citation relations between publications or journals.
Since the end of the 1990s, maps that cover large parts of the science system (at least within the natural sciences and biomedicine) have been created. Initial maps of science were based on journals and made it possible to get unprecedented overviews of the science system (e.g., Bassecoulard & Zitt, 1999; Boyack, Klavans, & Börner, 2005; Leydesdorff, 2004, 2006; Moya-Anegón, Vargas-Quesada et al., 2004).
Rafols et al. (2010) showed how comprehensive maps of science can be used as base maps, over which overlays can be projected. The idea of such a map is to fix the positions of the nodes in the map, representing for example research fields, so that an overlay projected onto the map can be easily compared with the base map, as well as with other projections. For instance, consider the publication outputs, A and B, of two universities. An overlay map is created by projecting A onto a base map. The size of the nodes is scaled in relation to the distribution of A over research fields. Another map is created based on B, using the same procedure. We can now compare the subject orientation of the two universities by exploring the two maps and comparing node sizes. If we color the nodes of the maps based on some variable, we can analyze different aspects of A and B, such as the amount of open access publishing, citation impact, or degree of international collaboration in different research fields. Compared to maps restricted to particular areas, overlay maps provide context and points of reference, for example by offering the possibility to spot areas in which A and B do not have any research.
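The projection described above can be illustrated with a minimal Python sketch. It assumes that each publication has already been assigned to exactly one field of the base map; the function name, field names, and publication sets are hypothetical and serve only as illustration.

```python
from collections import Counter

def overlay_shares(publication_fields, base_fields):
    """Return each base-map field's share of a publication set.

    publication_fields: one field id per publication in the set.
    base_fields: all field ids of the (fixed) base map.
    """
    counts = Counter(publication_fields)
    total = sum(counts[field] for field in base_fields)
    # Fields without publications keep a share of 0, so gaps stay visible.
    return {field: (counts[field] / total if total else 0.0)
            for field in base_fields}

# Hypothetical base map and publication sets A and B.
base = ["oncology", "nursing", "microbiology"]
set_a = ["oncology", "oncology", "nursing", "oncology"]
set_b = ["microbiology", "nursing"]

shares_a = overlay_shares(set_a, base)
shares_b = overlay_shares(set_b, base)
```

Because both overlays are computed over the same fixed set of fields, the resulting shares can be compared node by node, and a field with a share of zero marks an area in which a university has no publications.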
Since overlay maps were introduced in scientometrics, they have been used in many applications. Kay et al. (2014) used overlay maps to visualize patents by companies; Tang and Shapira (2011) analyzed the growth of U.S. and China copublications in nanotechnology; Klaine, Koelmans et al. (2012) positioned environmental, health, and safety of nanomaterials in relation to general nanotechnology; Leydesdorff, Moya-Anegón, and Guerrero-Bote (2015) created an overlay map based on journal relations using Scopus data and exemplified how the map can be used to explore the publication output of authors, organizational units, or other publication sets; and Rotolo, Rafols et al. (2017) used three case studies to demonstrate the use of overlay maps for strategic intelligence.
Most scientometric studies using overlay maps have visualized maps at one single level and one single entity (e.g., journals or keywords). An exception is the early work by Small (1999) visualizing a hierarchical structure of a set of about 37 thousand documents. The study by Rotolo et al. (2017) includes what they refer to as “cognitive” base maps of research publications at different granularity levels. These maps are based on the Web of Science categories at the broadest level, journals at the meso-level, and MeSH at the most granular level. However, the different granularity levels are presented in different maps and hierarchical relations between levels are not shown.
To provide the possibility to navigate from broad to narrow levels in one single map, I base the map presented in this paper on a hierarchical classification obtained by clustering articles in a citation network.
Publication-level classification at a global level (covering a complete multidisciplinary data source) obtained by clustering articles in a citation network was first implemented by Waltman and van Eck (2012). Compared to classification at the journal level, publication-level classifications can be made more granular. A hierarchical structure can be obtained by merging clusters at lower levels into broader clusters. Publication-level classifications have been used to create overlay maps (RoRI Institute, Waltman et al., 2019). However, the applications are few and lack hierarchical structure, other than node coloring by major research areas.
The validity of the clustering solutions created by clustering articles in citation networks has been contested (Held, Laudel, & Gläser, 2021). There is no ground truth classification and different methodological choices result in different, sometimes equally valid, representations of research delineation (Glänzel & Schubert, 2003; Gläser, Glänzel, & Scharnhorst, 2017; Klavans & Boyack, 2017; Mai, 2011; Sjögårde & Ahlgren, 2018; Smiraglia & van den Heuvel, 2013; Velden, Boyack et al., 2017; Waltman & van Eck, 2012). Nevertheless, the results have been compared to a wide range of baselines and many different applications have been evaluated and compared (Ahlgren, Chen et al., 2020; Boyack, 2017; Boyack, Newman et al., 2011; Boyack & Klavans, 2010, 2018; Donner, 2021; Haunschild, Schier et al., 2018; Sjögårde & Ahlgren, 2018, 2020; Šubelj, van Eck, & Waltman, 2016; Waltman, Boyack et al., 2020).
Including citations that are external to the analyzed data set improves the accuracy of a clustering solution (Ahlgren et al., 2020; Boyack, 2017; Donner, 2021; Klavans & Boyack, 2011). This is an advantage of a global approach, compared to a local one in which a clustering is based on a restricted set of publications. Nonetheless, a local approach may be preferable in some applications because it can emphasize the local, within field, context of research. However, local maps are difficult to compare to other maps. The rationale of overlay maps is to make comparisons between maps. A global approach is therefore most often more useful for the purpose of comparison.
3. DATA AND METHOD
To create a visualization of biomedical research literature that incorporates both overview and detail, I based the visualization on a hierarchical publication-level classification. This classification was obtained by clustering articles in a (direct) citation network of PubMed records. Currently, PubMed indexes over 1 million publications yearly, covering a wide range of biomedical research disciplines. I therefore refer to the classification created as “global,” even though it does not have a comprehensive coverage of all fields of science.
A similar classification was recently created by Boyack, Smith, and Klavans (2020). This classification differs mainly by the choice of similarity measure between publications. Boyack et al. based their classification on a combination of direct citations and textual similarity. By complementing direct citations with textual similarity, they were able to include publications that would otherwise have no relations. Thus, the relation between publications in their approach is a mixture of fundamentally different similarity measures, which makes the interpretation of the classification more difficult. Since the model of Boyack et al. was published, more citations have been made openly available and it is now possible to create citation-based classifications with a more comprehensive coverage. I therefore base my classification strictly on direct citations.
Clustering of publications creates one of several possible representations of the biomedical sciences. The clusters created are internally connected by citations. This way they are self-organized by the use of formal communication. The advantages of this methodology are that it does not rely on predefined categories, the cost is low, high granularity as well as a hierarchical structure can be obtained, and the assignment of individual publications can be made without subjective choices. Nevertheless, subjectiveness is still present in, for example, the choice of publication–publication relation and parameter values, in particular the value of the resolution parameter. The value of the resolution parameter used in this study was guided by previous work (Sjögårde & Ahlgren, 2018, 2020); nonetheless, arbitrariness in this choice is still unavoidable. Another disadvantage is the creation of disjoint clusters where each publication is assigned to exactly one cluster. Forcing publications into one cluster is practical and facilitates interpretation but also means that information is lost. Naturally, publications can address multiple concepts and be multidisciplinary in nature. Classification obtained by clustering does not represent such characteristics in itself. However, to enable analyses of transverse structures, the classification can be complemented with other sources, such as citation relations or the Medical Subject Headings (MeSH).
A comprehensive discussion about which methods (such as the choice of publication–publication relation or choice of clustering algorithm) to prefer when clustering publications is out of scope of this paper. The focus of this paper is on how to improve the interpretability of overlay maps of science by providing possibilities to navigate between overview and details in a map, and by taking advantage of the hierarchical structure of a classification. I leave the discussion about how best to obtain classifications to future work. Nevertheless, the visualization method presented in this paper may help to evaluate classifications by making them easier to navigate and interpret. In this way the paper may also contribute to the understanding of what kind of clusters are created by the use of clustering in citation networks. The visualization methodology may be applied to any hierarchical classification and can also be delimited to fewer levels in the classification.
In this paper I present two examples of maps incorporating possibilities to navigate the hierarchical structure of a classification. These maps are publicly available and are limited to the presented cases and the base map of science. It is out of scope of this study to provide a web tool that can be used to explore other overlays.
Figure 1 illustrates the process used to obtain the classification from the citation network and to create a base map from the classification. The base map incorporates features to emphasize the hierarchical structure of the classification. In Section 3.1 I describe the data and process to obtain the classification, and in Section 3.2 I describe the process to create the base map from the classification.
3.1. Obtaining the Classification
I used PubMed data to create a classification of publications in four levels based on citation relations from the NIH Open Citation Collection (Hutchins, Baker et al., 2019; iCite, Hutchins, & Santangelo, 2019). I used the bibliometric system at Karolinska Institutet for the analysis. The system contains PubMed data from 1995 onwards. Data were extracted in February 2022 (version 28 of the NIH Open Citation Collection) and were restricted to the publication types “article” and “review”: about 18.6 million publications with about 462 million direct citation relations. In the remainder of this paper, I use the term publication to refer to both articles and reviews.
Except for some modifications, I obtained the classification using the methodology put forward in Waltman and van Eck (2012). In accordance with this methodology, direct citation relations were used to create a network. Citation relations were normalized in relation to each publication’s total number of citation relations. The Leiden algorithm (Traag, Waltman, & van Eck, 2019) was used to obtain a partitioning of the publications¹. To get clusters of substantial size, I restricted the cluster size to a minimum of 50 publications by reclassifying publications in clusters below this minimum size using the method provided in the software. The resolution parameter was calibrated to obtain clusters of about the same size as in Sjögårde and Ahlgren (2018) for the corresponding publication years, resulting in 63,575 clusters. Thereby, clusters approximately correspond to research topics, and I refer to clusters at this level as such.
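The normalization step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: it assumes that citation relations are treated as undirected and binary, and that the weight of a relation from publication i to publication j is 1 divided by the total number of relations of i. The function name and the publication identifiers are hypothetical.

```python
from collections import defaultdict

def normalize_relations(citations):
    """Weight each relation by the focal publication's number of relations.

    citations: iterable of (citing_id, cited_id) pairs.
    Returns {(i, j): weight}, where weight = 1 / (number of relations of i).
    Assumption: relations are undirected and binary.
    """
    neighbours = defaultdict(set)
    for citing, cited in citations:
        if citing == cited:
            continue  # ignore self-citations of a record to itself
        neighbours[citing].add(cited)
        neighbours[cited].add(citing)
    weights = {}
    for i, linked in neighbours.items():
        for j in linked:
            weights[(i, j)] = 1.0 / len(linked)
    return weights

# Hypothetical example: p1 has three relations, p2 has one.
weights = normalize_relations([("p1", "p2"), ("p1", "p3"), ("p4", "p1")])
```

Under this assumption the weights are asymmetric: a relation counts for more from the perspective of a publication with few relations than from the perspective of a publication with many.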
Topics were clustered into larger groups based on their summed relatedness, normalized for cluster sizes (Eq. 4 in Boyack et al., 2020). I calibrated the resolution parameter to obtain clusters of about the same size for corresponding publication years as obtained in Sjögårde (2020). Thereby, clusters approximately correspond to research specialties. A minimum cluster size of 500 publications was used at this level, which resulted in 1,602 specialties².
Specialties were clustered into larger clusters. My intention was to create clusters of approximately the size of other broad classifications, such as Web of Science journal categories (about 250 clusters), Science Metrix journal classification (180 clusters at the “subfields” level) and Scopus Subject areas (about 330 clusters). Because the classification only includes biomedicine, I aimed for a smaller number of clusters. I tested several different values of the resolution parameter. Values that were too low merged specialties with seemingly weak relatedness into coarse clusters, while values that were too high left many specialties unmerged. I finally chose a solution with 131 clusters, after restricting cluster sizes to at least 100,000 publications. I refer to this level as research disciplines. These clusters represent fields of science held together by citations. Thus, only the formal communication through citations has been taken into account. These higher level clusters do not necessarily coincide with disciplines as social and organizational structures (Hammarfelt, 2019) but may help to shed light on the organization of science through its communication.
The disciplines were grouped into 22 broader research areas. It is particularly difficult to obtain good labels at this level, because terms extracted from bibliographic fields tend to be too narrow. For this reason, the research areas are not displayed in the visualization. The research areas are used to color sibling disciplines.
To create labels, I used the procedure proposed in Sjögårde, Ahlgren, and Waltman (2021). Noun phrases were extracted from article titles, MeSH, journal titles, and author addresses. A noun phrase was operationalized as a sequence of adjectives and nouns, ending with a noun (van Eck, Waltman et al., 2010a). A Java program was written for this purpose and the Stanford CoreNLP software was used for data mining (Manning, Surdeanu et al., 2014), in particular the lemmatizer and the Part-of-Speech tagger (Toutanova, Klein et al., 2003; Toutanova & Manning, 2000)³. The relevance of terms to clusters was calculated using term frequency to specificity ratio (TFS; Sjögårde et al., 2021). TFS balances term frequency and term specificity to obtain terms that are both frequent in a cluster and specific to the cluster. For each cluster, the three terms with the highest TFS value were concatenated into a label. Seven more terms are listed when clicking on a node in the visualization. I used article titles and MeSH to create labels at the topic level (α = 0.33 was used for the TFS calculation). I used article titles, MeSH, and journal titles at the specialty level (α = 0.5) and journal titles and author addresses at the discipline level (α = 0.67).
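The operationalization of a noun phrase as a sequence of adjectives and nouns ending with a noun can be sketched as follows. The study used a Java program together with Stanford CoreNLP; the Python sketch below is only an illustration that starts from tokens that are already lemmatized and tagged. It assumes Penn Treebank style tags (prefixes `JJ` for adjectives and `NN` for nouns), and it returns only maximal runs, not their subphrases. The function name and the example tokens are hypothetical.

```python
def noun_phrases(tagged_tokens):
    """Extract maximal adjective/noun sequences that end with a noun.

    tagged_tokens: list of (token, tag) pairs with Penn Treebank style tags
    (an assumption of this sketch).
    """
    phrases, run = [], []

    def flush():
        # Trim trailing adjectives so that the phrase ends with a noun.
        while run and not run[-1][1].startswith("NN"):
            run.pop()
        if run:
            phrases.append(" ".join(token for token, _ in run))
        run.clear()

    for token, tag in tagged_tokens:
        if tag.startswith("NN") or tag.startswith("JJ"):
            run.append((token, tag))
        else:
            flush()  # any other tag ends the current sequence
    flush()
    return phrases

# Hypothetical, already lemmatized and tagged title.
tagged = [("cutaneous", "JJ"), ("melanoma", "NN"), ("in", "IN"),
          ("young", "JJ"), ("adult", "NN"), ("be", "VB"), ("rare", "JJ")]
phrases = noun_phrases(tagged)
```

In this example the trailing adjective “rare” is discarded because it is not followed by a noun.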
3.2. Creating the Base Map
As a basis for the map, I created a network of specialties. In the following step I contracted each subnetwork of sibling specialties and positioned the parent discipline on top of this subnetwork. A list of topics was created for each specialty. This list is displayed when clicking a node. In the following I describe the steps in detail.
3.2.1. Specialty level
I created a list of specialties with the attributes shown in Table 1. For the purpose of illustration, the table presents values for an example specialty. The size attribute was calculated as the square root of the number of publications. This makes the area of each node proportional to the number of publications. If a cluster contained no more than 500 publications, a hyperlink was created to the underlying publications (“Get list in PubMed”). If a cluster contained 501–5,000 publications, a separate hyperlink was created for each batch of 500 publications (1–500, 501–1,000, etc.). If a cluster contained more than 5,000 publications, no hyperlinks were provided and the text “Too many publ.” was shown instead. This was done because of restrictions on hyperlink length. The underlying topics were listed in the column “Children.” Labels and numbers of publications for the topics were concatenated into a list. The interactive map contains hyperlinks to the underlying publications for each of the topics (with the same restriction to 5,000 publications). Specialty nodes were colored according to their cluster at the top level (research areas).
Attribute | Value |
---|---|
id | l2.229 |
label | skin neoplasm; cancer; nevus |
size | 30.4 |
color | rgba(116,200,0,1) |
Additional terms | nevus; malignant melanoma; surgical oncology; sentinel lymph node biopsy; cutaneous melanoma; raf; oncogene proteins b |
Level | Specialty |
Parent | dermatology; melanoma; skin |
# Publ. | 3707 |
Get list in PubMed | 1–500, 501–1000, 1001–1500, 1501–2000, 2001–2500, 2501–3000, 3001–3500, 3501–3707 |
Children | |
x | −214.7 |
y | −590.5 |
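The rules for node size and for the “Get list in PubMed” hyperlinks can be sketched in Python as follows. The thresholds are taken from the description above; the function names are hypothetical, and the sketch returns only the batch ranges, not the hyperlinks themselves.

```python
import math

BATCH_SIZE = 500    # publications per hyperlink
MAX_LINKED = 5000   # above this, no hyperlinks are created

def node_size(n_publications):
    """Square root of the publication count, so node area scales with the count."""
    return math.sqrt(n_publications)

def link_batches(n_publications):
    """Return (first, last) ranges of 500 publications, or the fallback text."""
    if n_publications > MAX_LINKED:
        return "Too many publ."
    return [(start, min(start + BATCH_SIZE - 1, n_publications))
            for start in range(1, n_publications + 1, BATCH_SIZE)]

# The example specialty in Table 1 has 3,707 publications.
batches = link_batches(3707)
size = node_size(3707)
```

For the example specialty this gives eight ranges, the last being 3501–3707, as in Table 1. The square root of 3,707 is about 60.9; half of that value is about 30.4, which would agree with the size listed in Table 1 if the value shown there has been halved for display.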
The ForceAtlas algorithm (Jacomy, Venturini et al., 2014) was used to create a layout based on the normalized direct citation value between the specialties (the same relatedness value used for clustering)⁴. The algorithm resembles a physical system in which nodes repel each other and edges attract the nodes. The magnitude of the attraction is relative to the weight of the edges. For each specialty, edges were restricted to the 20 having the highest relatedness values. This was done to improve efficiency.
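The restriction of edges to the strongest relations of each specialty can be sketched as follows. The sketch assumes that an edge is kept if it belongs to the k strongest edges of at least one of its two endpoints; the text does not state how such cases were handled, so this is an assumption, and the function name and example data are hypothetical.

```python
import heapq
from collections import defaultdict

def top_k_edges(edges, k=20):
    """Keep, for every node, its k edges with the highest relatedness.

    edges: iterable of (source, target, weight) tuples.
    Assumption: an edge survives if either endpoint ranks it among its top k.
    """
    incident = defaultdict(list)
    for source, target, weight in edges:
        incident[source].append((weight, source, target))
        incident[target].append((weight, source, target))
    kept = set()
    for node_edges in incident.values():
        for weight, source, target in heapq.nlargest(k, node_edges):
            kept.add((source, target, weight))
    return kept

# Hypothetical network, pruned to the single strongest edge per node.
edges = [("a", "b", 0.9), ("a", "c", 0.5), ("a", "d", 0.1), ("b", "c", 0.2)]
kept = top_k_edges(edges, k=1)
```

In the example the edge between b and c is dropped because neither b nor c ranks it as its strongest relation, while the weak edge between a and d is kept because it is the only relation of d.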
3.2.2. Discipline level
The same attributes were calculated for disciplines as for specialties. Underlying specialties were listed in the attribute “Children” in the case of disciplines. As for specialties, only the 20 relations with the highest relational strength were kept.
The sizes of the specialty nodes were rescaled by dividing by 2. This was done for better readability of the visualization. The discipline nodes were made partly transparent in order not to hide the underlying specialties.
3.2.3. Adjusting specialties subnetworks
The chosen visualization approach emphasizes the delineation of publications obtained by the clustering algorithm. Furthermore, it emphasizes the hierarchical structure obtained by clustering lower level clusters into higher level clusters. The benefit of this approach is that a more easily interpretable overview can be provided. However, the approach comes with a cost. At the level of specialties, the layout contracts siblings and thereby underemphasizes relations between different areas.
A map without adjustment (α = 1) may represent relations between a node and all its relations somewhat better. However, no network visualization layout solves this problem entirely because they are all limited to a two- (or three-) dimensional Euclidean space. Given the many and diverse relations that each cluster has, it is only possible for the layout algorithm to find a best possible solution in such a space. This solution must emphasize some relations (by positioning the nodes) at the expense of others.
I visualized the networks using the sigma.js package created with the “SigmaExporter” plugin for the visualization software Gephi⁵. R was used to create networks and other files necessary for the visualization (a json network file⁶, a json configuration file⁷, and an html file). The package includes the possibility to search for nodes. Note that this search feature is restricted to searching in node labels to identify nodes. It cannot be used to restrict the map to a set of publications or nodes.
The file size of the full base map is large due to the large amount of data in hyperlinks. To decrease loading time, I restricted the online version of the map to the publication years 2020–February 2022, showing about 2.7 million publications. Below I refer to this map as the base map. As a result of this restriction, the map shows the current (or most recent) state of the biomedical literature.
4. RESULTS
In this section I demonstrate how the base map provides both overview and detail by visualizing the hierarchical structure of the classification on which it has been built. I then present two cases that show how the map can be used to enrich the study of research activities: coronavirus/Covid-19 research and its historical roots, and the subject orientation of Stockholm’s three largest universities, part of the “Stockholm Trio” university alliance. Table 2 lists the URLs for all maps presented in this section.
Map | URL |
---|---|
Base map | https://petersjogarde.github.io/papers/hiervis/base/index.html |
Covid-19 publications | https://petersjogarde.github.io/papers/hiervis/covid_v2/pubs/index.html |
Publications cited from Covid-19 publications | https://petersjogarde.github.io/papers/hiervis/covid_v2/cited/index.html |
KTH Royal Institute of Technology | https://petersjogarde.github.io/papers/hiervis/sthlm_trio/kth/index.html |
Stockholm University | https://petersjogarde.github.io/papers/hiervis/sthlm_trio/sthlm_univ/index.html |
Karolinska Institutet | https://petersjogarde.github.io/papers/hiervis/sthlm_trio/ki/index.html |
4.1. The Base Map
Figure 2 is a screenshot of the interactive base map that is available online.
Disciplines and their underlying specialties have been colored by clusters at the level above disciplines. The map shows clusters oriented from biophysics and biochemistry at the bottom right to social, psychological, and healthcare aspects of medicine at the bottom left.
The bottom left side includes health profession-related research: nursing, psychology, medical informatics, and public health.
At the top left side, we find disciplines with a clinical focus, including, for example, neurosurgery, gastroenterology, dentistry, pathology, obstetrics, cardiology diseases, and a variety of cancers and treatment thereof.
The disciplines at the top middle of the map are focused on cell and molecular medicine, including research on human proteins, transcription factors, immunology, stem cells, DNA and RNA, etc. Several clinical areas are strongly connected to the cell and molecular disciplines and are positioned between the top middle and the top right, including, for example, some transplantation, oncology, and rheumatology.
A group of life science disciplines are located to the right of the cell and molecular disciplines, including biology, microbiology, biochemistry, biotechnology, environmental sciences, and environmental engineering. Note that PubMed primarily covers biomedicine and life sciences and does not have full coverage of, for example, environmental sciences.
It is not easy to compare the base map with other maps of science, such as the RoRI map of funding landscape (RoRI Institute et al., 2019) and the PubMed model by Boyack et al. (2020), because the maps display clusters at different levels of aggregation and these other maps do not provide much possibility for overview. Nonetheless, it is clear that the base map presented in this paper and the two mentioned maps all position clinical research at one end of the map and more basic research at the other end of the map. All three maps have areas oriented towards natural sciences, biophysics, and biochemistry at the basic end of the map. They also have cell and molecular science positioned close to these areas as well as an area of infectious diseases. Areas of research oriented towards healthcare and health professions are positioned furthest from the technical side in all maps. The maps seem to be rather similar at this overall level.
The zooming feature is displayed in Figure 3, showing specialties in “dermatology; melanoma; skin.” The figure reveals major specialties addressing skin cancer, psoriasis, allergy, acne, and hair loss. By looking at the node sizes we can estimate the relative size of these fields. Skin cancer and psoriasis are the two largest nodes. Some nodes are about half the size of these large nodes, for example hair loss (“alopecia; hair; hair follicle”) and allergy (“atopic dermatitis; pruritus; dog diseases”), while others are very small, for example pemphigus (“pemphigus; bullous pemphigoid; pemphigus vulgaris”).
Figure 4 shows the specialty “skin absorption; pharmaceutic; drug delivery systems,” which is clustered together with other skin-related specialties but has strong relations with pharmaceutical specialties located at the other end of the map, in particular “pharmaceutic; pharmaceutical science; excipient.”
Clicking on a specialty gives the user further information. This feature is exemplified in Figure 5, in which the information panel for the skin cancer specialty (“skin neoplasm; cancer; nevus”) is displayed. The information panel reveals subtopics addressing treatment, the use of artificial intelligence to detect skin cancers, medication, imaging and behavior, and risk factors. Hyperlinks make it possible to retrieve the publications underlying each topic in PubMed.
4.2. Coronavirus
To create a map of research related to the coronavirus pandemic that started in late 2019, I used the search query in Table 3. The query has been designed by the library at Karolinska Institutet to get publications both about the disease (COVID-19) and the virus causing the disease (SARS-CoV-2).
Covid*[tw] OR nCov[tw] OR 2019 ncov[tw] OR novel coronavirus[tw] OR novel corona virus[tw] OR " Covid-19"[All Fields] OR "Covid-2019"[All Fields] OR "severe acute respiratory syndrome coronavirus 2"[Supplementary Concept] OR "severe acute respiratory syndrome coronavirus 2"[All Fields] OR "2019-nCoV"[All Fields] OR "SARS-CoV-2"[All Fields] OR "2019nCoV"[All Fields] OR (("Wuhan"[All Fields] AND ("coronavirus"[MeSH Terms] OR "corona virus"[All Fields] OR "coronavirus"[All Fields])) AND (2019/12[PDAT] OR 2020[PDAT] OR 2021[PDAT])) |
The search query resulted in 145,089 articles from 2019 until February 2022. The COVID-19/SARS-CoV-2 map (Figure 6)⁸ shows that most research related to the pandemic fuses into one discipline (“covid; cov; sars”). Nonetheless, almost half of the publications retrieved from the search query are distributed over a wide range of other specialties, including specialties in psychiatry and mental health (“psychiatry; health; mental disorders”), nursing and other health profession-related research (“nursing; education; surgery”), remote healthcare (“telemedicine; radiology; internet”), immunology (“lymphocyte; immunology; hiv”), and infectious diseases (“infectious disease; microbiology; staphylococcal infection”).
Zooming into the largest coronavirus node (Figure 7) reveals some major topics addressing imaging of the lungs (“tomography; x ray; chest ct”), pregnancy (“infectious pregnancy complication; pregnancy; pregnant woman”), thromboembolism (“venous thromboembolism; pulmonary embolism; anticoagulant”), characteristics of the virus (“variant; coronavirus spike glycoprotein; concern”), and effects on the neurological system (“nervous system diseases; guillain; barre syndrome”).
By creating a map of research cited by the coronavirus research (Figure 8), we get a picture of the research upon which the coronavirus research has been built. Node sizes are relative to the total number of publications cited by the set of coronavirus research publications, and the colors reflect the clusters’ average number of citations from this set. The map shows that coronavirus research is based on a wide range of areas, for example:
knowledge from previous coronavirus epidemics (topics addressing SARS and MERS in the specialty “coronavirus infection; covid 19; viral pneumonia”);
research on mRNA vaccines (the topic “messenger rna; mrna; mrna vaccine” in “dna; transfection; gene transfer technique”);
research on drug targeting (“drug repositioning; drug target interaction; drug target interaction prediction” in “chemical information; modeling; drug design”); and
protein structure modeling (the topic “sars; cov; sars cov 2” in the specialty “hla; allele; tissue antigen”).
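The two node attributes of the cited-research map (size and color) can be derived from a list of citation pairs. The sketch below is one plausible reading of the description above, not the study's implementation; the function name, the input structures, and the toy data are hypothetical, and each (citing, cited) pair is assumed to occur only once.

```python
from collections import Counter, defaultdict


def cited_overlay(citations, cluster_of):
    """Compute size and color values per cluster of cited publications.

    citations:  iterable of (citing_id, cited_id) pairs, where every
                citing publication belongs to the focal set.
    cluster_of: mapping from cited_id to a cluster label.
    Returns {cluster: {"size": distinct cited publications,
                       "color_value": mean citations per cited publication}}.
    """
    times_cited = Counter(cited for _, cited in citations)
    totals = defaultdict(lambda: {"n_cited": 0, "n_citations": 0})
    for cited, n in times_cited.items():
        cluster = cluster_of.get(cited)
        if cluster is None:  # cited publication not in the classification
            continue
        totals[cluster]["n_cited"] += 1
        totals[cluster]["n_citations"] += n
    return {
        cluster: {
            "size": t["n_cited"],
            "color_value": t["n_citations"] / t["n_cited"],
        }
        for cluster, t in totals.items()
    }


# Toy example with hypothetical IDs and cluster names.
citations = [("a", "x"), ("b", "x"), ("a", "y"), ("c", "z")]
cluster_of = {"x": "vaccines", "y": "vaccines", "z": "imaging"}
overlay = cited_overlay(citations, cluster_of)
```

Here the "vaccines" cluster contains two cited publications that receive three citations in total, so its size is 2 and its color value is 1.5.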
4.3. Stockholm Trio
Stockholm Trio is a university alliance in Stockholm including the city’s three large universities: KTH Royal Institute of Technology (KTH), Karolinska Institutet (KI), and Stockholm University. The universities have fundamentally different subject orientations: KTH is a one-faculty technical university, Karolinska Institutet is a one-faculty medical university, and Stockholm University has several faculties within the humanities, social sciences, and natural sciences. Mapping of the biomedical research at the three universities may help management to find areas of potential collaboration.
To analyze the publication output of the three universities, I created one map for each university. The maps were delimited to PubMed (i.e., the biomedical area) and to 2019–2021. The publications authored by researchers at the three universities of the Stockholm Trio were identified using search queries. Table 4 shows the search queries used to identify publications by researchers at the three universities in KI’s internal version of PubMed. A minor part of the publications by each university was not captured by these rather few and simple search queries. However, the coverage is sufficient for the purpose of illustration.
University | Search queries |
---|---|
KTH | '%KTH%Sweden%' |
 | '%royal%inst%tech%sweden%' |
 | '%kungliga tekniska%sweden%' |
Karolinska Institutet | '%karolinska%sweden%' |
 | '%university hosp%huddinge%sweden%' |
 | '%university hosp%solna%sweden%' |
 | '%danderyd hosp%sweden%' |
 | '%s_dersjuk%sweden%' |
 | '%stockholm county council%sweden%' |
Stockholm University | '%stockholm univ%sweden%' |
 | '%university of stockholm%sweden%' |
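The search queries use SQL `LIKE`-style wildcards, where `%` matches any sequence of characters and `_` matches exactly one character (for example, the "ö" in "Södersjukhuset"). The sketch below translates these patterns to regular expressions to show how an affiliation string would be assigned to a university. It is an illustration only: the actual database and its matching rules are not described in the text, case-insensitive matching is an assumption, the last two patterns are taken to target Stockholm University, and the function names and example affiliations are hypothetical.

```python
import re


def like_to_regex(pattern):
    """Translate a SQL LIKE pattern into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")   # any sequence of characters
        elif ch == "_":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
    # Case-insensitive matching is an assumption about the database.
    return re.compile("".join(parts), re.IGNORECASE | re.DOTALL)


PATTERNS = {
    "KTH": [
        "%KTH%Sweden%",
        "%royal%inst%tech%sweden%",
        "%kungliga tekniska%sweden%",
    ],
    "Karolinska Institutet": [
        "%karolinska%sweden%",
        "%university hosp%huddinge%sweden%",
        "%university hosp%solna%sweden%",
        "%danderyd hosp%sweden%",
        "%s_dersjuk%sweden%",
        "%stockholm county council%sweden%",
    ],
    "Stockholm University": [
        "%stockholm univ%sweden%",
        "%university of stockholm%sweden%",
    ],
}


def match_universities(affiliation):
    """Return the sorted list of universities whose patterns match."""
    return sorted(
        university
        for university, patterns in PATTERNS.items()
        if any(like_to_regex(p).fullmatch(affiliation) for p in patterns)
    )


examples = [
    "KTH Royal Institute of Technology, Stockholm, Sweden",
    "Department of Medicine, Södersjukhuset, Stockholm, Sweden.",
    "Department of Psychology, Stockholm University, Stockholm, Sweden",
    "Uppsala University, Uppsala, Sweden",
]
results = {affiliation: match_universities(affiliation) for affiliation in examples}
```

Patterns this simple trade recall and precision for brevity, which is consistent with the remark that a minor part of each university's output is missed.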
Figure 9 displays snapshots of the maps for KTH (A), Stockholm University (B) and Karolinska Institutet (C) for the publication years 2019–2021.
The KTH map (Figure 9A) shows about 2,400 biomedical publications, which is a minor part of KTH’s total publication output. It shows a clear focus on the technical areas to the right of the map, including materials science, biophysics, and biochemistry (disciplines “pharmacy; nanoparticle; engineering,” “chemistry; biochemistry; chemical engineering,” and “technology; physics; engineering”). There is also a relatively large number of publications in environmental engineering (“technology; environmental engineering; environment”) and biotechnology (“biotechnology; technology; biochemistry”). At the top middle part of the map we find KTH publications in disciplines focusing on cell and molecular medicine (“microrna; dna; biochemistry” and “lymphocyte; immunology; hiv”). There are also publications in some clinical areas (e.g., “cardiology; heart; heart failure,” “neurosurgery; radiology; stroke,” “orthopaedic surgery; orthopaedics; arthroplasty,” and “radiation oncology; urology; radiology”). At the bottom left side KTH has some publications in psychiatry, mental health, and brain research (“psychiatry; neurology; pharmacology,” “brain; cognition; attention,” and “psychiatry; health; mental disorders”). There are also some publications in the COVID-19 cluster (“covid; cov; sars”).
The KTH map reveals a clear focus on methodology, expressed by terms such as “wastewater surveillance” in the COVID-19 cluster, “magnetic resonance imaging” in the neurosurgery discipline, and “tomography” in “radiation oncology; urology; radiology”.
The map for Stockholm University (Figure 9B) shows about 4,400 biomedical publications, which is a small proportion of Stockholm University’s total publication output. This map reveals a focus on natural sciences, biophysics, and biochemistry, but also a large proportion of publications in psychology. There is a stronger focus on biology and ecology in this map than in the KTH map. Some disciplines are of about the same size in the Stockholm University map and the KTH map. However, zooming in to the specialties reveals differences. For example, in “technology; environmental engineering; environment” Stockholm University has a focus on monitoring the environmental and toxicological effects of pollutants and emissions, whereas KTH has a focus on bioengineering (expressed by terms such as “waste disposal,” “filtering,” “arsenite binding,” and “deionization”). Like KTH, Stockholm University has only a few publications on the clinical side of the map (top left).
The KI map (Figure 9C) is fundamentally different from the Stockholm University and KTH maps. It shows about 21,600 biomedical publications, which covers a high proportion of KI’s total publication output. KI has a wide range of research in most areas of the map, not least on the clinical side. However, there is a smaller proportion of publications on the right side of the map, including, for example, medical aspects of nanotechnology, toxicology, and microscopy. Similar to Stockholm University, KI has many publications in psychology and psychiatry.
5. DISCUSSION
I have shown how a publication-level classification, including both coarse and granular levels, can be used to create overlay maps of science that provide both overview and detail. To exemplify the use of such maps I have demonstrated potential utilization by revealing the topical structure of coronavirus/COVID-19 research and differences and similarities in research orientation at three universities.
The visualizations created by the methodology put forward in this paper enable the navigation of millions of articles, from broad levels down to individual articles. No existing software supports such navigation. Potentially, any set of PubMed publications can be projected onto the map (currently not supported by the web tool): for example, the publication output of an organization or a journal or within a particular research field. The contents of the disciplines and specialties displayed in the visualization can be explored down to narrow topics and individual articles. Thereby, analysts and other users can get a deeper and richer understanding of the data displayed in overlay maps. Potentially, the visualization technique can be used in other applications, for example to display local maps or to visualize search results in information retrieval systems.
There are some limitations of the visualization methodology. Some of these limitations relate to the creation of the classification and some to the visualization methodology itself. Several researchers have acknowledged that different choices of methods, parameter values, relational measures, and clustering algorithms result in diverse representations of science, sometimes equally valid (for examples, see Gläser et al., 2017; Waltman et al., 2020). My choices of citation relation, clustering algorithm, parameter values, and labeling approach have been guided by both empirical support and practical considerations. My approach, based on direct citations, has the advantages of being efficient (having fewer relations than bibliographic coupling and cocitations) and including relations to and from a large proportion of the publications in the data source, given that a comprehensive data source and a large time span are used. The direct citation approach has performed well in quantitative evaluations using large corpuses (Boyack & Klavans, 2020; Klavans & Boyack, 2017). The resolution parameter values have also been guided by previous research (Sjögårde & Ahlgren, 2018, 2020). Nonetheless, other representations may be equally valid and express other aspects of the research landscape. For example, cocitations may be a better choice if one wants to examine the historical development of a research field, and bibliographic coupling might be preferable to display related publications in an information retrieval setting.
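The efficiency argument can be illustrated by deriving the three relation types from a single citation list. The sketch below is a toy illustration with hypothetical identifiers and a hypothetical function name; it is not the pipeline used in the study. Direct citation yields at most one relation per citation, whereas bibliographic coupling and cocitation grow with the number of pairs among the publications citing the same reference, or among the references in the same reference list.

```python
from collections import defaultdict
from itertools import combinations


def relation_sets(citations):
    """Derive undirected relation sets from (citing, cited) pairs."""
    direct = {frozenset(pair) for pair in citations if pair[0] != pair[1]}

    references = defaultdict(set)  # citing publication -> cited publications
    citers = defaultdict(set)      # cited publication -> citing publications
    for citing, cited in citations:
        references[citing].add(cited)
        citers[cited].add(citing)

    # Bibliographic coupling: two publications share at least one reference.
    coupling = set()
    for citing_set in citers.values():
        coupling.update(frozenset(p) for p in combinations(sorted(citing_set), 2))

    # Cocitation: two publications appear in the same reference list.
    cocitation = set()
    for cited_set in references.values():
        cocitation.update(frozenset(p) for p in combinations(sorted(cited_set), 2))

    return direct, coupling, cocitation


# Toy network: "a" cites five references; four other publications cite "r1".
citations = [("a", f"r{i}") for i in range(1, 6)]
citations += [(p, "r1") for p in ("b", "c", "d", "e")]
direct, coupling, cocitation = relation_sets(citations)
n_direct, n_coupling, n_cocitation = len(direct), len(coupling), len(cocitation)
```

In this toy network there are 9 direct citation relations but 10 coupling and 10 cocitation relations. How large the gap becomes in practice depends on the lengths of reference lists and on how often individual publications are cited.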
Labeling of the obtained clusters is a challenging task. Even though the methodology used works reasonably well, the subject orientation of clusters is sometimes hard to interpret using the cluster labels. Occasionally, a user needs to consider other information to understand the subject orientation of a cluster and to distinguish it from other clusters; for example, additional key terms, sibling cluster labels, parent and child cluster labels, and publication records in PubMed. Providing interactivity facilitates such interpretation. However, it remains unclear to what extent interpretation is a problem for users. Further work evaluating the interpretability of classifications from a user perspective is therefore needed.
The visualizations that I have presented include clusters visualized as nodes at two levels. It is possible to visualize more levels, but at the risk of making the visualization more cluttered and harder to interpret. Functionality hiding nodes at granular levels when zooming out and showing nodes when zooming in might be an option to be able to include more levels in the visualization. However, including nodes at additional levels does not necessarily help users to read and interpret the visualization. Users might, for example, prefer reading lists at more granular levels. Therefore, user studies are needed to develop user-friendly features and to make interactive overlay maps of science easier to interpret.
The intention of this study has not been to evaluate normalization methods or layout algorithms. There might be better options to create layouts, in particular at the discipline level, regarding both normalization of citation relations and layout algorithm.
I have emphasized the clusters by contracting the subnetwork of sibling specialties. This procedure puts sibling specialties in proximity and improves readability. However, it may distort relations outside the cluster. Stressing the hierarchical structure of the classification hides transverse relations in this structure, such as relations between specialties with different parents. Complementing the visualization with other information may be a viable option to make such relations visible and to enrich analyses. The purpose of a study must guide the application of the map. For example, the map can be restricted to concepts expressed by MeSH and transverse relations can be highlighted using the citation relations obtained when constructing the classification. Using specialties at the top level and topics at a lower level might be a better solution for smaller sets of data.
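One simple way to picture such a contraction is to pull every specialty toward the centroid of its sibling group. The sketch below is an assumption made for illustration and not necessarily the procedure used for the maps; the function name, the `factor` parameter, and the toy coordinates are hypothetical.

```python
def contract_siblings(positions, parent_of, factor=0.5):
    """Move each node toward the centroid of its sibling group.

    positions: mapping node -> (x, y) layout coordinates.
    parent_of: mapping node -> parent cluster label.
    factor:    1.0 keeps the layout unchanged; 0.0 collapses each
               sibling group onto its centroid.
    """
    groups = {}
    for node, parent in parent_of.items():
        groups.setdefault(parent, []).append(node)

    contracted = {}
    for nodes in groups.values():
        cx = sum(positions[n][0] for n in nodes) / len(nodes)
        cy = sum(positions[n][1] for n in nodes) / len(nodes)
        for n in nodes:
            x, y = positions[n]
            contracted[n] = (cx + factor * (x - cx), cy + factor * (y - cy))
    return contracted


# Toy layout: s1 and s2 are siblings under d1; s3 belongs to d2.
positions = {"s1": (0.0, 0.0), "s2": (4.0, 0.0), "s3": (10.0, 10.0)}
parent_of = {"s1": "d1", "s2": "d1", "s3": "d2"}
contracted = contract_siblings(positions, parent_of, factor=0.5)
```

In the toy layout, s1 and s2 move closer to each other, while the distance from s2 to s3 changes even though their relation did not, which is the kind of distortion outside the cluster mentioned above.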
The choice of the ForceAtlas layout algorithm was guided by my experience with visualizing a wide range of bibliometric networks (e.g., coauthor networks, MeSH-networks, coauthoring organization networks, and article citation networks) in bibliometric practice at Swedish universities. ForceAtlas is implemented in the visualization software Gephi, which makes it possible to try out different parameter values to facilitate the readability of a particular network and more generally to learn how to use parameter values for different kinds of networks. In my experience, the ForceAtlas algorithm creates visualizations that are interpretable and make sense, and I have received positive feedback from users on visualizations created using this layout. Alternatives to ForceAtlas are, for example, the OpenOrd layout algorithm used by Boyack et al. (2020), the VOS layout algorithm, which is implemented in the VOSviewer software (van Eck, Waltman, et al., 2010b; van Eck & Waltman, 2010), the Fruchterman–Reingold layout algorithm (Fruchterman & Reingold, 1991), and the Kamada–Kawai layout algorithm (Kamada & Kawai, 1989).
The maps created in this study have been restricted to the biomedical sciences. During recent years the amount and proportion of available bibliographic metadata has increased substantially. Future work may be extended to other research fields.
There are several technical issues related to the visualization tool used. For example, the current version of the visualization package does not support smartphones and tablets; identification of areas of interest to a user could be facilitated by filters and improved search features; loading the visualization files is rather slow; and hyperlinks are provided in batches if a cluster includes more than 500 publications. The intention has not been to provide a perfect visualization tool but rather to show how interactive visualizations of hierarchical classifications can provide users with enriched possibilities to explore the scientific literature. I have demonstrated that it is possible to provide maps of science that can give the user an overview of millions of publications and details down to individual publications. Such maps may constitute a valuable tool for researchers studying science, improve the transparency of cluster-based citation normalization, support research management and policy making, and constitute a tool for researchers to explore research of relevance to them.
ACKNOWLEDGMENTS
I would like to thank Ludo Waltman and two anonymous reviewers for their constructive feedback on an earlier version of this paper.
COMPETING INTERESTS
The author has no competing interests.
FUNDING INFORMATION
Peter Sjögårde was funded by the Foundation for Promotion and Development of Research at Karolinska Institutet.
DATA AVAILABILITY
All maps and underlying data and configuration files are available online:
Classification and labels:
Base map:
https://petersjogarde.github.io/papers/hiervis/base/index.html
Base map files:
https://github.com/petersjogarde/petersjogarde.github.io/tree/main/papers/hiervis/base
Covid-19/SARS-CoV-2 maps:
Publications: https://petersjogarde.github.io/papers/hiervis/covid_v2/pubs/index.html
Cited publications: https://petersjogarde.github.io/papers/hiervis/covid_v2/cited/index.html
Covid-19/SARS-CoV-2 files:
https://github.com/petersjogarde/petersjogarde.github.io/tree/main/papers/hiervis/covid_v2
Stockholm trio maps:
KTH: https://petersjogarde.github.io/papers/hiervis/sthlm_trio/kth/index.html
Stockholm University: https://petersjogarde.github.io/papers/hiervis/sthlm_trio/sthlm_univ/index.html
KI: https://petersjogarde.github.io/papers/hiervis/sthlm_trio/ki/index.html
Stockholm trio files:
https://github.com/petersjogarde/petersjogarde.github.io/tree/main/papers/hiervis/sthlm_trio
Notes
The process took about 10 h 15 min to run 100 iterations. A total of 256 Gbyte RAM was allocated for the process. The value of the quality function (CPM) was 0.408. The resolution parameter was set to 0.00010. Version 1.1.0 of the software was downloaded from https://github.com/CWTSLeiden/networkanalysis (November 20, 2020).
At aggregated level, reclassification was performed by merging clusters below the threshold with the cluster above the threshold having the strongest relational strength.
Stanford CoreNLP is available at https://stanfordnlp.github.io/CoreNLP/.
An R-function (with base code in C) was created by my colleague Robert Juhasz for this task. The function is equivalent to the ForceAtlas layout in Gephi. The following parameter values were used: number of iterations = 10,000, inertia = 0.1, repulsion strength = 5,000, attraction strength = 5, max displacement = 5, freeze balance = true, freeze strength = 80, freeze inertia = 0.2, gravity = 1, outbound attraction distribution = false, adjust sizes = false, speed = 1, cooling = 1.
The SigmaExporter was developed through the InteractiveVis project at the Oxford Internet Institute, University of Oxford. The Java code of the exporter is available under a GPLv3 License. https://gephi.org/plugins/#/plugin/sigmaexporter (March 5, 2020).
REFERENCES
Author notes
Handling Editor: Vincent Larivière