Understanding knowledge role transitions: A perspective of knowledge codification

Abstract Informal knowledge constantly transitions into formal domain knowledge in the dynamic knowledge base. This article focuses on an integrative understanding of the knowledge role transition from the perspective of knowledge codification. The transition process is characterized by several dynamics involving a variety of bibliometric entities, such as authors, keywords, institutions, and venues. We thereby designed a series of temporal and cumulative indicators to respectively explore transition possibility (whether new knowledge could be transitioned into formal knowledge) and transition pace (how long it would take). By analyzing the large-scale metadata of publications that contain informal knowledge and formal knowledge in the PubMed database, we find that multidimensional variables are essential to comprehensively understand knowledge role transition. More significantly, early funding support is more important for improving transition pace; journal impact has a positive correlation with the transition possibility but a negative correlation with transition pace; and weaker knowledge relatedness raises the transition possibility, whereas stronger knowledge relatedness improves the transition pace.


INTRODUCTION
Knowledge is a driving force for technological and economic change, notably scientific knowledge (Mistry & Berardi, 2016; Naghavi & Walsh, 2011). Scientific knowledge is a wide and abstract concept, and one way to classify knowledge is as tacit or explicit (Ahmadyousefi, Choobchian et al., 2020). As tacit knowledge can only be understood by people who have the same personal experience, the codification of knowledge is considered the essential method for translating tacit knowledge into economically viable innovations (Lissoni, 2001). Specifically, knowledge codification not only alters the tacit forms of knowledge but is also a process of knowledge creation. When new knowledge is codified, new concepts and terminology are introduced, which inherently involves the further creation of knowledge (Cohendet & Meyer-Krahmer, 2001). Codified knowledge is explicit and formal. It comes in a variety of forms (Faber, 2011; Pór & Molloy, 2000; Su & Lee, 2010), such as words and numbers, scientific procedures, or universal principles. For example, keywords are usually considered fine-grained knowledge entities (Yang, Li & Huang, 2018). New knowledge is transformed into formal knowledge through knowledge codification. There is therefore a need to better explore and understand knowledge transformation as a means of creating knowledge.
In the context of knowledge codification, formal knowledge refers to knowledge that is codified. Scientific knowledge can be intuitively defined as formal knowledge when adopted into the domain knowledge base (Hjørland & Albrechtsen, 1995; Möller, Sintek et al., 2008; Wang, Hamilton & Bither, 2005). Correspondingly, informal knowledge refers to the knowledge entities that were not codified or adopted into the knowledge base. Knowledge role transition is the transformation from informal knowledge to formal knowledge in the process of knowledge codification. Formal knowledge is usually described by an ontology, which is a formal, explicit specification of a shared conceptualization of a domain. The machine interpretability of formal knowledge (codified knowledge) has been increasingly critical to improving the performance of information retrieval and image understanding (Möller et al., 2008; Wildemuth, 2004) and organizing solutions to other real problems that exist in a specific research domain (Cardoso, Da Silveira, & Pruski, 2020; Tsatsaronis, Varlamis et al., 2013). Thus, understanding different transition patterns of formal knowledge is critical to tracking credible and valuable scientific knowledge and promoting innovation in science and technology.
Formal knowledge was born from the role transition of informal knowledge in an exogenous and endogenous-driven process. As the dynamic cooperation of research elements involved could lead to the role transition of scientific knowledge, we attempt to understand the patterns of knowledge role transition from multiple dimensions (i.e., publication, author, institution, funding, descriptors (keywords), and venue (journals)). The three primary contributions of this paper are as follows:
1. We attempt to understand the role transition in the process of knowledge growth from the perspective of knowledge codification.
2. We investigate influence factors of the transition possibility from informal knowledge to formal knowledge.
3. We design a series of temporal indicators to characterize the transition pace from informal knowledge to formal knowledge.
The remainder of this paper is structured as follows. Section 2 surveys two topics: the selection of analysis dimensions for understanding knowledge transition, and the knowledge growth process from a knowledge codification perspective. Section 3 provides concept definitions and proposes variables based on the metadata of scholarly publications. Section 4 introduces the empirical data and constructs the formal knowledge and informal knowledge matched-pair samples. Section 5 presents the matched-pair analysis of formal knowledge and informal knowledge and the dynamic correlation analysis between various variables and transition time. Section 6 comprehensively discusses the findings and implications of this study. Section 7 concludes the paper.

Analysis Dimension Selection for Understanding Knowledge Transition
The growth process is always bounded and usually follows an S-shaped or sigmoid curve in many studies including knowledge growth (Shimogawa, Shinno & Saito, 2012). The process of knowledge growth has also been described as a sequence of life stages: either three main stages (embryonic, early, and recognized) or four S-curve stages (birth, growth, maturity, and senility) (Tu & Seng, 2012;van den Oord & van Witteloostuijn, 2018). Thus, the growth process can also be crudely considered a transition from one stage to another. Knowledge transition could be characterized across several dimensions, such as the number of actors involved (e.g., scientists, institutions), funding support, and knowledge outputs produced (e.g., publications, patents). Most importantly, these dimensions are likely to coevolve and possibly cause different effects over different stages of knowledge growth.
Researchers have devoted much time and effort to understanding the transition process of scientific knowledge in order to discover reliable and valuable knowledge. Emerging technology and topic discoveries are representative studies of detecting promising scientific knowledge through understanding the role transition patterns of technologies and topics. To be more specific, Tu and Seng (2012) calculated the publication number corresponding to each keyword to measure the power of transitioning into emerging topics. Some researchers have explored the transition patterns of scientific technologies and research topics from various dimensions according to the scholarly metadata of publications. Rotolo, Hicks, and Martin (2015) summarized scholarly metadata (i.e., authors, institutions, funding, keywords) to crystallize the five attributes of emerging technology: radical novelty; relatively fast growth; coherence; prominent impact; and uncertainty and ambiguity. Soon afterward, Carley, Newman et al. (2018) developed four attributes by metadata analytics to characterize emerging technology: novelty, persistence, community, and growth. Referring to emerging technology, Wang (2018) proposed four attributes (i.e., radical novelty, relatively fast growth, coherence, and scientific impact) of emerging research topics based on scholarly metadata analytics. Iqbal, Qadir et al. (2019) conducted metadata analysis to identify research topics with high impact from the publication number as well as citation count dimensions. Weis and Jacobson (2021) detected the early warning signal for impactful research by analyzing high-dimensional relationships among metadata of the scientific literature from papers, authors, and journals.
Scientific knowledge transition is affected by various factors. Thus, it is essential to observe the growth process from different dimensions of scholarly publications, which helps us gain a comprehensive understanding of knowledge role transition. Scholarly metadata is associated with publication authors, institutions, funding, citations, venues, keywords, etc., which benefits characteristic extraction. This study will take advantage of the large-scale metadata information and the intertwining relationships in scholarly publications to understand the role transition of scientific knowledge.
[...] the knowledge base extension. Tsatsaronis, Varlamis et al. (2013) proposed temporal variables to understand the dynamic process of concept transformation of the MeSH (Medical Subject Headings) ontology. Furthermore, Cardoso, Pruski, and Da Silveira (2018) introduced external sources of knowledge (i.e., PubMed and UMLS) to support biomedical ontology evolution by identifying outdated knowledge entities and the required types of change for the domain to evolve. Additionally, researchers widely recognize that the transformation of knowledge entities in the knowledge ecosystem is analogous to that of biological units in the ecosystem (Sice, Thirkle & Ogwu, 2018).
This section has elaborated the knowledge growth process from a knowledge codification perspective. For knowledge entities, researchers focus on their growth process to understand the factors of knowledge transition based on thesauri. In the process of knowledge growth, if the knowledge entity is adopted into thesauri, it transitions into formal knowledge. That can provide realistic scenarios for investigating knowledge role transitions.

Concept Definition Under Knowledge Codification
The research objective of this paper is to investigate the role transition patterns from informal knowledge to formal knowledge in terms of transition possibility and transition pace. In the field of library and information science, the thesaurus is composed of codified knowledge. We defined the relevant concepts based on a domain thesaurus to improve readability, as shown in Figure 1.
For example, knowledge A and C have the same time intervals, but they have different growth trajectories, which indicates that knowledge entities have different transition possibilities. "Balloon embolectomy" (A) and "neuraminidase genes" (C) both appeared in abstracts in 1976 for the first time. "Balloon embolectomy" was codified and adopted in a domain thesaurus in 2012, whereas "neuraminidase genes" was not, meaning that "balloon embolectomy" had transitioned into formal knowledge but "neuraminidase genes" was still informal knowledge at that time. In addition, knowledge B takes a shorter transition time to become formal knowledge compared to knowledge A, which implies that the pace of knowledge transition is different. Transition time refers to the length of time that knowledge entities take from their first appearance to their adoption into the thesaurus. For example, "forensic toxicology" (B) also first appeared in abstracts in 1976 and was codified and adopted into the domain  . Thus, "forensic toxicology" (B) has a shorter transition time than "balloon embolectomy" (A).
1. Formal knowledge: This refers to knowledge that is codified and adopted into a domain thesaurus, such as A, B, and D in Figure 1.
2. Informal knowledge: This is derived from uncodified knowledge entities that are not adopted into a domain thesaurus currently, such as C in Figure 1.
3. Knowledge role transition: In the process of knowledge growth, informal knowledge is codified and adopted into a domain thesaurus. This means that informal knowledge transitions into formal knowledge. This fact is considered as knowledge role transition.
4. Transition possibility: This refers to whether informal knowledge could transition into formal knowledge; this action may be driven by exogenous and endogenous factors.
5. Transition pace: This refers to how long the role transition of knowledge takes, which can be measured through the adoption time in the records of a domain thesaurus.
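The definitions above can be illustrated with a minimal Python sketch. The function names and the record layout (first-appearance year, adoption year or `None`) are assumptions made for illustration; the years for "balloon embolectomy" and "neuraminidase genes" are the ones quoted in the example above.

```python
from typing import Optional

def transition_time(first_year: int, adoption_year: Optional[int]) -> Optional[int]:
    """Years from first appearance to thesaurus adoption (None if never adopted)."""
    if adoption_year is None:
        return None
    return adoption_year - first_year

def is_formal(adoption_year: Optional[int], as_of_year: int) -> bool:
    """An entity counts as formal knowledge once it has been adopted by as_of_year."""
    return adoption_year is not None and adoption_year <= as_of_year

# (first appearance in abstracts, adoption into the thesaurus)
entities = {
    "balloon embolectomy": (1976, 2012),  # formal knowledge (A)
    "neuraminidase genes": (1976, None),  # still informal (C)
}

for name, (first, adopted) in entities.items():
    print(name, is_formal(adopted, 2012), transition_time(first, adopted))
```

Transition possibility corresponds to the boolean outcome and transition pace to the elapsed number of years.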

Measuring Indicators
To depict or characterize the growth process of new knowledge, we have selected various characteristics by taking advantage of various metadata information and the intertwining relationships in scholarly publications (i.e., authors, keywords/descriptors, citations, institutions, journals, funding) (Salatino, 2019), as shown in Figure 2.
Specifically, according to the metadata information, the number of actors involved, funding, and knowledge outputs produced were utilized in order to characterize the growth patterns of knowledge entities (Carley et al., 2018; Rotolo et al., 2015). In Figure 2, the six circle nodes (except knowledge entities) are the six dimensions of analyzing the influence factors of transition possibility and transition pace. Thus, we propose the corresponding variables: scholarly publication, author, institution, funding, descriptors (keywords), and venue (journals), as shown in Table 1. In the following content, we elaborate on the rationale behind the set of proposed variables.

Publication dimension
The number of scholarly publications is the most intuitive signal of knowledge output (Wang, 2018). Because annotation terms always appear one at a time in the keyword list, the number of publications is a measuring indicator of knowledge growth (Tu & Seng, 2012). In addition, the citation relationship not only provides information on the impact of scholarly publications but also measures the impact of research components such as authors, institutions, journals, and keywords (Waltman, 2016). The number of citations of scholarly publications could provide an indication of attention inside the academic domain. Thus, the publication dimension could provide four indicators (i.e., cumulative # pubs, annual avg. pubs, cumulative # citations, and annual avg. citations).

Table 1. Proposed variables
Dimension | Variable | Abbreviation | Description
Publication | Yearly average of publication number | Annual avg. pubs | Yearly average of publication number in the span of transition time.
Publication | Cumulative citation count | Cumulative # citations | Count of accumulative citation in the span of transition time.
Publication | Yearly average of citation count | Annual avg. citations | Yearly average of citation count in the span of transition time.
Author | Cumulative author count | Cumulative # authors | Count of authors in the span of transition time.
Author | Author average impact | Author avg. impact | Average h-index of authors adopting descriptors in the span of transition time.
Author | Yearly average of author count | Annual avg. authors | Yearly average of author count in the span of transition time.
Author | Average author counts per publication | Avg. authors per pub | Average author counts per publication in the span of transition time.
Institution | Cumulative institution number | Cumulative # institutions | Number of institutions in the span of transition time.
Institution | Yearly average of institution number | Annual avg. institutions | Yearly average of institution number in the span of transition time.
Funding | Cumulative funding number | Cumulative # funding | Number of funding awards in the span of transition time.
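As an illustration of how a cumulative indicator and its annual-average counterpart relate, the following Python sketch derives both from a per-year count series. The function name is hypothetical, and treating the span of transition time as the number of yearly entries is an assumption of this sketch.

```python
def cumulative_and_annual_avg(yearly_counts):
    """Return (cumulative count, yearly average) over the span of transition time.

    yearly_counts holds one value per year, from the year of first
    appearance up to the adoption year (an assumption of this sketch).
    """
    total = sum(yearly_counts)
    years = len(yearly_counts)
    annual_avg = total / years if years else 0.0
    return total, annual_avg

# Publications mentioning a hypothetical entity in four consecutive years.
print(cumulative_and_annual_avg([1, 2, 3, 4]))  # (10, 2.5)
```

The same helper applies to citations, authors, institutions, funding awards, and journals once their yearly counts are available.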

Author dimension
Authors of scholarly publications are important to explore in scientific metrics, such as author collaboration (Ebrahimi, Asemi et al., 2021;Guan, Yan & Zhang, 2017;Kaur & Mahajan, 2015) and author impact (Amjad, Rehmat et al., 2020;Dunaiski, Geldenhuys & Visser, 2018). Authors with high impact would lead the development of a discipline. Researchers contribute to the updating and growth of formal knowledge in the form of research results (Sun & Latora, 2020). The author count has been commonly adopted to compute the author community (Rotolo et al., 2015;Lu, Huang et al., 2021). In addition, we also obtained the measure of the author community by an average author count per publication and chose the common h-index method (Hirsch, 2005) to compute author impact; the details of the calculation approach are given in Appendix B in the Supplementary material. Thus, we utilized cumulative # authors, annual avg. authors, avg. authors per pub, and author avg. impact to calculate the community size and impact of the authors.
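For readers unfamiliar with the h-index, a minimal Python sketch of the standard definition (Hirsch, 2005) follows. Averaging the h-index over the authors of an entity is shown only as one plausible reading of "author avg. impact"; the exact procedure is the one in Appendix B, and the function names here are hypothetical.

```python
def h_index(citations):
    """Largest h such that at least h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

def author_avg_impact(per_author_citations):
    """Mean h-index over authors (assumed aggregation, for illustration only)."""
    if not per_author_citations:
        return 0.0
    return sum(h_index(c) for c in per_author_citations) / len(per_author_citations)

print(h_index([10, 8, 5, 4, 3]))                      # 4
print(author_avg_impact([[10, 8, 5, 4, 3], [1, 1]]))  # 2.5
```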

Institution dimension
As per the relationships shown in Figure 2, a research institution symbolizes a large research community, containing many talents, equipment, and other research resources, which invariably influences academic development, and an academic publication means the crystallization of an institution's wisdom (Kahn, 2011). The reputation of an institution also influences the growth of a research topic (Hottenrott, Rose & Lawson, 2021). Academic institutions are also essential objects in scientometrics research (Ellegaard & Wallin, 2015;Yegros-Yegros, Capponi & Frenken, 2021). In our study, we calculated the number of institutions to depict the growth situation of new knowledge, which was divided into two indicators: cumulative # institutions and annual avg. institutions.

Funding dimension
Generally, relatively large investments indicate that a prominent impact is expected (Álvarez-Bornstein & Bordons, 2021). The amount of funding could also cast light on the development prospects of new knowledge. Thus, early indications of knowledge growth may be revealed from the analysis of funding data. Although the coverage of funding data remains limited (Hopkins & Siepel, 2013), we extracted the usage information of funding data as reported by authors in the acknowledgments section of scholarly publications. The funding support signifies the development and energy of new knowledge through the second-order relationship between funding and knowledge outputs produced. We adopted cumulative # funding and annual avg. funding in understanding knowledge role transition.

Journal dimension
Journal information was also utilized to measure the importance and development potential of research topics (Moed, 2010). Mainly, the quantity and impact of scholarly publications have a positive correlation with the journal impact factor (Dinesh, 2017). Peset, Garzón-Farinós et al. (2020) found that journal impact has a significant effect on the survival time of author keywords, which is an important perspective to investigate knowledge growth. Thus, we speculate that journal impact reflects the reliability and recognition of new knowledge in the specific domain. In this study, the comprehensive journal impact factor, cumulative # journals, and annual avg. journals were considered as the indicators of journal venue. The journal impact factor indicates the domain recognition of the journal, and the journal number signifies the popularity of knowledge entities.

Descriptor dimension
Knowledge relatedness is a key indicator to represent semantic specificity (Breschi, Lissoni & Malerba, 2003). We attempt to measure the semantic specificity of one knowledge entity through the distribution of co-occurrence counts with other knowledge entities. The idea is similar to the Gini coefficient (Gini, 1997) which was utilized to demonstrate a degree of inequality of distribution in bibliometric studies (Cockriel & McDonald, 2018;Leydesdorff, Wagner & Bornmann, 2019;Nuti, Ranasinghe et al., 2015). Referring to the Gini coefficient, we calculated the integral area of the Lorenz curve to express knowledge relatedness. If a new knowledge entity jointly occurs with others more often, it has lower semantic specificity in the specific domain. For example, the "coliphages" and "corsiaceae" cases with the same count have different semantic specificity, as per the comparison examples shown in Figure 3.
In Figure 3, the x-axis represents the order of other knowledge entities, and the y-axis corresponds to the normalized co-occurrence count of knowledge entity pairs. The specific calculation process is shown in Appendix B in the Supplementary material. Based on the above analysis, we can divide the variables into cumulative and temporal variables to explore the two research problems of transition possibility and transition pace. The analysis framework for the knowledge role transition is shown in Figure 4.
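To make the Lorenz-curve idea concrete, the Python sketch below computes the area under a textbook Lorenz curve built from the co-occurrence counts of one knowledge entity with other entities. It is a sketch under stated assumptions, not the authors' procedure: the exact normalization and ordering are given in Appendix B, and the function names are hypothetical.

```python
def lorenz_area(cooccurrence_counts):
    """Area under the Lorenz curve of a co-occurrence distribution.

    Counts are sorted in ascending order, turned into cumulative shares,
    and integrated with the trapezoid rule on an evenly spaced x-axis.
    A perfectly even distribution gives 0.5; a concentrated one gives less.
    """
    values = sorted(cooccurrence_counts)
    total = sum(values)
    n = len(values)
    if n == 0 or total == 0:
        return 0.0
    area = 0.0
    cum_prev = 0.0
    for v in values:
        cum = cum_prev + v / total
        area += (cum_prev + cum) / (2 * n)  # trapezoid of width 1/n
        cum_prev = cum
    return area

def gini(cooccurrence_counts):
    """Gini coefficient derived from the Lorenz area: G = 1 - 2 * area."""
    return 1.0 - 2.0 * lorenz_area(cooccurrence_counts)

print(lorenz_area([1, 1, 1, 1]))  # evenly spread co-occurrences
print(lorenz_area([0, 0, 0, 4]))  # co-occurrences concentrated on one partner
```

Two entities with the same total count can therefore receive different values, which is the contrast the "coliphages" and "corsiaceae" cases are meant to show.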

DATA
In this study, the whole PubMed XML data set, an essential literature resource for the medical domain, was parsed. A total of 30,376,130 scientific publications up to 2019 were collected from the PubMed data set. The metadata of these scientific publications provide rich characteristics for exploring knowledge transition. A scholarly publication is associated with various metadata, such as author, institution, citation, journal, and keywords. The essential data acquisition and processing are shown in Figure 5.

Data Collection
The acquisition of multidimensional data is centered on the publications in PubMed, which include a large amount of metadata information. First, because the citation relationship of PubMed is incomplete, we obtained the citation data of Web of Science (WoS) to make up for missing citation relationships in PubMed (Xu, Kim et al., 2020). Second, the journal information is collected from the SJR website (Scimago Journal & Country Rank), which includes SJR for evaluating journal impact (Guerrero-Bote & Moya-Anegón, 2012), h-index, Cites/Doc. (2 years), etc.
Finally, as a domain knowledge base, the MeSH thesaurus was parsed to gain descriptors that were adopted by domain experts. We randomly selected a version of the MeSH thesaurus in a recent 5-year period. MeSH includes three items of data: descriptors (subject heading), qualifiers, and supplementary concept records. Among these, descriptors are divided into 16 trees and, as of 2015, they number 27,885 descriptors. We took MeSH descriptors as formal knowledge.

Data Preprocessing
More importantly, two data items need to be preprocessed: author name disambiguation and knowledge match. Author name disambiguation resolves the problem of author consistency. Knowledge match establishes the relationship between knowledge entities and papers.

Author name disambiguation
Author name disambiguation is a general method for identifying unique authors. According to our investigation, Author-ity (Torvik & Smalheiser, 2009) and Semantic Scholar (Ammar, Groeneveld et al., 2018) are two high-quality data sets. Because the Author-ity data set has a higher F1 score (98.16%) than the Semantic Scholar data set (Xu et al., 2020), we selected the author's unique ID from the Author-ity data set as the primary unique identifier, following this proven strategy. However, Author-ity is limited by its time range and only contains PubMed papers before 2009. Thus, authors after 2009 were supplemented by using the author name disambiguation results of Semantic Scholar.
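The primary-then-fallback strategy described above can be sketched in Python as follows. The lookup tables keyed by (PMID, author position), the ID prefixes, and the function name are hypothetical stand-ins chosen for illustration; they are not the actual formats of the Author-ity or Semantic Scholar data sets.

```python
def resolve_author_id(pmid, position, authority_ids, s2_ids):
    """Prefer the Author-ity ID; fall back to Semantic Scholar when it is missing.

    Both arguments are hypothetical dicts mapping (pmid, author position)
    to a disambiguated author ID. A source prefix keeps the two ID spaces apart.
    """
    key = (pmid, position)
    if key in authority_ids:
        return "authority:" + authority_ids[key]
    if key in s2_ids:
        return "s2:" + s2_ids[key]
    return None  # author could not be disambiguated

authority_ids = {(1001, 1): "A42"}                 # covered by Author-ity
s2_ids = {(1001, 1): "S7", (2002, 1): "S9"}        # later paper only in Semantic Scholar

print(resolve_author_id(1001, 1, authority_ids, s2_ids))  # authority:A42
print(resolve_author_id(2002, 1, authority_ids, s2_ids))  # s2:S9
```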

Knowledge match
The abstracts of scientific literature usually emphasize research contributions (Bu, Li et al., 2021). The knowledge that appears in the abstract almost completely describes the core content of a scholarly publication. Knowledge match can build a bridge between fine-grained knowledge and scholarly publication. In this study, the formal knowledge is from the preferred concepts in the MeSH thesaurus, so the abstract of scholarly publication can be annotated by the formal knowledge using sequence matching algorithms. We found 372,899,456 matching records from 30,376,130 publications, which were integrated with the original annotated records in the PubMed database. Therefore, the scholarly publications could be conveniently retrieved by matching and annotating records of formal knowledge.
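A minimal Python sketch of annotating abstracts with thesaurus terms is given below. The paper does not name its sequence matching algorithm, so case-insensitive whole-term matching with a regular expression is an assumption of this sketch, and the function names are hypothetical.

```python
import re

def build_matcher(terms):
    """Compile one regex that matches any thesaurus term as a whole token sequence."""
    ordered = sorted({t.lower() for t in terms}, key=len, reverse=True)
    if not ordered:
        raise ValueError("at least one term is required")
    # Longest terms first so that a longer term wins over a term it contains.
    alternatives = "|".join(re.escape(t) for t in ordered)
    return re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)", re.IGNORECASE)

def match_terms(abstract, matcher):
    """Return the sorted set of terms (lowercased) found in an abstract."""
    return sorted({m.group(0).lower() for m in matcher.finditer(abstract)})

matcher = build_matcher(["forensic toxicology", "toxicology", "coliphages"])
print(match_terms("Advances in Forensic Toxicology and coliphages.", matcher))
```

Each (publication, matched term) pair then serves as one matching record linking a knowledge entity to a paper.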

Empirical Data Construction
Formal knowledge is usually kept in the form of thesauri based on ontology. Medical Subject Headings (MeSH) was developed by the National Library of Medicine (NLM), which is a controlled vocabulary for indexing scholarly publications in the PubMed database (Liu, Peng et al., 2015). The domain-specific structured ontology describes what occurs in each domain and is usually considered as a domain knowledge base with specific hierarchies to represent concepts and relations (Nayak, Dutta et al., 2019). The transition time of formal knowledge is generally concentrated in the time interval of one to 40 years (see Figure 6), with an average value of 35 years. To collect records of the knowledge transition process, we need to match the formal knowledge with the abstracts of scholarly publications before the transition year. After mapping the existing literature to descriptors, we obtained 17,639 descriptors, which cover 63.3% of the total number of descriptors (27,885).
As some MeSH descriptors have a structure distinct from the usual keywords (i.e., author keywords; Valderrama-Zurián, García-Zorita et al., 2021), biomedical entity extraction is essential to discover informal knowledge in scholarly publications. We selected BERN (Kim, Lee et al., 2019), which learned the descriptor composition features of MeSH descriptors, to extract biomedical knowledge entities. Because the character length scope of knowledge entities from BERN is from 1 to 145, whereas that of MeSH descriptors is from 2 to 45, we restricted the word length scope of the experimental and control groups. Knowledge entities that were adopted into the MeSH ontology were considered formal knowledge, and other knowledge entities with the same year of debut and word length scope (from 2 to 45) were chosen as informal knowledge to constitute matched-pair samples with formal knowledge. The 17,639 MeSH descriptors were randomly matched with 3,449,589 informal knowledge entities as the control group. The distribution of the formal and informal knowledge samples over the transition time is shown in Figure 6. As the MeSH thesaurus adopted more knowledge entities in 1999 than in other years, the spike in Figure 6 contains mainly the 761 knowledge entities adopted in 1999, which have 25 years of data.
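The matched-pair construction can be sketched in Python as below. This is an illustration under assumptions: length is measured in characters, each formal entity receives up to `k` randomly drawn controls with the same debut year, and all function names are hypothetical. The actual sampling ratio is not stated in the text.

```python
import random
from collections import defaultdict

MIN_LEN, MAX_LEN = 2, 45  # length scope shared with MeSH descriptors

def build_control_pool(candidates, formal_terms):
    """Group informal candidates by debut year, keeping only in-scope lengths."""
    pool = defaultdict(list)
    for term, debut_year in candidates:
        if term in formal_terms:
            continue  # adopted entities cannot serve as controls
        if MIN_LEN <= len(term) <= MAX_LEN:
            pool[debut_year].append(term)
    return pool

def sample_controls(formal, pool, k, seed=0):
    """Randomly match each formal entity with up to k same-debut-year controls."""
    rng = random.Random(seed)
    matched = {}
    for term, debut_year in formal:
        same_year = pool.get(debut_year, [])
        matched[term] = rng.sample(same_year, min(k, len(same_year)))
    return matched

candidates = [("neuraminidase genes", 1976), ("x", 1976),
              ("balloon embolectomy", 1976), ("some entity", 1980)]
formal = [("balloon embolectomy", 1976)]
pool = build_control_pool(candidates, {t for t, _ in formal})
print(sample_controls(formal, pool, k=2))
```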

Matched-pair statistical analysis
Kolmogorov-Smirnov tests were conducted for the distribution of cumulative # pubs, cumulative # citations, cumulative # authors, cumulative # institutions, cumulative # funding, cumulative # journals, journal avg. impact, author avg. impact, and knowledge relatedness variables between treatment and control groups. The tests revealed that the distribution of these variables does not follow the normal distribution (p < 0.001). Thus, the Wilcoxon signed-rank test was performed to verify the differences in the distribution of these variables between formal and informal knowledge. Most tests showed statistically significant differences between formal and informal knowledge at the 0.001 level, except the test of journal avg. impact, as shown in detail in Table 2.
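For reference, the rank sums behind the Wilcoxon signed-rank test can be computed with the short Python sketch below. It is a plain-Python illustration of the textbook statistic (zero differences dropped, average ranks for ties) and omits the p-value; in practice a statistics library such as SciPy (`scipy.stats.wilcoxon`) would be used. The function name and sample values are hypothetical.

```python
def wilcoxon_signed_rank(x, y):
    """Return (W+, W-): rank sums of positive and negative paired differences."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        # Find the run of tied absolute differences starting at position i.
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus

# Paired values of one variable for formal (x) and matched informal (y) entities.
print(wilcoxon_signed_rank([10, 12, 9, 15], [8, 12, 10, 11]))  # (5.0, 1.0)
```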
To observe the differences between formal and informal knowledge, we provided box plots for these variables: cumulative # pubs, cumulative # citations, cumulative # authors, author avg. impact, cumulative # institutions, cumulative # funding, cumulative # journals, journal avg. impact, and knowledge relatedness, shown in Figure 7. The test results imply that cumulative # pubs, cumulative # citations, cumulative # authors, cumulative # institutions, cumulative # funding, cumulative # journals, author avg. impact, and knowledge relatedness could effectively distinguish formal knowledge from informal knowledge, but journal avg. impact is not significantly discriminative for formal knowledge and informal knowledge. Even though the difference in the author avg. impact variable is significant at the 0.001 level, the medians of the treatment and control groups are similar. Thus, a more in-depth comparative analysis is necessary for the cumulative # pubs, cumulative # citations, cumulative # authors, cumulative # institutions, cumulative # funding, cumulative # journals, author avg. impact, and knowledge relatedness variables, and notably author avg. impact.

Differentiation analysis over transition time
The test statistics analysis only presents the discriminatory power of the variables from the perspective of the overall distribution and does not reflect the specific characteristics of each variable. Keeping in mind that different knowledge entities may have different transition time intervals, we calculated the average values of these variables over transition time. This could highlight the performance of these variables in terms of discriminating knowledge entities with different transition time intervals. We visualized the distributions of average values of these variables over transition time, as shown in Figure 8.
Quantitative Science Studies
Figure 8 indicates that author avg. impact and journal avg. impact do not show distinguishing effects on formal knowledge and informal knowledge, which implies that the author avg. impact and journal avg. impact variables cannot determine the transition from informal knowledge to formal knowledge. Moreover, the average values of cumulative # pubs, cumulative # citations, cumulative # authors, cumulative # institutions, cumulative # funding, and cumulative # journals of formal knowledge are not less than those of informal knowledge. This suggests that these variables could be utilized to distinguish between formal knowledge and informal knowledge. Interestingly, only the knowledge relatedness values of informal knowledge are not less than those of formal knowledge. It could be understood that the smaller the value of knowledge relatedness, the more specific the semantics of the knowledge and the more likely it is to be adopted as formal knowledge. What is more, the cumulative # pubs, cumulative # authors, cumulative # institutions, and cumulative # journals variables distinguish formal knowledge from informal knowledge more effectively when the transition time is short than when it is long.
In contrast, the gap between the blue and red curves of cumulative # citations, cumulative # funding, and knowledge relatedness appears relatively larger in the longer transition time intervals than in the shorter ones. This means that the cumulative # citations, cumulative # funding, and knowledge relatedness variables have stronger distinguishing effects on knowledge entities with a long transition time, unlike the cumulative # pubs, cumulative # authors, cumulative # institutions, and cumulative # journals variables.
Furthermore, in Figure A-1 in Appendix A in the Supplementary material, we provide scatter plots of the knowledge entities in the coordinates of each variable versus transition time to give a comprehensive view of the differentiation performance of each variable. Combined with the above analysis, we suppose that these multidimensional variables could improve the performance of distinguishing formal from informal knowledge over all transition times. Likewise, it is essential to understand knowledge growth and transition patterns from multiple dimensions.
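The averaging step used in this subsection, the mean of a variable for each transition time interval, can be sketched in Python as follows. The record layout of (transition time in years, variable value) and the function name are assumptions for illustration.

```python
from collections import defaultdict

def mean_by_transition_time(records):
    """Average a variable over all entities sharing the same transition time."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for transition_time, value in records:
        sums[transition_time] += value
        counts[transition_time] += 1
    return {t: sums[t] / counts[t] for t in sorted(sums)}

# Hypothetical cumulative # pubs for entities with 5- and 10-year transitions.
formal = [(5, 40), (5, 60), (10, 30)]
informal = [(5, 10), (5, 20), (10, 25)]
print(mean_by_transition_time(formal))    # {5: 50.0, 10: 30.0}
print(mean_by_transition_time(informal))  # {5: 15.0, 10: 25.0}
```

Plotting the two resulting series against transition time gives one curve per group for a given variable.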

Static correlation analysis
Transition time is the interval of time that elapses from informal knowledge to formal knowledge. Thus, we conducted an analysis of the correlation between temporal variables and transition time to explore the influence factors of transition pace. Because some samples have shorter transition times and the values of cumulative variables increase with the consumption of transition time, the correlation coefficients were calculated respectively based on the first 5 years and 10 years of history data from informal knowledge to formal knowledge, as shown in Table 3.
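A minimal Python sketch of the static correlation follows. The text does not state which correlation coefficient was used, so Pearson's r is an assumption here, as are the function names, the toy data, and the choice to average over however many of the first `window` years an entity has.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient; NaN when either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)

def static_correlation(entities, window):
    """Correlate the annual average over the first `window` years with transition time.

    entities: list of (yearly_counts, transition_time) pairs.
    """
    xs = [sum(counts[:window]) / len(counts[:window]) for counts, _ in entities]
    ys = [t for _, t in entities]
    return pearson(xs, ys)

# Toy data: entities with more annual publications are adopted sooner.
entities = [([10] * 5, 5), ([6] * 8, 8), ([2] * 12, 12)]
for window in (5, 10):
    print(window, round(static_correlation(entities, window), 3))
```

A negative coefficient indicates that higher values of the temporal variable go with shorter transition times.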
Most temporal variables are negatively correlated with transition time, except journal avg. impact. This is intuitive: the more annual publications, citations, contributing authors, institutions, journals, and supporting funds there are, and the stronger the authors' impact, the faster new knowledge could be adopted into the domain knowledge base. Specifically, the absolute correlation coefficients of annual avg. funding, annual avg. knowledge relatedness, avg. authors per pub, journal avg. impact, and author avg. impact are not less than 0.4, whereas those of annual avg. pubs, annual avg. authors, annual avg. institutions, annual avg. citations, and annual avg. journals are less than 0.3. Furthermore, we calculated the mean values of annual avg. knowledge relatedness, avg. authors per pub, journal avg. impact, and author avg. impact corresponding to each transition time interval. Figure 9(a) shows the distribution of avg. authors per pub over transition time: The mean value of avg. authors per pub gradually decreases in the shorter time intervals, whereas it has an overall upward trend in the longer time intervals. This implies that the avg. authors per pub variable has different effects on formal knowledge with different transition time intervals. Figure 9(b) indicates that author avg. impact has an overall decreasing trend over transition time, which makes it a robust factor for transition pace. Figure 9(c) suggests that the values of annual avg. knowledge relatedness drop rapidly and then slowly, which is more effective for characterizing the transition pace of formal knowledge with short to medium transition times. Figure 9(d) suggests that the values of journal avg. impact decrease sharply and then stay relatively stable. In summary, a single variable may not correlate strongly with transition time over the whole span, and may even show opposite trends in the first and second halves of the transition time range. Thus, multidimensional variables are essential to provide a comprehensive description of the pace at which knowledge is adopted.

Dynamic correlation analysis
To further explore the dynamic correlation between temporal variables and transition time, we calculated the correlation coefficients consecutively in each span of history data. As the average transition time of all formal knowledge is 35 years, we calculated the correlation coefficients of the temporal and cumulative variables in the first 35 spans of history data. In the static correlation analysis, most correlation coefficients, except for that of journal avg. impact, are negative in the first 5 and 10 years of history data. To visualize them, the correlation coefficients of these variables were negated (zero minus the correlation coefficient) to obtain positive values. In addition, to highlight the performance of the temporal variables, we also calculated the dynamic correlation of the cumulative variables. The results of the dynamic correlation, a comparative analysis of temporal and cumulative variables that also includes the journal avg. impact, avg. authors per pub, and author avg. impact variables, are shown in Figure 10.
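As a rough illustration of this procedure, the sketch below recomputes the coefficient for each span of history data and negates it, once for a temporal (annual average) variable and once for its cumulative counterpart. It again assumes a Pearson coefficient; the toy records, the shortened span range (8 instead of 35), and the names `dynamic_correlation`, `temporal`, and `cumulative` are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either sequence has no variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def temporal(window):
    """Temporal variable: annual average over the span."""
    return sum(window) / len(window)

def cumulative(window):
    """Cumulative variable: running total over the span."""
    return sum(window)

def dynamic_correlation(entities, max_span, aggregate):
    """Negated correlation with transition time for spans 1..max_span."""
    ys = [t for _, t in entities]
    curve = []
    for span in range(1, max_span + 1):
        # Entities with shorter histories contribute all the years they have.
        xs = [aggregate(series[:span]) for series, _ in entities]
        curve.append(0 - pearson(xs, ys))  # zero minus the coefficient
    return curve

# Hypothetical toy records (assumption, not real data):
# (yearly counts since first appearance, transition time in years)
entities = [
    ([9, 8, 9, 7, 8, 9], 6),
    ([5, 6, 4, 5, 6, 5, 4, 5, 6, 5, 4, 5], 12),
    ([2, 1, 2, 1, 2, 2, 1, 1, 2, 1] * 2, 20),
    ([1, 0, 1, 0, 1, 0, 0, 1, 0, 1] * 3, 30),
]

temporal_curve = dynamic_correlation(entities, 8, temporal)
cumulative_curve = dynamic_correlation(entities, 8, cumulative)
for span, (a, c) in enumerate(zip(temporal_curve, cumulative_curve), start=1):
    print(f"span {span}: temporal {a:.3f}, cumulative {c:.3f}")
```

Plotting the two curves against the span index would give a comparison of the kind described for Figure 10, although the real analysis covers many more variables and entities.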
Comparing these variables, the correlation coefficients of annual avg. pubs, annual avg. authors, annual avg. citations, annual avg. institutions, and annual avg. journals are at first lower and then higher than those of cumulative # pubs, cumulative # authors, cumulative # citations, cumulative # institutions, and cumulative # journals, respectively. However, the correlation coefficients of annual avg. funding and annual avg. knowledge relatedness are always higher than those of cumulative # funding and knowledge relatedness. The maximum correlation values of avg. authors per pub, author avg. impact, annual avg. funding, journal avg. impact, annual avg. knowledge relatedness, and knowledge relatedness are all bigger than
At the author level, the values of avg. authors per pub and author avg. impact are more strongly correlated with transition time: The curves of the two variables both increase at first and then decrease. Specifically, avg. authors per pub has its maximum correlation value in the 14th span of history data (0.535), and author avg. impact has its maximum in the 26th span (0.589). These results indicate that the avg. authors per pub and author avg. impact variables could be used to characterize the pace of knowledge transition. This suggests that the higher the average number of authors per paper and the greater the impact of the authors, the earlier new knowledge receives attention from domain researchers.
At the funding level, the annual avg. funding variable has its maximum correlation value in the year of debut, whereas the curve of cumulative # funding reaches its peak in the first six years of history data. The correlation maximum of cumulative # funding is 0.237 while that of annual avg. funding is 0.433. The annual avg. funding variable is more correlated with transition time than cumulative # funding. This implies that early funding support is more important for improving the transition pace.
At the descriptor level, the correlation of annual avg. knowledge relatedness increases rapidly and then stays stable, whereas that of knowledge relatedness increases and then gradually decreases. The curve of annual avg. knowledge relatedness reaches an inflection point at the 10-year span, where the correlation value is 0.5. Looking at the distribution of knowledge relatedness over transition time, the knowledge relatedness variable has a correlation maximum of about 0.43 at the 7-year span. The curve of annual avg. knowledge relatedness is always above that of knowledge relatedness, which suggests that annual avg. knowledge relatedness has a stronger correlation with transition time and reveals the intrinsic and implicit interaction law of knowledge transition. Thus, the degree of relatedness with other knowledge is an important variable for characterizing the pace of knowledge transition.
At the journal level, the "SJR Quartile" was used to delineate the levels of journal impact. The "SJR Quartile" indicates the quartile to which a given journal belongs according to its impact: Quartile 1 (Q1) contains the highest-impact journals and Quartile 4 (Q4) the lowest. Formal knowledge is mainly derived from new knowledge that appears in high-impact journals. After statistical analysis, we found that formal knowledge originated from Q1 journals (14,359), Q2 journals (2,532), Q3 journals (154), and Q4 journals (18). This indicates that new knowledge entities that appear in high-impact journals are more likely to be transformed into formal knowledge. In terms of transition pace, the journal number gradually decreases with increasing transition time while the journal impact gradually increases, as shown in Figure 10. This suggests that formal knowledge with a shorter transition time has a bigger journal number but a weaker journal impact.
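The quartile counts reported above can be turned into shares with a few lines of arithmetic. The sketch below uses only the four counts given in the text; the variable names are illustrative, and the shares are relative to these four counts only.

```python
# Counts of formal knowledge by the SJR quartile of the originating journals,
# as reported in the text.
counts = {"Q1": 14359, "Q2": 2532, "Q3": 154, "Q4": 18}

total = sum(counts.values())
shares = {quartile: n / total for quartile, n in counts.items()}

for quartile, share in shares.items():
    print(f"{quartile}: {counts[quartile]:>6} ({share:.1%})")
print(f"total: {total}")
```

Of the 17,063 counted entities, roughly 84% come from Q1 journals and about 99% from Q1 and Q2 journals combined.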
To further understand the effect of journal impact on transition pace, the transition year is regarded as the "time of death" (i.e., the time at which the event occurred). We grouped formal knowledge by the initial journal impact to conduct survival analysis (González, García-Massó et al., 2018). Figure 11 shows that some descriptors from SJR Q1 have a longer survival time, with their curve at the top, and some from SJR Q2 have the shortest survival time, with their curve always at the bottom in the short intervals of transition time. This means that informal knowledge with a high journal impact is more likely to transition into formal knowledge, whereas a journal impact level lower than SJR Q1 improves the pace of transition.
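A grouped survival curve of this kind can be sketched as follows. The estimator is not named in the text, so the Kaplan-Meier estimator is used here as an assumption; because every formal knowledge entity has already transitioned, there is no censoring and the estimate reduces to the empirical survival function. The transition times and the function name `kaplan_meier` are hypothetical.

```python
from collections import Counter

def kaplan_meier(durations):
    """Kaplan-Meier survival curve when every entity experiences the event.

    Returns a list of (time, survival probability) pairs, one per distinct
    transition time, where the probability is the share of entities that
    have not yet transitioned after that time.
    """
    events = Counter(durations)  # number of transitions at each time
    at_risk = len(durations)     # entities that have not yet transitioned
    survival = 1.0
    curve = []
    for t in sorted(events):
        d = events[t]
        survival *= (at_risk - d) / at_risk
        curve.append((t, survival))
        at_risk -= d
    return curve

# Hypothetical transition times (years), grouped by initial SJR quartile.
groups = {
    "Q1": [5, 8, 12, 20, 30],
    "Q2": [3, 4, 6, 9, 15],
}

curves = {quartile: kaplan_meier(times) for quartile, times in groups.items()}
for quartile, curve in curves.items():
    print(quartile, [(t, round(s, 2)) for t, s in curve])
```

A group whose curve lies higher at a given time has a larger share of entities still awaiting transition, which is how the "top" and "bottom" curves in Figure 11 would be read.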

DISCUSSION
We aim to understand knowledge role transition from the perspective of knowledge codification, which influences the speed of knowledge creation, that is, of innovation. In this section, we discuss the implications of the major findings for current theories, which may inform new research challenges and ideas for future study.
Codified knowledge is beneficial in facilitating the creation, circulation, and reconstitution of knowledge. Knowledge codification is inherently a complex process, which is influenced by a variety of external and internal factors. In our work, the codification of knowledge is understood as its role transition. Theoretically, because each change in scientific knowledge is typically a reaction to scholarly publications, the explicit and implicit characteristics of scholarly publications could largely characterize the transition process of formal knowledge. In terms of practice, formal knowledge is represented as thesauri or ontologies. When a new knowledge entity is adopted into a thesaurus, it transitions into formal knowledge. In previous studies, scholars have explored the evolutionary pattern of scientific knowledge growth by dividing it into life cycle stages (Shimogawa et al., 2012). Accordingly, knowledge role transition is also a phenomenon in the process of knowledge growth and evolution, like the process of awakening sleeping knowledge (Yang, Bu et al., 2022).
To explore more internal and external elements that influence knowledge growth and evolution, previous studies have collected multidimensional data and developed different variables by taking advantage of the large-scale metadata of scholarly publications (Sharma & Khurana, 2021;Wang, 2018;Weis & Jacobson, 2021). These studies found that metadata could reveal the evolution of scientific knowledge growth, but the results varied for specific growth phenomena. For instance, cumulative citations were a better signal for identifying impactful research 5 years after publication (Weis & Jacobson, 2021), whereas the cumulative # citations variable is better at distinguishing knowledge entities with a long transition time in our work.
The transition possibility is defined to describe whether informal knowledge could transition into formal knowledge. This phenomenon could be regarded as a state change in the process of knowledge growth and evolution. We find that the cumulative variables derived from metadata are better at revealing this phenomenon. Specifically, cumulative # pubs, cumulative # authors, and cumulative # journals could better distinguish between formal knowledge and informal knowledge with a short transition time, whereas cumulative # citations, cumulative # funding, and knowledge relatedness are better at distinguishing knowledge entities with a long transition time. The values of cumulative # pubs, cumulative # authors, and cumulative # journals are fixed as soon as the literature is published, so these indicators tend to differentiate between knowledge entities with short transition times. The cumulative # citations, cumulative # funding, and knowledge relatedness indicators have a significant delayed effect (Mariani, Medo, & Zhang, 2016): It generally takes a few years for their advantage to emerge. The cumulative # funding variable could differentiate formal knowledge from informal knowledge: Formal knowledge has a higher median value of cumulative # funding than informal knowledge in Table 2. This finding indicates that funded knowledge has a higher potential for development than nonfunded knowledge, which is consistent with the results of a recent study (Mosleh, Roshani, & Coccia, 2022). Further, our study explores the factors influencing the growth and evolution of scientific knowledge from a more microscopic perspective and finds that each metadata variable has its own scope of applicability.
In terms of transition pace, temporal variables are better suited to describing the pace of knowledge transition. The author avg. impact variable is more strongly correlated with transition time: the correlation coefficient for the first 10 years is 0.49 and the maximum is 0.59, significant at the 0.0001 level. High-impact authors have a leadership effect that attracts more followers (Bu, Ding et al., 2018), and knowledge entities that gain more attention are more likely to be codified, which contributes to the transition from informal knowledge to formal knowledge. In addition, it is important to note that only the journal avg. impact variable has a positive correlation with transition time, whereas the others are negatively correlated with transition time. We further find that some knowledge entities from SJR Q1 take a longer time to become formal knowledge and some from SJR Q2 need a shorter time, which is consistent with the result that those from SJR Q2 show a longer average survival time than those from SJR Q1 (Peset et al., 2020). The two results both indicate that the knowledge entities or keywords from SJR Q2 journals are of greater concern to researchers. Most surprising is that a smaller knowledge relatedness value improves the possibility of transition, whereas a bigger knowledge relatedness or annual avg. knowledge relatedness value improves the transition pace of formal knowledge. According to the calculation approach of knowledge relatedness in Appendix B in the Supplementary material, the broader the semantics of a knowledge entity, the more other knowledge entities co-occur with it, and the more balanced the distribution of co-occurrence.
The more convergent the semantics of a knowledge entity, the more likely it is to be codified, and the broader the semantics, the faster the role transition. In the future, the balance point between transition pace and transition possibility is an interesting research problem for scientometrics.

CONCLUSION
Knowledge role transition is a highly complex process that is influenced by a variety of external and internal factors. By analyzing the large-scale metadata of publications in PubMed, we found that some cumulative variables (i.e., cumulative # pubs, cumulative # authors, cumulative # institutions, and cumulative # journals) tend to distinguish formal knowledge with short transition times, whereas cumulative # citations, cumulative # funding, and knowledge relatedness distinguish those with long transition times. The temporal variables (i.e., avg. authors per pub, author avg. impact, annual avg. funding, journal avg. impact, and annual avg. knowledge relatedness) are more strongly correlated with transition time. Specifically, early funding support is more important for improving the transition pace, notably in the year of debut. Journal impact has a positive correlation with the transition possibility but a negative correlation with transition pace. Weaker knowledge relatedness raises the transition possibility, whereas stronger knowledge relatedness improves the transition pace.
This study has significant theoretical and practical implications regarding knowledge codification and role transition. In theoretical terms, it helps to better understand the implicit and dynamic patterns of knowledge codification. In practical terms, the findings concerning knowledge role transition patterns will help maintenance experts to update the terms or concepts of thesauri by recommending credible and valuable new domain knowledge entities. The thesaurus is also an important indexing tool in information retrieval systems: Detecting new domain knowledge benefits the automatic annotation of scientific literature, which improves the performance of academic retrieval systems. More importantly, understanding knowledge role transition allows us to learn from the past to improve the ability to detect knowledge innovation in the future. Overall, these findings are of great significance for domain knowledge management and early detection of credible and valuable knowledge.
However, there are some potential limitations in our work. First, because we focus on biomedical publications, these findings may not generalize to other disciplines. Second, to reduce the complexity of the calculation, we selected preferred concepts of MeSH to represent knowledge entities. Third, we have conducted only a descriptive statistical analysis of the indicator set in the process of knowledge role transition, not an in-depth analysis of the underlying mechanisms. In future work, we should semantically encode knowledge entities and investigate the patterns of knowledge role transition in other disciplines.