Abstract
Finding the lineage of a research topic is crucial for understanding the prior state of the art and advancing scientific knowledge. The deluge of scholarly articles makes it difficult to locate the most relevant previous work and causes researchers to spend a considerable amount of time building up their literature lists. Citations play a crucial role in discovering relevant literature. However, not all citations are created equal. The majority of the citations that a paper receives provide contextual and background information to the citing papers; in those cases, the cited paper is not central to the theme of the citing papers. Some papers, however, build upon a given paper and further the research frontier; in those cases, the cited paper plays a pivotal role in the citing paper, and the nature of the citation the former receives from the latter is significant. In this work, we discuss our investigations towards discovering significant citations of a given paper. We further show how we can leverage significant citations to build a research lineage via a significant citation graph. We demonstrate the efficacy of our idea with two real-life case studies. Our experiments yield promising results with respect to the current state of the art in classifying significant citations, outperforming earlier approaches by a relative margin of 20 points in terms of precision. We hypothesize that such an automated system can facilitate relevant literature discovery and help identify the knowledge flow for a particular category of papers.
1. INTRODUCTION
Literature searches are crucial to discovering relevant publications. The knowledge discovery that ensues forms the basis of understanding a research problem, finding the previously explored frontiers, and identifying research gaps, which eventually leads to the development of new ideas. However, with the exponential growth of scientific literature (including published papers and preprints) (Ghosal, Sonam et al., 2019b), it is almost impossible for a researcher to go through the entire body of scholarly work, even in a very narrow domain. Citations play an important role here in finding the relevant articles that further topical knowledge. However, not all citations are equally effective (Zhu, Turney et al., 2015) in finding relevant research. A majority of papers cite a work contextually (Pride & Knoth, 2017a) to provide additional background. Such contextual citations help in the broader understanding; however, they are not central to the citing paper's theme. Some papers use the ideas in a given paper, build upon those ideas, and advance the body of relevant research. Such papers are expected to duly acknowledge the prior work (by citing it). However, the nature of the citation in this case is different from that of contextual citations. These citations, which heavily rely on or build upon a given work, are significant citations. However, the current citation count metric puts equal weight on all citations. It is therefore inadequate for identifying the papers that have significantly cited a given work and may have taken the relevant research forward. Identifying such significant citations is hence crucial to the literature study.
It is not uncommon for authors to fail to acknowledge the role relevant papers played in influencing their ideas (Rousseau, 2007; Van Noorden, 2017). As a result, researchers spend much of their time searching for the papers most relevant to their research topic and locating the subsequent papers that carried a given scientific idea forward. It is usually desirable for a researcher to understand the story behind a prior work and trace a concept's emergence and gradual evolution through publications, thereby identifying the knowledge flow. Researchers ideally curate their literature base by identifying significant references to a given paper and then hierarchically locating meaningful prior work.
The idea of recognizing significant citations is also important for understanding the true impact of a given research work or facility. To understand how pervasive a particular piece of research was in the community, it is essential to understand its influence beyond the direct citations it received. To this end, tracking the transitive influence of research by identifying significant citations could be one possible solution.
In this work, we develop automatic approaches to trace the lineage of given research via transitively identifying the significant citations to a given article. The overall objective of our work is twofold:
Accelerate relevant literature discovery via establishing a research lineage.
Find the true influence of a given work and its pervasiveness in the community beyond citation counts.
There are two aspects to the problem: identifying the relevant prior work, and identifying the follow-up works that stemmed from or were influenced by the current work. The first aspect facilitates the discovery of relevant prior literature for a paper; the second facilitates discovering the knowledge flow into subsequent relevant papers. Obviously, our approach is not a one-size-fits-all solution. Still, we believe it is effective in finding investigations that build upon relevant priors, facilitating relevant literature discovery, and thereby steering towards identifying the pervasiveness of a given piece of research in the community. We base our work on classifying citations as contextual or significant and trace the lineage of research in a citation graph by identifying significant edges. The major contributions of the current work are the following:
We use a set of novel and rich features to classify citations as significant or contextual.
We propose a graph-based approach to tracing the lineage of a given research work, leveraging citation classification.
2. RESEARCH LINEAGE
The mechanism of citations in academia is not always transparent (Van Noorden & Singh Chawla, 2019; Vîiu, 2016; West, Stenius, & Kettunen, 2017). Problems such as coercive citations (Wilhite & Fong, 2012), anomalous citations (Bai, Xia et al., 2016), citation manipulation (Bartneck & Kokkelmans, 2011), rich-get-richer effects (Ronda-Pupo & Pham, 2018), and discriminatory citation practices (Camacho-Miñano & Núñez-Nickel, 2009) have infested the academic community. However, in spite of all these known issues, citation counts and h-indices still remain the measures of research impact and tools for academic incentives, though long debated by many (Cerdá, Nieto, & Campos, 2009; Laloë & Mosseri, 2009). Usually, we measure the impact of a given paper by the direct citations it receives. However, a given piece of research may have induced a transitive effect on other papers, which is not apparent from the current citation count measures. Figure 1 shows a sample citation network where A could be a paper or a research facility. We want to know how pervasive the research or facility A was in the community. At depth d = 1 are the direct citations to A. We see that article B cites A significantly, or B is inspired by A; the other citations to A are background citations. At citation depth d = 2, we see that articles C and D significantly cite B (direct citations), and that C also cites A significantly. Finally, at citation depth d = 3, E significantly cites C. We intend to understand whether there is a lineage of research from A to E (A → B → C → E). Although E does not cite A directly, can we identify A's influence on E? If E is a seminal work receiving hundreds of citations, can we infer that A was the prior work that indirectly inspired E? We are interested in discovering such hidden inspirations to honestly assess the contributions of a research article or facility.
3. RELATED WORK
Measuring academic influence has become a research topic because publications are associated with academic prestige and incentives. Several metrics (impact factor, Eigenfactor, h-index, citation counts, altmetrics, etc.) have been devised to comprehend research impact efficiently. Still, each one is motivated by a different aspect and has found varied importance across disciplines. Zhu et al. (2015) did pioneering work on academic influence prediction leveraging citation context. Shi, Wang et al. (2019) presented a visual analysis of citation context-based article influence ranking. Xie, Sun, and Shen (2016) predicted paper influence in an academic network by taking into account the content and venue of a paper, as well as the reputation of its authors. Shen, Song et al. (2016) used topic modeling to measure academic influence in scientific literature. Manju, Kavitha, and Geetha (2017) identified influential researchers in an academic network using a rough-set based selection of time-weighted academic and social network features. Pileggi (2018) did a citation network analysis to measure academic influence. Zhang and Wu (2020) used a dynamic academic network to predict the future influence of papers. Ji, Tang, and Chen (2019) analyzed the impact of academic papers based on improved PageRank. F. Wang, Jia et al. (2019) assessed the academic influence of scientific literature via altmetrics. F. Zhao, Zhang et al. (2019) measured academic influence using heterogeneous author-citation networks. Recently, many deep learning–based methods have been explored for citation classification. Perier-Camby, Bertin et al. (2019) attempt to compare deep learning–based methods with rule-based methods. They use deep learning–based feature extractors such as BCN (McCann, Bradbury et al., 2017) and ELMo (Peters, Neumann et al., 2018) to extract semantic information and feed it to various classifiers. They conclude that neural networks could be a promising direction for citation classification when a large number of samples is available. However, for a small data set such as the one we use, rule-based methods clearly hold an advantage. Moreover, the features used in rule-based methods are more comprehensible than those extracted by deep learning methods, thus providing deeper insight into the factors that make a citation significant or contextual.
The closest literature for our task is that on citation classification. Citation classification has been explored in the works of Alvarez, Soriano, and Martínez-Barco (2017), Dong and Schäfer (2011), Qayyum and Afzal (2019), and Teufel, Siddharthan, and Tidhar (2006). These works use features from the perspective of citation motivation. On the other hand, some works emphasize features from a semantic perspective. Wang, Zhang et al. (2020) use syntactic and contextual information of citations for classification. Aljuaid, Iftikhar et al. (2021) and Amjad and Ihsan (2020) perform classification based on sentiment analysis of in-text citations. Athar (2011) and Ihsan, Imran et al. (2019) propose sentiment analysis of citations using linguistic studies of the citance. More recently, several open-source data sets for citation classification have been developed in the work of Cohan, Ammar et al. (2019) and Pride and Knoth (2020). Valenzuela, Ha, and Etzioni (2015) explored citation classification into influential and incidental using machine learning techniques, labels that we adapt as significant and contextual, respectively, in this work.
In this work, we propose a rich set of features informed by both citation and context (semantics) perspectives, leveraging the advantages of both types and thus performing better than all of the methods mentioned above. However, our problem is motivated beyond citation classification. Unlike Valenzuela et al. (2015), we restrict our classification labels to significant and contextual, as these two labels suffice to trace the lineage of a work. Furthermore, to the best of our knowledge, no prior work leverages citation classification for finding a research lineage. Hence, we compare our performance with other approaches only on the citation significance detection subtask.
4. DATA SET DESCRIPTION
We experiment with the Valenzuela data set (Valenzuela et al., 2015) for our task. The data set consists of incidental/influential human judgments on 630 citing-cited paper pairs for articles drawn from the 2013 ACL Anthology, the full texts of which are publicly available. Two expert human annotators determined the judgment for each citation, and each citation was assigned a label. Using the authors' binary classification, 396 citation pairs were labeled as incidental citations and 69 (14.3%) as influential (important) citations. To demonstrate our research lineage idea, we explore knowledge flow on certain papers on Document-Level Novelty Detection (Ghosal, Salam et al., 2018b) and on the High Performance Computing (HPC) algorithm MENNDL (Young, Rose et al., 2015). The actual authors of these two topics helped us with the manual annotation of their papers' lineage.
5. METHODOLOGY
To identify significant citations, we pursue a feature-engineering approach and curate several features from cited-citing paper pairs. The objective is to classify the citations received by a given paper as SIGNIFICANT or CONTEXTUAL. The original cited and citing papers in the Valenzuela data set are in PDF. We convert the PDFs to corresponding XMLs using GROBID (Lopez, 2009) and manually correct a few inconsistent files so that there is no discrepancy. A minimal sketch of this conversion step is shown below.
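The sketch below illustrates the PDF-to-XML step against GROBID's REST service. The service URL (GROBID's default local port) and the folder names are illustrative assumptions, not part of our released pipeline.

```python
import requests
from pathlib import Path

# Minimal sketch of the PDF-to-TEI-XML conversion, assuming a GROBID
# service running locally on its default port; folder names are illustrative.
GROBID_URL = "http://localhost:8070/api/processFulltextDocument"

def pdf_to_tei_xml(pdf_path: Path) -> str:
    """Send one PDF to GROBID and return the TEI XML as a string."""
    with open(pdf_path, "rb") as fh:
        response = requests.post(GROBID_URL, files={"input": fh})
    response.raise_for_status()
    return response.text

out_dir = Path("xml")
out_dir.mkdir(exist_ok=True)
for pdf in Path("papers").glob("*.pdf"):  # hypothetical input folder
    (out_dir / f"{pdf.stem}.tei.xml").write_text(pdf_to_tei_xml(pdf),
                                                 encoding="utf-8")
```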
Citation frequency inside the body of citing paper (F1): We measure the number of times the cited paper is referenced from within the citing paper’s body. The intuition is that if a paper is cited multiple times, the cited paper may be significant to the citing paper.
Are the authors of citing and cited paper the same? (Boolean) (F2): We check if the authors of the citing and cited papers are the same. This might be a case of self-citation or can also signal the extension of the work.
Author overlap ratio (F3): This measures the number of common authors in the citing and cited papers, normalized by the total number of authors in the citing paper. The intuition is similar to F2.
Is the citation occurring in a table or figure caption? (Boolean) (F4): The intuition is that most citations in tables and figures appear for comparison or significant referencing of existing work. Hence, the citing paper might be an extension of the cited article or may have compared itself with this earlier significant work.
Is the citation occurring in groups? (Boolean) (F5): We check whether the citation occurs alongside other citations in a group. The intuition is that such citations generally appear in related work sections to highlight a background detail; hence, they might not be significant citations.
Number of citations to the cited paper normalized by the total number of citations made by the citing paper (F6): This measures the number of citations to the cited paper by the citing paper normalized by the total number of citation instances in the citing paper. This measures how frequently the cited paper is mentioned compared to other cited papers in the citing paper.
Number of citations to the cited paper normalized by the total number of bibliography items in the citing paper (F7): This measures the number of citations to the cited paper normalized to the total number of bibliography items in the citing paper. The intuition is similar to F6.
tf-idf similarity between abstracts of the cited and citing paper (F8): We take the cosine similarity between the tf-idf representations of the abstracts of the cited and citing papers. The intuition is that if the similarity is high, the citing paper may be inspired by or extend the cited paper.
tf-idf similarity between titles of the cited and citing paper (F9): We take cosine similarity between the tf-idf representations of the titles of cited and citing papers.
Average tf-idf similarity between citance and abstract of the cited paper (F10): We calculate the similarity of each citance with the abstract of the cited article and take the average. Citances are the sentences containing the citations in the citing paper; they reveal the purpose of the cited paper within the citing paper. Abstracts contain the contribution/purpose statements of a given paper. Hence, high similarity with the citances may suggest that the cited paper has been used significantly in the citing paper.
Maximum tf-idf similarity between citance and abstract of the cited paper (F11): We take the maximum of similarity of the citances (there could be multiple citation instances of the same paper in a given paper) with the abstract of the cited paper.
Average tf-idf similarity between citance and title of the cited paper (F12): We calculate the similarity of each citance with the title of the cited paper and take an average of it.
Maximum tf-idf similarity between citance and title of the cited paper (F13): We take the maximum of similarity of the citances with the title of the cited paper.
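The tf-idf similarity features (F8-F13) all follow the same pattern; a minimal sketch for F10 and F11 is given below, assuming the abstracts and citances have already been extracted as strings. The vectorizer settings (English stop words) are our assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_sim(text_a, text_b, vectorizer):
    """Cosine similarity between the tf-idf vectors of two texts."""
    vecs = vectorizer.transform([text_a, text_b])
    return float(cosine_similarity(vecs[0:1], vecs[1:2])[0, 0])

def citance_abstract_similarity(citances, cited_abstract, corpus):
    """F10 (average) and F11 (maximum) citance-abstract similarity.
    `corpus` is any list of texts used to fit the shared vocabulary."""
    vectorizer = TfidfVectorizer(stop_words="english").fit(corpus)
    sims = [tfidf_sim(c, cited_abstract, vectorizer) for c in citances]
    return sum(sims) / len(sims), max(sims)
```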
Average length of the citance (F14): Average length of the citances (in words) for multiple citances. The intuition is that if the citing paper has spent many words on the cited article, it may have significantly cited the corresponding article.
Maximum length of the citance (F15): Maximum length of the citances (in words).
No. of words between citances (F16): We take the average of the number of words between each pair of consecutive citances of the cited paper. This is set to 0 in the case of a single citance.
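A minimal sketch of the citance statistics (F14-F16) follows; how the word offsets of each citance within the citing paper's body are computed is an implementation detail we assume here.

```python
def citance_statistics(citances, citance_offsets):
    """F14-F16 for one cited paper. `citance_offsets` holds the word
    offset of each citance within the citing paper's body (assumed)."""
    lengths = [len(c.split()) for c in citances]
    f14 = sum(lengths) / len(lengths)              # average citance length
    f15 = max(lengths)                             # maximum citance length
    offsets = sorted(citance_offsets)
    gaps = [b - a for a, b in zip(offsets, offsets[1:])]
    f16 = sum(gaps) / len(gaps) if gaps else 0     # 0 for a single citance
    return f14, f15, f16
```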
In how many different sections does the citation appear in the citing paper? (F17): We take the number of different sections in which the citation to a cited paper occurs and normalize it with the total number of sections present in the citing paper. The intuition is that if a citation occurs in most sections, it might be a significant citation.
Number of common references in citing and cited paper normalized by the total number of references in citing article (F18): We count the number of common bibliographic items present in the citing and cited papers and normalize it with total bibliographic items present in the citing paper.
Number of common keywords between abstracts of the cited and citing paper extracted by YAKE (Campos, Mangaravite et al., 2018) (F19): We compare the number of common keywords between the abstracts of the citing and cited papers extracted using YAKE. Our intuition is that a greater number of common keywords denotes a greater similarity between the abstracts.
Number of common keywords between titles of the cited and citing paper extracted by YAKE (F20): We compare the number of common keywords between the titles of the citing and cited papers extracted using YAKE.
Number of common keywords between the body of the cited and citing papers extracted by YAKE (F21): We compare the number of common keywords between the body of the citing and cited papers extracted using YAKE.
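The YAKE overlap features (F19-F21) can be sketched as follows; the number of keywords extracted per text (top=20) is our assumption, not a setting reported above.

```python
import yake

# Default YAKE extractor; top=20 keywords per text is an assumption.
extractor = yake.KeywordExtractor(top=20)

def common_keyword_count(text_a, text_b):
    """Number of shared YAKE keywords between two texts (F19-F21)."""
    keys_a = {kw.lower() for kw, _ in extractor.extract_keywords(text_a)}
    keys_b = {kw.lower() for kw, _ in extractor.extract_keywords(text_b)}
    return len(keys_a & keys_b)

# e.g., F19 = common_keyword_count(citing_abstract, cited_abstract)
```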
Word Mover’s Distance (WMD) (Huang, Guo et al., 2016) between the abstracts of the cited and citing papers (F22): We measure the WMD between the abstracts of the citing and cited papers. The essence of this feature is to calculate semantic distance/similarity between abstracts of the two papers.
WMD between titles of the cited and citing papers (F23): We measure the WMD between the titles of the citing and cited papers.
WMD between the bodies of the cited and citing papers (F24): We measure the WMD between the bodies of the citing and cited papers.
Average WMD between the citances and the abstract of the cited paper (F25): We take the average of the WMDs between each citance and the abstract of the cited paper.
Maximum WMD between the citances and the abstract of the cited paper (F26): We take the maximum of the WMDs between each citance and the abstract of the cited paper.
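A minimal sketch of the WMD features (F22-F26) with gensim is shown below. WMD requires pretrained word embeddings; the specific model loaded here (Google News word2vec vectors) is our assumption.

```python
import gensim.downloader as api
from gensim.utils import simple_preprocess

# Pretrained embeddings are required for WMD; the choice of model here
# is an assumption, not specified in the text above.
vectors = api.load("word2vec-google-news-300")

def wmd(text_a, text_b):
    """Word Mover's Distance between two texts (basis of F22-F26)."""
    return vectors.wmdistance(simple_preprocess(text_a),
                              simple_preprocess(text_b))

def citance_wmd_features(citances, cited_abstract):
    """F25 (average) and F26 (maximum) citance-abstract WMD."""
    dists = [wmd(c, cited_abstract) for c in citances]
    return sum(dists) / len(dists), max(dists)
```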
Average VADER (Gilbert & Hutto, 2014) polarity index—Positive (F27), Negative (F28), Neutral (F29), Compound (F30): We measure the VADER polarity index of all the citances of the cited paper, and take their average for each sentiment (positive, negative, neutral, and compound).
Maximum VADER polarity index—Positive (F31), Negative (F32), Neutral (F33), Compound (F34) of citances: We measure the VADER polarity index of all the citances of the cited paper, and take the maximum among them for each sentiment (positive, negative, neutral, and compound). The intuition to use sentiment information is to understand how the citing paper cites the cited paper.
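The VADER features (F27-F34) reduce to averaging and maximizing the four polarity scores over all citances of a cited paper; a minimal sketch:

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def vader_features(citances):
    """F27-F34: average and maximum of each VADER polarity score
    (positive, negative, neutral, compound) over all citances."""
    scores = [analyzer.polarity_scores(c) for c in citances]
    feats = {}
    for key in ("pos", "neg", "neu", "compound"):
        vals = [s[key] for s in scores]
        feats[f"avg_{key}"] = sum(vals) / len(vals)   # F27-F30
        feats[f"max_{key}"] = max(vals)               # F31-F34
    return feats
```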
Number of common venues in the bibliographies of the citing and cited papers (F35): We count the number of common venues mentioned in the bibliographies of the citing and cited papers and normalize it with the number of unique venues in the citing paper. Higher venue overlap would signify that the papers are in the same domain (Ghosal et al., 2019b).
Number of common authors in the bibliographies of the citing and cited papers (F36): We count the number of common authors mentioned in the bibliographies of the citing and cited papers and normalize it with the number of unique authors in the citing paper (Ghosal et al., 2019b).
As mentioned earlier, only 14.3% of the total citations are labeled as significant, which poses a class imbalance problem. To address this, we use SMOTE (Chawla, Bowyer et al., 2002) along with random undersampling of the majority (contextual citation) class. We first split the data set into 60% training and 40% test data. Then we undersample the majority class by 50% and oversample the minority class by 40% on the training partition of the data set.
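A minimal sketch of this resampling step with imbalanced-learn follows, assuming X is the 36-dimensional feature matrix and y the binary labels; the exact sampling_strategy values are our reading of the percentages above and may need tuning.

```python
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# X: feature matrix (F1-F36), y: labels (1 = significant, 0 = contextual).
# Both are assumed to exist already.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.6, stratify=y, random_state=42)

# Oversample the minority class, then undersample the majority class;
# the two ratios below are our interpretation of "40%" and "50%".
X_res, y_res = SMOTE(sampling_strategy=0.4,
                     random_state=42).fit_resample(X_tr, y_tr)
X_res, y_res = RandomUnderSampler(sampling_strategy=0.5,
                                  random_state=42).fit_resample(X_res, y_res)
```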
6. EVALUATION
Our evaluation consists of two stages. First, we evaluate our approach on the citation significance task. Next, we examine whether we can identify the research lineage by tracing significant citations across the two research topics (Document-Level Novelty and MENNDL). We ask the original authors to annotate the lineage and verify it against our automatic method. We train our model on the Valenzuela data set, use the trained model to predict significant citations of the Document-Level Novelty and MENNDL papers, and thereby try to visualize the research lineage across the citing papers. We curate a small citation graph to demonstrate our idea. Note that the task we address is Citation Significance Detection, which differs from Citation Classification in the literature: whereas citation classification focuses on identifying a citation's intent, citation significance detection aims to identify the value associated with the citation. The two tasks are related, but their objectives differ.
6.1. Citation Significance Detection
The goal of this task is to identify whether a citation is SIGNIFICANT or CONTEXTUAL. We experiment with several classifiers for this binary classification task: kNN (k = 3), Support Vector Machines (RBF kernel), Decision Trees (maximum depth = 10), and Random Forest (15 estimators, maximum depth = 10). We found Random Forest to be the best performing one with our feature set. Table 1 shows our results against earlier reported results on the Valenzuela data set. We attain promising results compared to earlier approaches, with a relative improvement of 20 points in precision. As the data set is small, neither the earlier works nor we attempted a deep neural approach to citation classification on this data set. Like us, Qayyum and Afzal (2019) used Random Forest as the classifier; however, they relied on metadata features rather than content-based features. Their experiments tried to answer the following questions: To what extent can the similarities and dissimilarities between metadata parameters serve as useful indicators for important citation tracking? And which metadata parameters, or combinations thereof, help achieve good results? We, in contrast, work with full-text content-based features. Our approach thus leverages richer information than the purely metadata-based approach of Qayyum and Afzal (2019), which helps us achieve better performance.
Table 1. Precision compared with earlier approaches on the Valenzuela data set

| Methods | Precision |
| --- | --- |
| Valenzuela et al. (2015) | 0.65 |
| Qayyum and Afzal (2019) | 0.72 |
| Nazir, Asif, and Ahmad (2020a) | 0.75 |
| Nazir, Asif et al. (2020b) | 0.85 |
| Current approach | 0.92 |
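The classifier comparison can be reproduced along these lines; the hyperparameters are those reported above, and X_res/y_res and X_te/y_te carry over from the resampling sketch in Section 5.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# The four classifiers with the hyperparameters reported above.
models = {
    "kNN": KNeighborsClassifier(n_neighbors=3),
    "SVM": SVC(kernel="rbf"),
    "Decision Tree": DecisionTreeClassifier(max_depth=10),
    "Random Forest": RandomForestClassifier(n_estimators=15, max_depth=10),
}
for name, clf in models.items():
    clf.fit(X_res, y_res)
    print(name)
    print(classification_report(y_te, clf.predict(X_te), digits=2))
```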
Table 2 shows the classification results of the various classifiers we experimented with. Our features are highly interdependent (Section 5), which helps explain the better performance of Random Forest.
Table 2. Performance of the classifiers we experimented with

| Methods | Precision | Recall | F1 score | Accuracy |
| --- | --- | --- | --- | --- |
| kNN | 0.80 | 0.87 | 0.83 | 0.81 |
| SVM | 0.79 | 0.67 | 0.73 | 0.81 |
| Decision Tree | 0.80 | 0.82 | 0.81 | 0.86 |
| Random Forest | 0.92 | 0.82 | 0.87 | 0.90 |
Figure 2 shows the importance of the top 10 features ranked by their information gain. However, our experimental data set is small and our features are correlated; hence, some features appear to have only marginal contributions. We expect that on a bigger, real-life data set the feature significance would be more visible. The top 10 features are: the distance between citances; the number of citations to the cited paper normalized by the total number of citations made by the citing paper; the similarity between the cited and citing abstracts; the in-text citation frequency; the average similarity between citance and the cited paper's abstract; the number of citations to the cited paper normalized by the total number of references in the citing paper; the number of common YAKE keywords between the bodies of the citing and cited papers; the average similarity between citance and the title of the cited paper; the maximum similarity between citance and the abstract of the cited paper; and the neutral sentiment polarity of the citances. We explain the possible reasons behind the performance of these features in the subsequent sections. The precision using only the top 10 features is 0.73; hence, the other features play a significant role as well. A complete list of features and the corresponding information gain is given in Table 3.
Table 3. Information gain (IG) of each feature

| Feature | IG | Feature | IG | Feature | IG |
| --- | --- | --- | --- | --- | --- |
| F16 | 0.147 | F24 | 0.024 | F30 | 0.015 |
| F7 | 0.070 | F13 | 0.022 | F27 | 0.015 |
| F8 | 0.070 | F33 | 0.021 | F31 | 0.015 |
| F1 | 0.065 | F18 | 0.020 | F32 | 0.014 |
| F10 | 0.061 | F3 | 0.020 | F15 | 0.014 |
| F6 | 0.041 | F23 | 0.019 | F9 | 0.013 |
| F21 | 0.033 | F35 | 0.019 | F17 | 0.011 |
| F12 | 0.031 | F34 | 0.017 | F36 | 0.011 |
| F13 | 0.030 | F19 | 0.017 | F4 | 0.006 |
| F33 | 0.030 | F22 | 0.016 | F5 | 0.006 |
| F35 | 0.025 | F28 | 0.016 | F2 | 0.004 |
| F26 | 0.024 | F25 | 0.016 | F20 | 0.003 |
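One way to reproduce the information-gain ranking behind Figure 2 and Table 3 is via scikit-learn's mutual information estimator (a standard proxy for information gain); the training split reuses names from the resampling sketch above, and `feature_names` is assumed to follow the F1-F36 ordering.

```python
from sklearn.feature_selection import mutual_info_classif

# Estimate the information gain of each feature on the training split.
feature_names = [f"F{i}" for i in range(1, 37)]
gains = mutual_info_classif(X_tr, y_tr, random_state=42)
for name, gain in sorted(zip(feature_names, gains),
                         key=lambda p: p[1], reverse=True)[:10]:
    print(f"{name}: {gain:.3f}")
```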
To analyze the contribution of each feature, we evaluate our model using a single feature at a time, similar to Valenzuela et al. (2015). The precision when considering each feature individually is shown in Table 4. The first 28 features in the table contribute significantly to the classification, and the overall precision when considering all the features is even better (an improvement of 14 points). F1, F4 (suggesting that significant citations do occur in tables or figures), and F21, followed by F7, F19, and F3, are the best performing features. This indicates that features obtained from a citation perspective are more useful. On the other hand, the worst performing features are F20 (perhaps due to the small size of the data set), F13, F5 (suggesting that significant citations also occur in groups), F17, F2, F22, and F33. Most of our observations are in line with Valenzuela et al. (2015).
Table 4. Precision when using each feature individually

| Feature | Precision | Feature | Precision | Feature | Precision |
| --- | --- | --- | --- | --- | --- |
| F1 | 0.78 | F15 | 0.28 | F25 | 0.15 |
| F4 | 0.76 | F8 | 0.27 | F11 | 0.14 |
| F21 | 0.71 | F10 | 0.27 | F24 | 0.13 |
| F7 | 0.68 | F9 | 0.25 | F31 | 0.11 |
| F19 | 0.61 | F27 | 0.23 | F26 | 0.10 |
| F3 | 0.50 | F23 | 0.20 | F33 | 0.08 |
| F16 | 0.47 | F36 | 0.20 | F22 | 0.07 |
| F28 | 0.43 | F35 | 0.20 | F2 | 0.04 |
| F35 | 0.37 | F12 | 0.19 | F17 | 0.04 |
| F6 | 0.33 | F34 | 0.19 | F5 | 0.03 |
| F32 | 0.33 | F30 | 0.17 | F13 | 0.03 |
| F33 | 0.29 | F18 | 0.15 | F20 | 0.01 |
| Total | 0.92 | | | | |
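A sketch of the single-feature ablation behind Table 4, reusing the resampled training data, the test split, and `feature_names` from the earlier sketches:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score

# Train the same classifier on one feature column at a time and record
# the precision on the held-out test split.
for j, name in enumerate(feature_names):
    clf = RandomForestClassifier(n_estimators=15, max_depth=10,
                                 random_state=42)
    clf.fit(X_res[:, [j]], y_res)
    prec = precision_score(y_te, clf.predict(X_te[:, [j]]))
    print(f"{name}: {prec:.2f}")
```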
Pride and Knoth (2017b) found the number of direct citations, author overlap, and abstract similarity to be the most important features. Our approach performs well enough to proceed to the next stage.
It is important to note that with so many features, some may be correlated. Hence, we compute the Pearson correlation coefficient between each pair of features to see how dependent they are on each other. The heatmap of the correlation matrix is shown in Figure 3.
We find that the average correlation coefficient across all feature pairs is 0.074. However, a few pairs of features have high correlation coefficients; we list such pairs in Table 5.
Table 5. Feature pairs with high correlation coefficients

| Feature pair | Correlation coefficient | Feature pair | Correlation coefficient |
| --- | --- | --- | --- |
| F10 & F11 | 0.937 | F25 & F26 | 0.910 |
| F30 & F34 | 0.919 | F9 & F20 | 0.907 |
| F28 & F32 | 0.917 | F29 & F33 | 0.905 |
| F27 & F31 | 0.914 | F1 & F15 | 0.835 |
| F12 & F13 | 0.910 | F27 & F29 | 0.832 |
From Table 5 we can see that the pairs F10 & F11, F30 & F34, F28 & F32, F27 & F31, F12 & F13, F25 & F26, and F29 & F33 are highly correlated, which is understandable, as each of these pairs is the maximum and the average of the same quantity. Hence, to reduce the complexity of the classifier, one may use just one feature from each pair. The results after combining these features are shown in Table 6. Even after combining these features, there is no significant degradation in the performance of our model.
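The correlation analysis behind Figure 3 and Table 5 can be sketched as follows; `X` and `feature_names` carry over from the earlier sketches.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# One column per feature (F1-F36); Pearson is pandas' default method.
df = pd.DataFrame(X, columns=feature_names)
corr = df.corr(method="pearson")
sns.heatmap(corr, cmap="coolwarm", center=0.0)
plt.show()

# Highest-correlated feature pairs (upper triangle, diagonal excluded).
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().sort_values(ascending=False)
print(pairs.head(10))
```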
6.2. The 3C Data Set
As mentioned earlier, the data set we use is small, so the significance of each feature might not be explicitly visible. Hence, we also test our method on the larger 3C data set. The 3C Citation Context Classification Shared Task, organized as part of the Second Workshop on Scholarly Document Processing at NAACL 2021, is a classification challenge where each citation context is categorized based on its purpose and influence. It consists of two subtasks:
Task A: Multiclass classification of citation contexts based on purpose, with the categories BACKGROUND, USES, COMPARES_CONTRASTS, MOTIVATION, EXTENSION, and FUTURE.
Task B: Binary classification of citations into INCIDENTAL or INFLUENTIAL classes (i.e., a task for identifying the importance of a citation).
The training and test data sets for Task A and Task B are the same; they consist of 3,000 and 1,000 instances, respectively. We use the data for Task B in our experiments. However, the 3C data set does not provide full text, so we are able to test only 19 of our features: F1, F2, F9, F10, F11, F12, F13, F14, F15, F20, F23, F27, F28, F29, F30, F31, F32, F33, and F34. We achieved an F1 score of 0.5358 with these 19 features on the privately held 3C test set. We provide the results on the validation set using a Random Forest classifier in Table 7. The best performing system in 3C achieved an F1 score of 0.60, while the baseline F1 score was 0.30.
6.3. Research Lineage: Case Studies
Our end goal is not just citation classification but to use a highly accurate citation significance detection approach to trace significant citations and thereby establish a lineage of the given research. As explained in Section 2, by research lineage we aim to identify idea propagation via tracking significant citations. To achieve this, we create a Significant Citation Graph (SCG): a graph in which each node represents a research paper, and there is a directed edge from each cited paper node to the corresponding citing paper node, indicating the flow of knowledge from the cited paper to the citing paper. In a usual citation graph, all citations have equal weight; in our case, each edge is labeled as either significant or contextual using the approach discussed in the previous section. Our idea is similar to that of existing scholarly graph databases; however, we go one step further and depict how a particular concept or piece of knowledge has propagated through consecutive citations.
Algorithm 1 shows the method to create the adjacency list for the SCG. The citation significance detection model is trained on a given data set (Valenzuela in our case). To demonstrate the effectiveness of our method, we present an SCG for a set of papers on Document-Level Novelty Detection and MENNDL. Being the authors of the papers on these topics, we identified the significant citations of each paper and used them to test the effectiveness of our proposed method for creating an SCG.
Algorithm 1. Creating the adjacency list for the SCG

Input: Trained model and the concerned research document, P
Output: Adjacency list for the citation graph

1   Initialize adjacency list, A
2   Initialize an empty queue, Q
3   Q.add(P)
4   while Q is not empty do
5       for each citation, C in Q[0] do
6           Extract features (F1-F36) for C
7           if C is Significant and C is not in Q then
8               Q.add(C)
9               A[Q[0]].add(C)
10      Q.pop()
11  return A
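For reference, here is a runnable sketch of Algorithm 1 in Python. The citation retrieval and the classification step are abstracted behind two assumed callables (`citations_of` and `is_significant`, the latter wrapping feature extraction F1-F36 plus the trained classifier), and an explicit `visited` set stands in for the "not in Q" check so that papers already popped from the queue are not re-expanded.

```python
from collections import deque

def build_scg(pivot, citations_of, is_significant):
    """Breadth-first construction of the SCG adjacency list.
    `citations_of(p)` yields the papers citing p; `is_significant(p, c)`
    classifies the citation edge p -> c. Both interfaces are assumptions."""
    adjacency = {}
    queue = deque([pivot])
    visited = {pivot}
    while queue:
        paper = queue[0]
        for citing in citations_of(paper):
            if is_significant(paper, citing):
                # Record the significant edge cited -> citing.
                adjacency.setdefault(paper, []).append(citing)
                if citing not in visited:
                    visited.add(citing)
                    queue.append(citing)
        queue.popleft()
    return adjacency
```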
6.3.1. Case study I: Document-level novelty detection
Figure 4 shows an excerpt of an SCG for our Document-Level Novelty Detection papers. Red edges denote significant citations, whereas black edges denote contextual citations; our approach determined whether each citation edge is significant or contextual. In the citation graph, we are interested in the lineage among four textual novelty detection papers (P1, P2, P3, P4), annotated by the original authors. We annotated P1 as the pivot paper that introduced their document-level novelty detection data set; their other papers P2, P3, and P4 are based on P1. While P2 and P4 address novelty classification, P3 aims to quantify textual novelty. Our approach conforms to the annotation by the original authors: with P1 as the pivot, there are significant edges from P1 to each of P2, P3, and P4. There is also a significant edge between P2 and P4. However, there is no edge between P2 and P3, as they were contemporaneous submissions with different objectives (P2 was about novelty classification and P3 about novelty scoring). P1 → P2 → P4 forms a research lineage, as P2 extends P1 and P4 extends P2. Furthermore, we see that P12, P25, P24, and P22 (transitively) are some of the influential papers for P1. We verified with the authors that P25 was the first paper to introduce a document-level novelty detection data set, albeit from an information retrieval perspective; P25 inspired the authors to create the data set in P1 for ML experiments. We infer that P12, P22, and P24 had a significant influence on their investigations in P1. Hence, our approach (trained on a different set of papers in the Valenzuela data set) proved successful in identifying the significant citations and thereby the corresponding lineage.
6.3.2. Case study II: MENNDL HPC algorithm
We tested our approach's efficacy in predicting the lineage of the high-performance computing algorithm MENNDL. We show the research lineage of MENNDL (Young et al., 2015) in Figure 5. We asked the original authors to annotate their research progression with MENNDL. According to the authors, the first paper to describe the MENNDL algorithm was published in 2015 and is deemed the pivot (P9). The follow-up paper that carried forward the work in P9 was P4 in 2017. Then P1 came in 2018, building upon P4, and P7 and P12 came as extensions of P4. Next came P6 in 2019, which took forward the work from P1. With P9 as the source, our approach correctly predicted the lineage as P9 → P4 → P1 → P6. The lineages P9 → P4 → P12 and P9 → P4 → P7, traced via significant citations, are also visible in the SCG in Figure 5. We annotated P8 as an application of P9; hence, no significant link exists between P9 and P8.
From the above experiments and case studies, it is clear that our proposed method works reasonably well when a paper meaningfully cites the papers that influenced it. However, some papers do not cite the papers by which they are inspired; in such cases, our method would not work.
7. CONCLUSION AND FUTURE WORK
In this work, we present our novel idea of finding a research lineage to accelerate literature review. We achieve state-of-the-art performance on citation significance detection, which is a crucial component in forming the lineage. We leverage this and show the efficacy of our approach on two completely different research topics. Our approach is simple and could easily be implemented on a large-scale citation graph (given the papers' full text). The training data set is built from NLP papers; however, we demonstrate our approach's efficacy by testing on two topics, one from NLP and the other from HPC, thereby establishing that our approach is domain agnostic. Identifying significant citations to form a research lineage would also help the community understand the real impact of a research work beyond simple citation counts. We look forward to experimenting with deep neural architectures to automatically identify meaningful features for the current task. Our next foray would be to identify missing citations: papers that played an instrumental role in certain works but unfortunately are not cited. We release all the code related to our experiments at https://figshare.com/s/2388c54ba01d2df25f38.
AUTHOR CONTRIBUTIONS
Tirthankar Ghosal: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Writing—Original draft. Piyush Tiwary: Formal analysis, Implementation, Writing—Original draft. Robert Patton: Funding acquisition, Supervision. Christopher Stahl: Conceptualization, Data curation, Project administration, Writing—Review & editing.
ACKNOWLEDGMENTS
This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy (DOE). The views expressed in the article do not necessarily represent the views of the DOE or the U.S. government. The U.S. government retains and the publisher, by accepting the article for publication, acknowledges that the U.S. government retains a nonexclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for U.S. government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (https://energy.gov/downloads/doe-public-access-plan).
TG also thanks the Oak Ridge Institute for Science and Education (ORISE) for sponsorship for the Advanced Short-Term Research Opportunity (ASTRO) program at the Oak Ridge National Laboratory (ORNL). The ASTRO program is administered by the Oak Ridge Institute for Science and Education (ORISE) for the U.S. Department of Energy. TG also acknowledges the Visvesvaraya PhD fellowship award VISPHD-MEITY-2518 from Digital India Corporation under Ministry of Electronics and Information Technology, Government of India.
FUNDING INFORMATION
TG was sponsored by ORNL under the ORISE ASTRO Internship Program.
DATA AVAILABILITY
We release all the code related to our experiments at https://figshare.com/s/2388c54ba01d2df25f38.
Author notes
Equal contribution.