Further improvements on estimating the popularity of recently published papers

Abstract As the number of published scientific papers continually increases, the ability to assess their impact becomes more valuable than ever. In this work, we focus on the problem of estimating the expected citation-based popularity (or short-term impact) of papers. State-of-the-art methods for this problem attempt to leverage the current citation data of each paper. However, these methods are prone to inaccuracies for recently published papers, which have a limited citation history. In this context, we previously introduced ArtSim, an approach that can be applied on top of any popularity estimation method to improve its accuracy. Its power originates from providing more accurate estimations for the most recently published papers by considering the popularity of similar, older ones. In this work, we present ArtSim+, an improved ArtSim adaptation that considers an additional type of paper similarity and incorporates a faster configuration procedure, resulting in improved effectiveness and configuration efficiency.


INTRODUCTION
With the growth rate of scientific articles (also known as papers) continually increasing (Larsen & von Ins, 2010), the reliable assessment of their scientific impact is now more valuable than ever. Consequently, a variety of impact measures have been proposed, aiming to quantify scientific impact at the paper level. Such measures have various practical applications: for instance, they can be used to rank the results of keyword-based searches (e.g., Vergoulis, Chatzopoulos et al., 2019), facilitating literature exploration and reading prioritization, or to assist the comparison and monitoring of the impact of different research projects, institutions, or researchers (e.g., Papastefanatos, Papadopoulou et al., 2020).
Because scientific impact can be defined in many, diverse ways (Bollen, Van de Sompel et al., 2009), the proposed measures vary in terms of the approach they follow (e.g., citation-based, altmetrics), as well as in the aspect of scientific impact they attempt to capture (e.g., impact in academia, social media attention). In this work, we focus on citation-based measures that attempt to estimate the expected scientific impact of each paper in the near future (i.e., its current popularity). Providing accurate estimations of paper popularity is an open problem, as indicated by a recent extensive experimental evaluation (Kanellos, Vergoulis et al., 2021a). Furthermore, popularity distinctly differs from the overall (long-term) impact of a paper that is usually captured by traditional citation-based measures (e.g., citation count). One important issue in estimating paper popularity is to provide accurate estimations for the most recently published papers. The estimations of most popularity measures rely on the existing citation history of each paper. However, as very limited citation history data are available for recent papers, their impact estimation based on these data is prone to inaccuracies. Hence, these measures fail to provide accurate estimations for recent papers. To alleviate this issue, in Chatzopoulos, Vergoulis et al. (2020) we introduced ArtSim, an approach that can be applied on top of any existing popularity estimation method to improve its accuracy. ArtSim does not only rely on each paper's citation history data but also considers the history of older, similar papers, for which these data are more complete. The intuition behind the approach is that similar papers (e.g., having similar topics and/or author lists) are likely to have similar popularity dynamics. 
To quantify paper similarity, ArtSim exploits author lists and the involved topics, based on data that can easily be found in scholarly knowledge graphs, a large variety of which have been made available in recent years (e.g., AMiner's DBLP-based data sets (Tang, Zhang et al., 2008), the Open Research Knowledge Graph (Jaradeh, Oelen et al., 2019), and the OpenAIRE Research Graph (Manghi, Atzori et al., 2019a; Manghi, Bardi et al., 2019b)).
Our experiments showed that ArtSim effectively enhances the performance of traditional methods in estimating article popularity. However, at the same time, we found that there was room for further improvement. In this context, we extended ArtSim and produced an improved version called ArtSim+. This new approach maintains all the benefits of ArtSim and introduces two main improvements: (a) it takes into consideration an additional type of paper similarity, based on publication venues, and (b) it leverages a more efficient and more effective configuration procedure based on the technique of generalized simulated annealing. To evaluate ArtSim+'s performance, we reproduce the most important of our previous experiments and extend them by investigating the effect on an additional popularity estimation method. Furthermore, we conduct thorough experiments to showcase the effects of the new configuration procedure. Finally, we provide both the ArtSim and ArtSim+ implementations as open source code under a GNU/GPL license.

Preliminaries
In this work, we focus on approaches that aim at estimating the citation-based popularity of scientific papers. In general, citation-based measures are defined and calculated on top of the citation network, that is, the directed graph of all papers (nodes) along with their citations (edges); each directed edge i → j, with i and j being nodes of the graph, denotes that paper i cites paper j. This information is usually encoded in the citation network's adjacency matrix A, where A[i, j] = 1 if paper i cites paper j and A[i, j] = 0 otherwise.
For popularity, we adopt the definition given in Kanellos et al. (2021a). According to this, popularity at current time t_c can only be accurately quantified a posteriori, when papers receive citations as a result of being currently studied. Because citation networks evolve over time, we define the adjacency matrix at time t_c as A(t_c). Given a parameter T, which denotes a future time window, we can define the adjacency matrix A(t_c + T) − A(t_c), which describes the citation network containing only the citations made in the time span [t_c, t_c + T]. The popularity of papers is then given by the citation count based on A(t_c + T) − A(t_c). It is worth mentioning that T is a problem parameter, which depends on various factors, such as the publication life cycle in a particular scientific discipline (manuscript writing, peer review, publication).
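Under this definition, a paper's popularity is simply a column sum of the difference between the two adjacency snapshots. A minimal numpy sketch (the toy matrices are our own, not from the paper):

```python
import numpy as np

# A[i, j] = 1 means paper i cites paper j (three toy papers).
A_tc = np.array([[0, 1, 0],
                 [0, 0, 1],
                 [0, 0, 0]])      # snapshot of the network at time t_c

A_tc_T = np.array([[0, 1, 0],
                   [0, 0, 1],
                   [1, 1, 0]])    # snapshot at t_c + T: paper 2 has since cited papers 0 and 1

# Citations made only within (t_c, t_c + T]; popularity = citations received.
popularity = (A_tc_T - A_tc).sum(axis=0)
```

Here papers 0 and 1 each receive one new citation in the window, while paper 2 receives none.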
Our proposed approach to estimating popularity is based on exploiting path-based similarities of papers that can be calculated using scholarly knowledge graphs. Knowledge graphs, also known as heterogeneous information networks (HINs) (Shi, Li et al., 2017), are graphs that encode rich domain-specific information about various types of entities, represented as nodes, and the relationships between them, represented as edges. Figure 1 presents an example of such a knowledge graph, consisting of nodes representing papers (P), authors (A), venues (V), and topics (T). Three types of (bidirectional) edges are present in this example network: edges between authors and papers, denoted as AP or PA; edges between papers and topics, denoted as PT or TP; and edges between papers and venues, denoted as PV or VP. The first edge type captures the authorship of papers, the second encodes the information that a particular paper is written on a particular topic, while the last captures the fact that a paper has been published in a particular venue.
Various semantics are implicitly encoded in the paths (i.e., edge/node sequences) of knowledge graphs. In fact, all paths that correspond to the same sequence of node and edge types (i.e., the same metapath (Sun, Han et al., 2011b)) encode latent relationships of the same interpretation between the starting and ending nodes. Metapaths can be represented by the sequence of the respective node and edge types, but, to simplify the notation, it is usually assumed (Shi, Li et al., 2016a; Sun et al., 2011b) that there is at most one type of edge between any pair of node types in the HIN; thus, each metapath is denoted by the sequence of the respective node types. For example, in the graph of Figure 1, the metapath Author-Paper-Topic-Paper-Author, or APTPA for brevity, relates two authors that have published works on the same topic (e.g., both 'John Doe' and 'Henry Jekyll' have papers about 'DL'). Metapaths are useful for many graph analysis and exploration applications. For example, in our approach, we use them to calculate metapath-based similarities: the similarity between two nodes of the same type, based on the semantics of a given metapath, can be captured by considering the number of instances of this metapath connecting these nodes (e.g., Sun et al., 2011b; Xiong, Zhu, & Yu, 2015).
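Counting metapath instances reduces to multiplying the incidence matrices along the metapath. A toy illustration for APTPA (the matrix contents are our own invention, not the actual graph of Figure 1):

```python
import numpy as np

# Incidence matrices of a toy HIN:
# AP[a, p] = 1 if author a wrote paper p; PT[p, t] = 1 if paper p covers topic t.
AP = np.array([[1, 1, 0],
               [0, 1, 1]])       # 2 authors, 3 papers
PT = np.array([[1, 0],
               [1, 1],
               [0, 1]])          # 3 papers, 2 topics

# Multiply along the metapath: APT counts Author-Paper-Topic paths,
# and APT @ APT.T counts full APTPA instances between author pairs.
APT = AP @ PT
aptpa = APT @ APT.T
```

Entry aptpa[0, 1] gives the number of APTPA instances connecting the two authors, which can then feed a metapath-based similarity measure.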

Methods to estimate scientific impact
There is a large body of work in the areas of bibliometrics and scientometrics on quantifying the impact of scientific articles. Much of this focus has been on methods that calculate variations of citation counts and PageRank. The latter algorithm, although originally introduced to evaluate the importance of Web pages, has been successfully adapted and applied to citation networks, providing insights about the scientific impact of papers (Chen, Xie et al., 2007). Furthermore, it has spawned a separate line of work that aims to improve it when applied on citation networks (Mariani, Medo, & Zhang, 2016; Su, Pan et al., 2011; Vaccario, Medo et al., 2017; Zhou, Zheng et al., 2016). However, these works focus on capturing the overall impact of papers, rather than their expected short-term impact, or popularity (Ghosh, Kuo et al., 2011; Sayyadi & Getoor, 2009; Walker, Xie et al., 2007). Estimating popularity is an interesting problem: On the one hand, it has been shown to have a more pronounced improvement margin (Kanellos et al., 2021a), and on the other hand it corresponds to important real application scenarios. For instance, researchers using search engines to find papers in their scientific fields would benefit from a popularity-based ranking to identify the current and most recent research trends. In-depth examinations of various impact measures that have been proposed in the relevant literature can be found in Kanellos et al. (2021a) and Bai, Liu et al. (2017). In contrast to these methods, our approaches, ArtSim and ArtSim+, do not aim to introduce a new popularity measure but rather to improve the accuracy of existing ones.

ArtSim
In previous work (Chatzopoulos et al., 2020), we introduced ArtSim, an approach that can be applied on top of any popularity estimation method to improve its accuracy. Its power originates from providing better estimations for most of the recently published papers by finding older papers that are similar to them, and considering their average popularity. The intuition is that older papers have a more complete citation track and that similar papers are likely to follow a similar trajectory in terms of popularity. To quantify paper similarity, ArtSim exploits the corresponding author lists and the involved topics. This information is available in scholarly knowledge graphs, a large variety of which have been made available in recent years.

Entity similarity in HINs
Both ArtSim+ and its predecessor are built upon recent work on entity similarity in the area of heterogeneous information networks. Some of the first entity similarity approaches for such networks (e.g., PopRank (Nie, Zhang et al., 2005) and ObjectRank (Balmin, Hristidis, & Papakonstantinou, 2004)) are based on random walks. Later works, such as PathSim (Sun et al., 2011b), focus on providing more meaningful results by calculating node similarity measures based on user-defined semantics. Our work is based on JoinSim (Xiong et al., 2015), which is more efficient compared to PathSim, making it more suitable for analyses on large-scale networks.

Basic Approach
Like its predecessor, ArtSim+ can be applied on top of any popularity measure to increase the accuracy of its estimations. As such, ArtSim+ takes the scores calculated by any popularity method as input, applies transformations on them, and produces a new set of improved popularity scores. This process is illustrated in Figure 2.
The transformations applied on popularity scores by ArtSim+ rely on the assumption that similar articles are expected to share similar popularity dynamics. To calculate the similarity between different papers, ArtSim+ relies on a scholarly knowledge graph that contains information about papers, authors, venues, and topics, as well as the connections between them (like the one presented in Figure 1). On a knowledge graph with such a schema, it is possible to define paper similarity according to various semantics using the JoinSim (Xiong et al., 2015) similarity measure calculated on different metapaths (see Section 2.1 for details). ArtSim+ considers paper similarity according to the Paper-Author-Paper (PAP), Paper-Topic-Paper (PTP), and Paper-Venue-Paper (PVP) metapaths. The PAP metapath defines the similarity of papers according to their common authors, the PTP metapath defines similarity based on their common topics, and the PVP metapath is based on their venue. ArtSim+ uses the calculated similarity scores to provide improved popularity estimates (scores), focusing, in particular, on recent papers that have a limited citation history (i.e., those that are going through their cold start period). The calculation of ArtSim+ scores is based on the following formula:

ArtSim+(p) = α · S_PAP + β · S_PTP + γ · S_PVP + δ · S_i, if p was published in (t_c − y, t_c],
ArtSim+(p) = S_i, otherwise,

where S_PAP, S_PTP, and S_PVP are the average popularity scores of all the articles that are similar to p based on metapaths PAP, PTP, and PVP, respectively; S_i is the initial popularity score of paper p based on the original popularity measure; t_c denotes the current year; and y > 0 determines the length of the cold start window. That is, our method transforms the popularity scores only of those papers published in the time span (t_c − y, t_c].
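The ArtSim+ transformation described above can be sketched as follows; the data layout, variable names, and the handling of papers without similar neighbors are our assumptions, not details from the paper:

```python
import numpy as np

def artsim_plus(scores, sim_pap, sim_ptp, sim_pvp, pub_year,
                alpha, beta, gamma, delta, y, t_c):
    """Sketch of the ArtSim+ score transformation.

    scores   : dict paper -> initial popularity score S_i
    sim_pap  : dict paper -> list of similar papers under the PAP metapath
               (sim_ptp / sim_pvp analogously for PTP / PVP)
    pub_year : dict paper -> publication year
    """
    assert abs(alpha + beta + gamma + delta - 1.0) < 1e-9
    new_scores = {}
    for p, s_i in scores.items():
        if not (t_c - y < pub_year[p] <= t_c):
            new_scores[p] = s_i            # outside the cold-start window: unchanged
            continue
        def avg(similar):                  # mean score of similar papers, 0 if none known
            return np.mean([scores[q] for q in similar[p]]) if similar.get(p) else 0.0
        new_scores[p] = (alpha * avg(sim_pap) + beta * avg(sim_ptp)
                         + gamma * avg(sim_pvp) + delta * s_i)
    return new_scores
```

For example, with α = δ = 0.5 a recent paper's score becomes the mean of its initial score and the average score of its PAP-similar papers.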

Improving Method Configuration
ArtSim+ depends on parameters α, β, γ, δ ∈ [0, 1], the values of which are set so that α + β + γ + δ = 1. Varying these parameters in the range [0, 1] has the following effects: As α increases, the ArtSim+ score depends mostly on similar articles based on common authors. Similarly, as β and γ increase, the score is mainly based on similar articles based on common topics and venues, respectively. Finally, as δ approaches 1, the popularity scores remain identical to those calculated by the original popularity measure.
To determine the best configuration of our approach, an exhaustive "grid" search of the parameter space can be performed. The original version of ArtSim (Chatzopoulos et al., 2020) follows this approach, and the same technique can be applied to any ArtSim adaptation incorporating different types of metapaths. However, grid search can be highly inefficient; in practice, the cost of such a search depends on the number of parameters to be determined and the granularity of the examined grid. Thus, for a method with a large number of parameters (such as ArtSim+), the corresponding search grid can become very large, resulting in a time-consuming process. To counterbalance this problem, a coarse-grained grid search could be performed; however, this would run the risk of missing the optimal configuration. To alleviate this issue, we propose the use of the Generalized Simulated Annealing (GSA) (Tsallis & Stariolo, 1996; Xiang, Sun et al., 1997) algorithm to search the parameter space for the optimal configuration instead of performing a full grid search. GSA is a search algorithm that can be used to approximate the optimal parameter values for an optimization problem (e.g., to find the parameters of ArtSim+ that achieve the best accuracy). It combines the approach of Simulated Annealing (SA) (Kirkpatrick, Gelatt, & Vecchi, 1983) with that of Fast Simulated Annealing (FSA) (Szu & Hartley, 1987). SA is a traditional search algorithm that combines hill climbing with a random search mechanism, accepting not only changes that improve the objective function but also underperforming ones, with a certain probability. However, SA employs a local (Gaussian) visiting distribution, so the majority of the search is confined to certain regions of the search space. For this reason, Fast Simulated Annealing (FSA) was introduced.
FSA utilizes a semilocal (Cauchy-Lorentz) distribution, traversing the search space more efficiently, but it can still be trapped in local optima (Xiang & Gong, 2000). GSA achieves faster convergence and a higher probability of finding the global optimum (Xiang & Gong, 2000), outperforming SA and FSA. It utilizes a distorted Cauchy-Lorentz distribution controlled by a parameter q_v, while its acceptance probability depends on the acceptance parameter q_α (Xiang, Gubian et al., 2013). GSA searches the space more uniformly than its competitors, with the difference in performance becoming more prominent as the number of variables of the objective function increases (Xiang & Gong, 2000).
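A GSA-style search of this kind is available off the shelf in SciPy's `dual_annealing`, whose `visit` and `accept` arguments play the roles of q_v and q_α. The sketch below searches a simplex-constrained parameter space like ArtSim+'s; the objective is a hypothetical smooth surrogate of our own, not ArtSim+'s actual accuracy function:

```python
import numpy as np
from scipy.optimize import dual_annealing

def neg_accuracy(x):
    """Quantity to minimize (stand-in for the negated Kendall's tau).

    x holds (alpha, beta, gamma); delta = 1 - alpha - beta - gamma keeps the
    parameters on the simplex alpha + beta + gamma + delta = 1.
    """
    alpha, beta, gamma = x
    delta = 1.0 - alpha - beta - gamma
    if delta < 0.0:
        return 1e6                      # infeasible point: heavy penalty
    # Hypothetical surrogate with its optimum at (0.2, 0.3, 0.1).
    return (alpha - 0.2) ** 2 + (beta - 0.3) ** 2 + (gamma - 0.1) ** 2

result = dual_annealing(neg_accuracy, bounds=[(0, 1)] * 3,
                        visit=2.62, accept=-5.0,   # q_v / q_alpha analogues
                        seed=42, maxiter=200)
alpha, beta, gamma = result.x
delta = 1.0 - alpha - beta - gamma
```

Searching over three free parameters and deriving δ from the constraint avoids a four-dimensional grid entirely, which is the efficiency argument made above.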

EVALUATION
In this section, we discuss the experiments conducted to assess the effectiveness of our method. In particular, we first elaborate on the experimental setup (Section 4.1). Then, in Section 4.2, we provide our findings regarding the effectiveness of ArtSim+ in improving the accuracy of various state-of-the-art popularity estimation methods. During this experiment we also compare ArtSim+ to ArtSim (Chatzopoulos et al., 2020), its predecessor, showcasing the superior performance of the current approach. Finally, in Section 4.3 we discuss the efficiency and effectiveness gains introduced to ArtSim+ due to the improved configuration process described in Section 3.2.

Data sets
For our experiments, we used the following data sets:

- DBLP Scholarly Knowledge Graph (DSKG) data set. This contains data for 3,079,008 computer science papers, 1,766,548 authors, 5,079 venues, and 3,294 topics from DBLP. It is based on AMiner's citation network data set (Tang et al., 2008), enriched with topics from the CSO Ontology (Salatino, Thanapalasingam et al., 2018). The topics have been assigned to each paper by applying the CSO Classifier (Salatino, Osborne et al., 2019) to its abstract.
- DBLP Article Similarities (DBLP-ArtSim) data set. This contains similarities among papers in the previous network based on different metapaths. In particular, we calculated paper similarities based on (a) their author lists, using the Paper-Author-Paper (PAP) metapath; (b) their common topics, captured by the Paper-Topic-Paper (PTP) metapath; and (c) their venues, according to the Paper-Venue-Paper (PVP) metapath. This data set is openly available on Zenodo (Chatzopoulos, Vergoulis et al., 2021) under a CC BY 4.0 license and contains approximately 31 million PAP, 207 million PTP, and 11 billion PVP metapath instances. It should be noted that the first version of this data set was a contribution of our previous work (Chatzopoulos et al., 2020); the current version has been updated to also include the Paper to Venue relationships.

Evaluation methodology
To assess the accuracy of methods in estimating paper popularity, we follow the experimental framework proposed in Kanellos et al. (2021a). As discussed in Sections 1 and 2, a paper's popularity, by definition, is reflected in the citations it receives in the near future. The aforementioned framework splits a given citation network data set C into two parts, C_old and C_future, according to a given split time point t_s, and uses C_old (containing all papers published no later than t_s) as input to the estimation methods, while C_future (all papers published between t_s and a second given time point t_s + T, with T > 0) is taken as ground truth. The ground truth is used to calculate, for each paper published no later than t_s, all citations it received during the (t_s, t_s + T] period. Then, the total orderings (the rankings) of these papers based on these citations are compared with the rankings provided by each popularity estimation method. The method that produces the ranking most similar to the ground truth ranking is the one with the most accurate estimations. The ranking similarities are usually measured using both an overall similarity measure (e.g., the ranked list correlation according to Spearman's ρ or Kendall's τ) and a top-k similarity measure (e.g., nDCG@k); each type of similarity better fits the needs of different applications.
At this point, it should be highlighted that each popularity estimation method produces its own measure value for each paper (i.e., its own score), and thus a direct comparison of these scores for the same paper is not possible; therefore, comparing the similarities of the methods' ranking to the ground truth ranking to measure each method's accuracy is an adequate alternative, especially as most applications only require popularity/impact measures for partial comparisons.
In our experiments, we configured the framework so that t_s splits the used citation network data set into two equally sized (in terms of nodes) networks, while T is selected so that C_future contains 30% more papers than C_old. Regarding ranked list similarities, we use Kendall's τ (Kendall, 1948) to capture overall similarity and nDCG@k to capture top-k similarity. Kendall's τ is an overall correlation measure with values in the [−1, 1] range, with 1 and −1 corresponding to perfect agreement and disagreement, respectively, while 0 reflects no correlation. nDCG@k, on the other hand, is a measure of ranking quality with values in the range [0, 1], with 1 corresponding to an ideal ranking of the top-k elements.
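Both measures are standard; the following small illustration scores one method's ranking against a ground-truth future-citation count (the toy values and the nDCG helper are our own):

```python
import numpy as np
from scipy.stats import kendalltau

def ndcg_at_k(relevance_by_rank, k):
    """nDCG@k, where relevance_by_rank[i] is the ground-truth relevance
    (e.g., future citations) of the paper a method ranked at position i."""
    rel = np.asarray(relevance_by_rank, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
    dcg = (rel * discounts).sum()
    ideal = np.sort(np.asarray(relevance_by_rank, dtype=float))[::-1][:k]
    idcg = (ideal * discounts).sum()
    return dcg / idcg if idcg > 0 else 0.0

# Toy ground truth (citations received in (t_s, t_s + T]) and a method's scores:
future_citations = np.array([10, 3, 7, 0, 1])
method_scores = np.array([9.0, 4.0, 8.0, 1.0, 0.0])

tau, _ = kendalltau(method_scores, future_citations)   # overall agreement
rank_order = np.argsort(-method_scores)                # the method's ranking
ndcg5 = ndcg_at_k(future_citations[rank_order], k=5)   # top-k quality
```

Here the method orders the top three papers correctly and only swaps the last two, so τ is high but below 1, while nDCG@5 is close to 1 because the misranked papers carry little relevance.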

Popularity estimation methods
As mentioned, our approaches (ArtSim and ArtSim+) are used on top of other popularity estimation methods, resulting in improvements in their estimation accuracy. Thus, any experiments should involve at least one popularity method on top of which ArtSim and ArtSim+ are applied. In this work, we selected four popularity estimation methods that were found to perform well according to a recent experimental study (Kanellos et al., 2021a). In addition, we also included AttRank (Kanellos, Vergoulis et al., 2021b), which was found to outperform the best popularity estimation methods in later experiments. However, this is just an indicative set of methods: ArtSim and ArtSim+ can easily be applied on top of any other. The configurations used for each method (presented in Tables 1 and 2) were selected after examining various configurations and choosing the one that achieved the best result according to the similarity measure under consideration (Kendall's τ or nDCG@k; see also Section 4.1.2).
Moreover, for convenience, we briefly describe the intuition behind each method below:

- AttRank (Kanellos et al., 2021b) is a PageRank variation that modifies PageRank's so-called random jump probability. In AttRank, this probability is not uniform, but results as a combination of an age-based weight and a recent attention-based weight. The latter is determined based on the fraction of total citations received by each paper in recent years. It uses parameters α, β, γ ∈ (0, 1), ρ ∈ (−∞, 0), and y. Parameter y denotes the starting year, onward from which the recent attention is determined. Parameter ρ is the coefficient of the publication age-based weights, which decrease exponentially based on age. Parameters α, β, γ are the coefficients of the PageRank calculation, the random jump probability based on recent attention, and the random jump probability based on publication age, respectively.
- Retained Adjacency Matrix (RAM) (Ghosh et al., 2011) estimates popularity using a time-aware adjacency matrix to capture the recency of cited papers. The parameter γ ∈ (0, 1) is used as the basis of an exponential function that scales down the value of a citation link according to its age.
- Effective Contagion Matrix (ECM) (Ghosh et al., 2011) is an extension of RAM that also considers the temporal order of citation chains apart from direct links. It uses two parameters α, γ ∈ (0, 1), where α adjusts the weight of citation chains based on their length and γ is the same as in RAM.
- CiteRank (CR) (Walker et al., 2007) estimates popularity by simulating the behavior of researchers searching for new articles. It uses two parameters, α ∈ (0, 1) and τ_dir ∈ (0, ∞), to model the traffic to a given paper. A paper is randomly selected with an exponentially discounted probability according to its age, with τ_dir being the decay factor. Parameter α is the probability that a researcher stops her search, with 1 − α being the probability that she continues with a reference of the paper she just read.
- FutureRank (FR) (Sayyadi & Getoor, 2009) scores are calculated by combining PageRank with calculations on a bipartite graph of authors and papers, while also promoting recently published articles with time-based weights. It uses parameters α, β, γ ∈ (0, 1) and ρ ∈ (−∞, 0); α is the coefficient of the PageRank scores, β is the coefficient of the authorship scores, and γ is the coefficient of the time-based weights, which decrease exponentially based on the exponent ρ.
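To make the time-aware weighting behind RAM concrete, here is a rough sketch of its core idea as described above; the toy data are ours, and the actual method in Ghosh et al. (2011) involves further details:

```python
import numpy as np

# Toy citation network: A[i, j] = 1 means paper i cites paper j.
A = np.array([[0, 1, 1],
              [0, 0, 1],
              [0, 0, 0]])
pub_year = np.array([2021, 2019, 2015])   # publication year of each (citing) paper
t_c, gamma = 2021, 0.5

# Each citation link is scaled down by gamma raised to its age (approximated
# here by the citing paper's age), so recent citations count more.
weights = gamma ** (t_c - pub_year)
ram_scores = (A * weights[:, None]).sum(axis=0)
```

In this example, the 2015 paper's score comes from one fresh citation (weight 1) and one older citation (weight 0.25), illustrating how the exponential decay discounts aged citation links.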

Evaluating the Effectiveness of ArtSim+
In this experiment, we examine the gains introduced by applying ArtSim+ on top of various popularity estimation methods in terms of their improved estimation accuracy. Based on the evaluation framework used (see Section 4.1.2), we first evaluate the estimation accuracy in terms of Kendall's τ (Section 4.2.1) and then in terms of nDCG@k (Section 4.2.2).

Improvements in terms of Kendall's τ
In this experiment, we examine the accuracy of each of the examined popularity estimation methods (AttRank, ECM, RAM, CR, and FR) with and without the assistance of ArtSim and ArtSim+, in terms of Kendall's τ, for y ∈ {1, 3, 5}. Recall that y is the parameter that determines which papers are in their "cold start" phase (e.g., if y = 3, then all papers published between t_s − 2 and t_s are considered to be in their cold start phase). All methods were configured based on the parameter settings included in Table 1, ArtSim was configured exactly as in Chatzopoulos et al. (2020), and ArtSim+ was configured based on the outputs of the experiments in Section 4.3. Figure 3 summarizes our findings.
Overall, both ArtSim and ArtSim+ introduce accuracy improvements to all popularity estimation methods. In all cases, ArtSim+ achieves a larger improvement than ArtSim, indicating that considering venue-based paper similarities (captured by the PVP metapath) indeed improves accuracy. The most significant improvements, for both ArtSim and ArtSim+, are observed when they are applied on ECM and RAM. In particular, ECM and RAM are improved by 10-12% when applying ArtSim+ over the plain methods and by 4-5% over ArtSim for y ∈ {3, 5}. AttRank, on the other hand, appears to have significantly smaller gains. The larger gains achieved for RAM and ECM can be explained by the fact that these methods rely heavily on each paper's current citations. Hence, a large number of recent papers without any citations, which, however, are likely to gather citations in the near future, are ranked at the bottom by these methods. In contrast, AttRank, CR, and FR give ranking advantages to papers based on their publication age; hence, the papers that can benefit from ArtSim/ArtSim+ are already advantaged in part by these methods. It should be noted that, as expected, smaller gains for all methods are achieved for y = 1. In that case, our approach affects only the popularity scores of papers published in the last year, that is, only a small fraction of all papers.

Improvements in terms of nDCG@k
We also examine the accuracy of all estimation methods (with and without the ArtSim/ArtSim+ assistance) in terms of nDCG@k, for y ∈ {1, 3, 5} and k ∈ {5, 50, 500,000}. Similarly to the experiment in Section 4.2.1, the best configurations for the examined accuracy measure were selected for ArtSim, ArtSim+, and each estimation method (see also Table 2). Our findings are depicted in Figure 4.
Interestingly, for small values of k, our approach performs equally well as the plain popularity estimation methods. This behavior indicates that, at least to some extent, the existing state-of-the-art methods accurately identify the top popular papers. Another likely explanation is that, for a small k, the set of top-k popular papers at the level of the whole data set mainly consists of widely known, fundamental papers that already have a significant citation trajectory. To put it differently, the percentage of the top-k popular papers that are going through their cold start period is significantly smaller for small k values (see Table 3). This characteristic of small k values was the motivation to also examine k = 500,000, apart from k = 5 and k = 50. Going back to our experimental results, it is evident that the accuracy gains for all popularity estimation methods are indeed more apparent for k = 500,000.
It may be tempting to conclude that, although ArtSim+ brings evident accuracy improvements in terms of Kendall's τ, it provides apparent improvements in terms of nDCG@k only for extremely large k values, which are not relevant to any practical scenario. Although this rationale may seem intuitive, it does not hold in practice, because the overall top-ranking papers may be dominated by particular subfields that are characterized by a higher citation density, or that gather citations more quickly (e.g., due to a large number of frequent conferences in the field). Hence, the accuracy gains that ArtSim+ brings may be useful in various real applications involving searches on particular subfields or keywords. To showcase this, we also conducted an experiment that replicates a real-world application scenario: literature exploration by a researcher in an academic search engine.
The concept is the following: the users of such search engines usually refine their searches based on multiple keywords and filters (e.g., based on the venues of interest or the publication years) to reduce the number of papers they have to examine. However, even in this case, the results usually contain at least hundreds of papers. Hence, effective popularity-based ranking is crucial to facilitate reading prioritization. Our experiment involves three individual search scenarios. In the first scenario, we used the query "expert finding." This keyword search resulted in a set of 549 articles. Figure 5(a) presents the nDCG values for this search, per popularity estimation method, along with the gains of ArtSim and ArtSim+ for y = 3. (All popularity estimation methods have been configured with the parameters that achieve the best nDCG@k values for k = 500,000 on the whole data set.) We observe that ArtSim+ improves the nDCG values for k = 50 and k = 100. In our second scenario, we tried a constrained query. In particular, we used "recommender systems" as the search keywords, keeping only papers published in well-known venues of data management and recommender systems, namely VLDB, SIGMOD, TKDE, ICDE, EDBT, RecSys, and ICDM. The result set includes 318 articles. Figure 5(b) presents the nDCG results. We observe that ArtSim+ boosts the nDCG scores for all measures, starting from the smallest value of k = 5. Finally, we tried a keyword search with the phrase "digital libraries," which yielded 3,793 articles; the results are presented in Figure 5(c). In this case, the benefits are smaller than in the previous two search scenarios; however, ArtSim and ArtSim+ still improve the nDCG scores at k = 50 and k = 100.
Overall, the results of these keyword search scenarios indicate that, in addition to improving the overall correlation, our approach also improves the top returned results of practical, keyword-based queries. For all the aforementioned scenarios, the best parameter configurations of ArtSim and ArtSim+ are presented in Table A3 of the Appendix.
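To make the evaluation measure used in these scenarios concrete, the following is a minimal sketch of nDCG@k for a ranked result list. It assumes the standard logarithmic-discount formulation and that the relevance of each paper is a nonnegative score (e.g., the citations it receives in the test period, as is common in this setting); it is an illustrative implementation, not the exact code used in the paper.

```python
import math

def ndcg_at_k(ranked_relevances, k):
    """nDCG@k for a list of relevance scores, given in ranked order.

    DCG discounts each relevance by log2 of its (1-based) rank + 1;
    nDCG normalizes by the DCG of the ideal (descending) ordering.
    """
    def dcg(rels):
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels))
    ideal = sorted(ranked_relevances, reverse=True)
    idcg = dcg(ideal[:k])
    return dcg(ranked_relevances[:k]) / idcg if idcg > 0 else 0.0

# A perfectly ordered list scores 1.0; misplacing a relevant paper lowers it.
ndcg_at_k([3, 2, 1], k=3)  # 1.0
ndcg_at_k([0, 3, 1], k=3)  # < 1.0
```

A ranking method that places the soon-to-be-popular papers near the top of the result list therefore achieves nDCG@k values closer to 1.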

ArtSim+ Configuration
ArtSim+ has a wide range of configuration parameters. For this reason, a new, GSA-based configuration process was introduced to configure most of them easily and efficiently (see Section 3.2). In this section, we present a series of experiments that evaluate the efficiency and accuracy gains introduced by this process.
Before proceeding with the experiments, it is worth mentioning that ArtSim+ has a parameter y whose values are selected manually. The reason for not including y in the automatic configuration process is that it takes discrete integer values from a narrow domain, and is thus easy to configure manually. In particular, y is the parameter that determines which papers are considered to be going through their "cold start" phase. The best y value for a given data set depends on the disciplines of the papers it contains. For example, papers from the life sciences are expected to receive citations at a faster rate than papers from theoretical mathematics; hence, a smaller y value should be selected when configuring ArtSim+ for a data set with papers from the former discipline than for one with papers from the latter. The data set we use in our experiments contains computer science papers; based on previous experience, we decided to use (for all of our experiments) y values no greater than 5. In particular, we examined three different configurations of this parameter (namely, y = 1, y = 3, and y = 5) to investigate the effect that different values of y have on the popularity estimation accuracy and on the gains introduced by ArtSim+.
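The role of y can be sketched as a simple filter over publication years. This is an illustrative sketch only: the paper-record fields (`id`, `year`) and the strict "published within the last y years" cutoff are assumptions for the example, not the exact definition used internally by ArtSim+.

```python
def cold_start_papers(papers, current_year, y):
    """Return the ids of papers assumed to be in their 'cold start'
    phase, i.e., published within the last y years."""
    return [p["id"] for p in papers if current_year - p["year"] < y]

papers = [
    {"id": "p1", "year": 2021},
    {"id": "p2", "year": 2018},
    {"id": "p3", "year": 2020},
]
# With y = 3 in 2021, only p1 and p3 are treated as cold-start papers.
cold_start_papers(papers, current_year=2021, y=3)
```

Raising y widens this set, which is why larger y values increase both the number of papers whose scores ArtSim+ adjusts and its configuration time (see Figure 7).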
Therefore, the GSA-based automatic configuration process (here denoted as GSA) focuses on finding the best values for ArtSim+'s α, β, γ, and δ. In our experiment, we used GSA (in particular, the implementation of GSA in SciPy, assuming Kendall's τ or nDCG@k as our objective function and setting q_v = 2.62, q_a = −5, and initial temperature T_0 = 5,230) to find the best configuration of ArtSim+ in terms of accuracy (Kendall's τ) for y = 3. We also used two alternative configuration processes: a (full) grid search that examines all distinct α, β, γ, δ values in [0, 1] with a step of size 0.1 (GS1) and a grid search using a step of size 0.01 (GS2). In addition, because ArtSim can be configured in a similar way (however, with only three instead of four parameters), we included it in the experiment as well. The execution times for all configuration approaches are depicted in Figure 6, while the accuracy achieved by each revealed configuration is presented in Table 4 (best accuracy highlighted in bold).

First of all, it is apparent from Table 4 that, in almost all cases, GS2 and GSA identify configurations that result in improved accuracy compared to the best configuration identified by GS1. Of course, the main benefit of GS1 is that it is significantly faster than the other two processes, at the cost of not finding the optimal configuration. Although GS2 can identify configurations that achieve improved accuracy, the computational cost of a full search in such a fine grid is very large. In particular, in the case of ArtSim, GS2 was found to be 35-40% slower than GSA, while in the case of ArtSim+ (which has one extra parameter that needs to be tuned) the execution time of GS2 was so large that it had not finished after 5,000 minutes, whereas GSA finished in less than 500 minutes.
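The GSA parameters quoted above (q_v = 2.62, q_a = −5, T_0 = 5,230) correspond to the `visit`, `accept`, and `initial_temp` arguments of SciPy's `dual_annealing`, which implements generalized simulated annealing. The sketch below shows how such a configuration search could be set up; the quadratic `objective` is a toy stand-in for the real one, which would run ArtSim+ with the candidate (α, β, γ, δ) values and return the negative Kendall's τ (negative because `dual_annealing` minimizes).

```python
from scipy.optimize import dual_annealing

# Toy stand-in for the real objective: in practice, this would apply
# ArtSim+ with the given (alpha, beta, gamma, delta) configuration and
# return the *negative* Kendall's tau on the validation data.
def objective(x):
    alpha, beta, gamma, delta = x
    return -(0.5 - (alpha - 0.1)**2 - (beta - 0.2)**2
                 - (gamma - 0.4)**2 - (delta - 0.3)**2)

bounds = [(0.0, 1.0)] * 4  # alpha, beta, gamma, delta all lie in [0, 1]
result = dual_annealing(
    objective, bounds,
    initial_temp=5230.0,  # T_0 in the paper's setup
    visit=2.62,           # q_v
    accept=-5.0,          # q_a
    seed=42,
)
print(result.x, -result.fun)  # best configuration and its (toy) accuracy
```

Compared to the roughly 101^4 (about 10^8) evaluations of a full GS2 grid, the annealing search needs only a bounded number of objective calls, which is consistent with the large runtime gap reported above.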
As an additional remark, our experiments reveal that considering venue-based paper similarity (i.e., exploiting the PVP metapath) is a valuable addition to our approach. The first clue to this is that ArtSim+ outperforms ArtSim (see Table 4); a second is that, in most of our experiments (presented in both the current and the previous section), the best ArtSim+ accuracy was achieved using a configuration for which γ > α, β (see Tables A1 and A2 of the Appendix).
As a final experiment, we investigated the effect of different values of parameter y on the efficiency of the GSA configuration process for ArtSim+. The results are presented in Figure 7. It is apparent that an increase in the value of y results in larger configuration times. This is because, as parameter y increases, the number of papers that are going through their cold start period increases; thus, ArtSim+ needs to perform calculations for more papers.

The previous sections outlined the improvements in estimation accuracy, in terms of Kendall's τ and nDCG@k, that ArtSim+ introduces when applied on top of existing popularity estimation methods. Although ArtSim+ exhibits gains in estimation accuracy for all considered popularity estimation methods in the configurations we examined, our evaluation makes some limiting assumptions.
First of all, as already mentioned, the impact of papers has multiple aspects; some of them may be captured (to an extent) by particular types of citation analysis, others can only be quantified by altmetrics, while there are also aspects that are very difficult to quantify. ArtSim+ focuses on estimating citation-based short-term impact, which is more formally described in Section 2.1; hence, we have not examined whether it is useful in estimating other aspects of scientific impact. It is also important to mention that, although related, impact and scientific merit are not completely (or even highly) correlated.
Moreover, ArtSim+ considers similarity between papers based on three specific dimensions (i.e., authors, topics, and publication venues, captured by the metapaths PAP, PTP, and PVP, respectively). Of course, the choice of the actual metapaths is not an inherent limitation of ArtSim+, as it can be adapted to also incorporate other metapaths. However, it is important to highlight that the currently tested version of ArtSim+ makes the aforementioned assumption regarding paper similarity. An additional limitation of the metapaths we chose to implement is that they are unconstrained; that is, they do not limit the paths to be considered according to the values of the attributes of the involved nodes or edges. Constrained metapaths (e.g., as used in Shi et al. (2016a)) could be used to tighten the focus of the similarities to be considered. For instance, for a recent paper, it may be useful to consider its similarity, based on metapath PAP, only to papers published in the last 10 years, as, intuitively, a paper in its cold start period is more likely to share similar popularity dynamics with the recent papers of a given author than with their older ones. Furthermore, the scholarly knowledge graph that ArtSim+ utilizes is based on AMiner's citation network data set (Tang et al., 2008) (see Section 4.1.1 for details). Although this is a popular data source used by many works (e.g., Dong, Chawla, & Swami, 2017; Shi, Li et al., 2016b; Sun, Barber et al., 2011a), it comprises a limited number of node types. Knowledge graphs with a richer schema, such as the Open Research Knowledge Graph (Jaradeh et al., 2019) and the OpenAIRE Research Graph (Manghi et al., 2019a, 2019b), would allow additional, more complex metapaths to be used when considering paper similarity.
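The idea of a constrained metapath can be illustrated with a minimal sketch of PAP-instance counting under a publication-year constraint. This is a simplified illustration under assumed inputs (plain dictionaries mapping papers to author lists and years), not the metapath machinery ArtSim+ actually runs on the knowledge graph; established measures such as PathSim normalize such counts further.

```python
def constrained_pap_count(target, candidates, authors_of, year_of, min_year):
    """Count P-A-P metapath instances (shared authors) between `target`
    and each candidate paper, keeping only candidates published in or
    after `min_year` (the metapath constraint)."""
    target_authors = set(authors_of[target])
    counts = {}
    for p in candidates:
        if p == target or year_of[p] < min_year:
            continue  # constraint: ignore papers older than min_year
        counts[p] = len(target_authors & set(authors_of[p]))
    return counts

authors_of = {"p1": ["a1", "a2"], "p2": ["a2", "a3"], "p3": ["a1", "a2"]}
year_of = {"p1": 2021, "p2": 2005, "p3": 2015}
# p2 is filtered out by the year constraint; p3 shares two authors with p1.
constrained_pap_count("p1", ["p2", "p3"], authors_of, year_of, min_year=2011)
```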
In addition, contrary to AMiner's graph, which focuses on computer science papers (because it is based on papers indexed in DBLP), the aforementioned knowledge graphs incorporate publications from various disciplines, paving the way for the investigation of domains with possibly distinct characteristics. In such cases, the importance of the examined similarity dimensions may differ, and alternative, unexamined dimensions may be of great importance.
Last but not least, for recently published papers, ArtSim+ assigns the average of the popularity scores of their similar papers. Although the average is a reasonable aggregation function, other options could be examined in future work, especially considering that popularity scores follow a power-law distribution. It could also be useful to consider a weighted scheme that incorporates the similarity score between papers in the aggregation process, as a paper may have significantly higher similarity scores with some papers than with others.
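The similarity-weighted scheme suggested above can be sketched as follows. The function and its inputs (a list of (similarity, popularity) pairs for the papers found similar to a cold-start paper) are hypothetical names introduced for this illustration; the plain average currently used by ArtSim+ is the special case where all similarity weights are equal.

```python
def weighted_popularity(similar, fallback=0.0):
    """Similarity-weighted average of the popularity scores of a
    cold-start paper's similar papers.

    `similar` is a list of (similarity, popularity) pairs; papers with
    higher similarity contribute proportionally more. Returns `fallback`
    when there are no similar papers (zero total similarity).
    """
    total_sim = sum(s for s, _ in similar)
    if total_sim == 0:
        return fallback
    return sum(s * pop for s, pop in similar) / total_sim

# A paper twice as similar contributes twice as much:
weighted_popularity([(0.8, 10.0), (0.4, 4.0)])  # (8.0 + 1.6) / 1.2 = 8.0
```

Compared to the plain average (here 7.0), the weighted estimate leans toward the more similar paper, which is the intended behavior when similarity scores vary widely.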

CONCLUSIONS
In this work, we presented ArtSim+, an approach that can be applied on top of an existing popularity estimation method to increase the accuracy of its results. The main intuition of our approach is that the popularity of papers in their cold start period can be better estimated based on the characteristics of older, similar papers. For our purposes, paper similarity is calculated by exploiting information stored in scholarly knowledge graphs. More specifically, the proposed approach considers similarities based on the authors, the venues, and the topics of the papers under consideration. Our experimental evaluation showcases the effectiveness of ArtSim+, yielding noteworthy improvements in terms of Kendall's τ correlation and nDCG when applied on top of five state-of-the-art popularity measures, also outperforming ArtSim, its predecessor, which was introduced in Chatzopoulos et al. (2020).
Future work could address ArtSim+'s current limitations, or apply its underlying ideas in different contexts (see Section 4.4). For example, it may be interesting to examine different types of (more complex) metapaths on HINs to calculate paper similarity. This could, in turn, reveal new semantics on what constitutes "more similar" papers, based on the underlying metapaths. Moreover, although ArtSim+ focuses on improving the estimation of paper popularity for cold start papers, similarity based on HINs could be used to improve the estimation of different types of paper impact, such as long-term impact or social media attention.

Figure 1 was designed using resources from www.flaticon.com.

APPENDIX: DETAILED CONFIGURATIONS
In this section, we present the exact parameter configurations that, according to our experiments, were found to perform best in terms of Kendall's τ (Table A1) and nDCG@k (Tables A2 and A3).