Abstract
Statistical modeling of scientific productivity and impact provides insights into bibliometric measures that are also used to quantify differences between individual scholars. The Q model decomposes the log-transformed impact of a published paper into a researcher capacity parameter and a random luck parameter. These two parameters are then modeled together with the log-transformed number of published papers (i.e., an indicator of productivity) by means of a trivariate normal distribution. In this work we propose a formulation of the Q model that can be estimated as a structural equation model. This formulation allows us to quantify the reliability of researchers’ Q parameter estimates, and it can be extended to incorporate person covariates or to estimate multivariate variants of the Q model. We empirically illustrate our approach and provide openly available code for R and Mplus.
1. INTRODUCTION
How is the impact of scholarly work explained by individual differences in the ability to take advantage of the current knowledge base? How is scientific productivity related to a scientist’s ability to create high-impact work? And how does luck factor into the equation? The Q model (Janosov, Battiston, & Sinatra, 2020; Sinatra, Wang et al., 2016) provides answers to these questions by decomposing the impact of a scholarly paper into a researcher’s Q parameter, which reflects their ability to create impact, and a luck component. The Q model is an influential, parsimonious statistical model of scientific productivity whose empirical evaluation allows us to better understand citation-based impact measures, which are ubiquitous in research evaluation. In fact, as a parsimonious and empirically validated model, the Q model itself provides a direct route for the assessment of individual researchers for practical purposes such as personnel selection in academia (Forthmann, 2023).
Typically, the parameters of the Q model are estimated by maximum likelihood, whereby the likelihood function is derived from a trivariate normal distribution formulated for the parameters of the Q model (i.e., the Q and the luck parameter) and the logarithm of the number of publications published by scientist i (i.e., i’s productivity). However, as described in more detail later, it seems that standard optimization functions cannot be used directly to obtain parameter estimates (i.e., without multiple runs and aggregation). Furthermore, software code for estimation is generally not freely available, preventing the usage of the model in applied research and as a vehicle for assessment purposes (Forthmann, 2023). Therefore, recent research has focused on alternative ways to estimate the model’s parameters; Forthmann (2023), for example, showed that a restricted variant of the Q model can be estimated as a generalized linear mixed model (GLMM).
In the present manuscript, we pursue these efforts further by showing how the Q model can be estimated in a structural equation model (SEM) framework (e.g., Bollen, 1989). Placing the model within this framework has a number of advantages: First, relying on the SEM for Q model parameter estimation has the advantage that established and well-known software packages can be used by substantive researchers. Hence, it is easier for them to use the Q model in their work on scientific productivity. Second, the SEM framework allows us to easily integrate productivity into the estimation of the model, which was left out for pragmatic reasons in previous work. Finally, when estimated as an SEM, the Q model can straightforwardly be extended in several interesting ways so that applied researchers are able to examine new and/or more fine-grained research questions. For example, explanatory manifest or latent person-level covariates can be used to model between-researcher differences in the Q parameter and in productivity, respectively, or they can relate interindividual differences in the Q parameter to changes in a longitudinally assessed person-level variable (e.g., peer review quality; Callaham & McCulloch, 2011).
In the following, we first describe the Q model. This is followed by showing how the Q model can be placed into an SEM. We then illustrate our approach by examining the scientific productivity of about 20,000 scientists of multiple fields. Finally, in the discussion section we highlight limitations of the proposed approach and outline some questions for future applied and methodological research.
2. THE Q MODEL
Citation-based measures are ubiquitous in research evaluation. This is illustrated by the widely used journal impact factor (Garfield, 1972) and by indices at the level of individual researchers such as the h-index (Hirsch, 2005). Receiving more citations, as reflected in such measures, is often understood as indicating a more impactful journal, paper, or researcher (Hartley, 2017; Mutz & Daniel, 2019; Pan & Fortunato, 2014). However, Sinatra et al. (2016) argued that using citation-based measures as impact proxies in research evaluation requires a solid understanding of the interplay between the impact of a paper, individual differences in productivity, the capacity to produce high-impact work, and luck. Therefore, they developed the Q model, which explains the impact of scholarly work by two unobserved components, namely a researcher’s capacity to take advantage of the current knowledge base to create impactful papers, and luck.
Applications of the Q model in empirical research involve estimating the elements of the mean vector μ and the covariance matrix Σ. However, the problem is that the luck parameter and the capacity parameter are unobserved (latent) variables. Sinatra et al. (2016) circumvented this problem by substituting the luck parameter with the difference between the log-transformed impact and the capacity parameter (i.e., this follows from Eq. 2) and then analytically integrating out the capacity parameter from the trivariate normal distribution. The resulting (marginal) log-likelihood function was then maximized with Matlab’s fmincon optimization function. To this end, Sinatra et al. (2016) created 10 different starting conditions for parameter estimation and ran the fmincon function 10 times for each of the starting conditions. The final parameter estimates were then obtained by averaging across all 100 runs. The reason for this approach is not made clear in the original work on the Q model and, hence, we simply conclude that standard optimization functions such as fmincon cannot be used directly (i.e., without multiple runs and aggregation). In addition, their software code is not freely available. Janosov et al. (2020) employed a covariance matrix adapting evolutionary strategy optimization for parameter estimation (Hansen & Ostermeier, 1996). However, they did not specify which software they used. Thus, currently no code for Q model parameter estimation by means of previously used optimization approaches is openly available. Hence, researchers who want to use the Q model would have to write their own optimization code.
3. THE Q MODEL AS A STRUCTURAL EQUATION MODEL
Here, we show that an SEM approach can be used to estimate the Q model parameters. Placing the model within the SEM framework has the advantage that it allows substantive researchers to use openly accessible software packages for Q model parameter estimation. In addition, as an SEM, the Q model can be easily extended, for example, to examine whether certain person-level covariates are related to the ability to create impactful research papers and to scientific productivity. In what follows, we demonstrate how the Q model can be expressed as an SEM. To accomplish this, some of the original Q model notation must be modified to conform to typical SEM notation. Additionally, certain parameters of the original Q model must be constrained based on theory and previous findings, and we outline these differences in notation and model constraints in Table 1. Therefore, the SEM Q model is a special case of the original Q model that is motivated by both theory and empirical evidence.
Table 1. Comparison of the original Q model and the SEM Q model
Original: parameter | Original: estimated | SEM: parameter | SEM: estimated |
---|---|---|---|
 | Yes | | Yes |
 | Yes | | Yes |
 | Yes | Fixed to 0 | No |
 | Yes | | Yes |
 | Yes | | Yes |
 | Yes | | Yes |
 | Yes | | Yes |
 | Yes | Fixed to 0 | No |
 | Yes | Fixed to 0 | No |
First, we must assume that the luck parameter is uncorrelated with both the capacity parameter and the number of papers (i.e., the two corresponding covariances are zero). We think that this is unproblematic, because capacity and productivity are person-level variables, while the luck parameter is specific to an article; the variables thus refer to different conceptual levels. Empirically, the two correlations are also practically zero (e.g., in Sinatra et al. (2016), both covariance terms were estimated as 0.00). Fixing these two covariances to zero (cf. Table 1) makes the SEM Q model a special case of the originally formulated Q model, in which these two parameters are freely estimated (Sinatra et al., 2016).
A further assumption is that the log-transformed citation and publication counts are used as observed variables. We make this assumption because the original Q model proposes a log-normal distribution for citation and publication counts and, for practical reasons, because SEM software does not necessarily allow specifying a link function (e.g., a log-link function). For example, while Mplus allows choosing a link function for categorical data (a logit or probit link can be chosen), this is not the case for other distributions such as the normal distribution, which cannot be combined with a log link (Muthén & Muthén, 1998). However, the original Q model is also estimated on log-transformed data, and only estimation of the Q model as a GLMM allows direct modeling of the raw data via a log link (Forthmann, 2023).
Figure 1 displays the Q model as an SEM (consult Table 1 to see which parameters are estimated). Researcher capacity to create high-impact work is represented by a latent factor, and all log-transformed citation count variables have a factor loading of one on this factor. In addition, the mean of this latent factor can be understood as the sum of the means of researcher capacity and the luck parameter. In the original Q model, both means were estimated (cf. Table 1). However, Forthmann (2023) argues that the two mean parameters are not uniquely identified. Therefore, estimating a single mean parameter seems to be more parsimonious. Furthermore, as mentioned above, log-transformed publication counts are incorporated into the Q model as a single-indicator latent variable with residual variance fixed to zero.
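To make this structure more concrete, the following sketch writes out the moments implied for two different papers α and β of researcher i when all loadings are one and the luck term is uncorrelated with capacity, with productivity, and across papers. The symbols are assumptions introduced here for illustration (log-impact \hat{c}, capacity \hat{Q}, luck \hat{p}, log-productivity \hat{N}) and may differ from the notation in Table 1; the luck term is taken to be centered at zero, so that the factor mean absorbs both means. The numerical values are the rounded lavaan estimates of the capacity and luck variances reported in Table 3.

```latex
\begin{align*}
\hat{c}_{i\alpha} &= \hat{Q}_i + \hat{p}_{i\alpha}, \\
\operatorname{Var}(\hat{c}_{i\alpha}) &= \sigma_Q^2 + \sigma_p^2, \\
\operatorname{Cov}(\hat{c}_{i\alpha}, \hat{c}_{i\beta}) &= \sigma_Q^2 \qquad (\alpha \neq \beta), \\
\operatorname{Cov}(\hat{c}_{i\alpha}, \hat{N}_i) &= \sigma_{QN}, \\
\frac{\sigma_Q^2}{\sigma_Q^2 + \sigma_p^2} &\approx \frac{0.46}{0.46 + 1.79} \approx 0.20 .
\end{align*}
```

Under these assumptions, roughly one fifth of the variance in the log-impact of a single paper would be attributable to researcher capacity, with the remainder attributed to luck.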
Finally, we mention that SEMs are typically estimated with data arranged in a wide format (i.e., rows reflect researchers and columns log-transformed citation counts plus a column for the log-transformed number of papers). As researchers differ in the number of papers they have published, we suggest generating Nmax = max(Ni) columns for the log-transformed citation counts (see Table 2 for an example, where Nmax is the maximum number of scholarly works across all researchers in a sample; see the R code provided in the OSF project for an example of how to generate the wide format data from the data in long format). If a specific researcher has published fewer than Nmax papers, the missing entries can be defined to be a missing value (see Table 2). For parameter estimation this is unproblematic, at least when a full information maximum likelihood approach is used.
Table 2. Example data in wide format for estimation of the Q model as an SEM for three researchers
Researcher | Log citations, paper 1 | Log citations, paper 2 | Log citations, paper 3 | Log citations, paper 4 | Log number of papers |
---|---|---|---|---|---|
1 | 4.17 | 3.22 | 2.83 | 0.69 | 1.39 |
2 | 3.00 | 3.58 | 2.40 | NA | 1.10 |
3 | 4.66 | 3.61 | NA | NA | 0.69 |
… |
Notes. Nmax = 4. Researchers 2 and 3 have published fewer than Nmax papers and, hence, some of their log-transformed citation count variables are NA = missing value. Subscript i is omitted from notation in the table for simplicity.
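The reshaping from long to wide format can be sketched as follows. This is a minimal illustration in plain Python and not the R code provided in the OSF project; the function name `long_to_wide` and the input layout (pairs of researcher id and log-transformed citation count) are assumptions made for this example. The input values are the non-missing entries of the example table above.

```python
import math
from collections import defaultdict


def long_to_wide(records):
    """Reshape (researcher_id, log_citations) pairs into wide-format rows.

    Each row holds the log-transformed citation counts of one researcher,
    padded with None (missing) up to n_max columns, followed by the
    log-transformed number of papers.
    """
    papers = defaultdict(list)
    for researcher_id, log_citations in records:
        papers[researcher_id].append(log_citations)

    # n_max is the largest number of papers of any researcher in the sample.
    n_max = max(len(values) for values in papers.values())
    wide = {}
    for researcher_id, values in papers.items():
        padding = [None] * (n_max - len(values))
        wide[researcher_id] = values + padding + [math.log(len(values))]
    return wide, n_max


# Long-format input: one (researcher id, log citation count) pair per paper.
long_records = [
    (1, 4.17), (1, 3.22), (1, 2.83), (1, 0.69),
    (2, 3.00), (2, 3.58), (2, 2.40),
    (3, 4.66), (3, 3.61),
]
wide_rows, n_max = long_to_wide(long_records)
```

Each row then has Nmax + 1 entries, and the last entry is the log-transformed number of papers, which is computed from the number of non-missing impact values.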
3.1. Extending the Q Model by Incorporating Covariates
3.2. Estimating the Q Model as an SEM
We note that maximum likelihood estimation in SEM software often minimizes a maximum likelihood discrepancy function to obtain the parameter estimates and not the log-likelihood function shown above (Bollen, 1989; Mulaik, 2010). This discrepancy function can be derived from the log-likelihood function when the data vector of all persons is “complete” (see Bollen, 1989, for the concrete steps). In case of the Q model this would mean that all researchers have the same number of scholarly publications, which is a highly unlikely scenario. Therefore, Eq. 13 is used for maximization, which is called full information maximum likelihood estimation in the SEM literature (cf. Arbuckle, 1996; Rosseel, 2021).
3.3. Estimating Reliability of Researcher Capacity Estimates
4. EMPIRICAL ILLUSTRATION
We illustrate the estimation of the Q model as an SEM using a data set that comprises about 20,000 scientists from multiple fields. We compare the SEM estimates with the GLMM estimates reported in Forthmann (2023). In addition, we include academic age as a person covariate to illustrate how the SEM framework allows explanatory covariates to be incorporated into the Q model.
4.1. Data Set
The data set comprises N = 20,296 scientists of multiple fields. It was created and used in a study by Liu, Wang et al. (2018). However, compared with the original work, we did not exclude scientists with fewer than 15 publications or career lengths of less than 20 years. This way, the findings are more comparable with the Q model estimates reported in Forthmann (2023). Impact was assessed in this data set by citation counts 10 years after publication and productivity was simply the number of published papers. As in previous work, we added a value of one to the citation counts to prevent technical issues in case of zero citations when the log-transformation is used. Finally, we used academic age as a covariate, which was computed as the difference between the last year in which a researcher has published and the first year in which they have published. We reanalyzed openly published data and ethics approval was not required, as per institutional and national guidelines.
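The variable preparation described above can be sketched for a single researcher as follows. The function name `prepare_researcher` and its argument names are hypothetical and chosen only for this illustration; the toy citation counts and publication years are made up and are not taken from the data set.

```python
import math


def prepare_researcher(citations_10y, publication_years):
    """Return log-impact values, log-productivity, and academic age.

    citations_10y: citation counts 10 years after publication, one per paper.
    publication_years: publication year of each paper.
    """
    # Add one before taking the log so that uncited papers remain defined.
    log_impact = [math.log(count + 1) for count in citations_10y]
    # Productivity is the number of published papers, log-transformed.
    log_productivity = math.log(len(citations_10y))
    # Academic age: last publication year minus first publication year.
    academic_age = max(publication_years) - min(publication_years)
    return log_impact, log_productivity, academic_age


# Toy input for one researcher with three papers (illustrative values only).
log_impact, log_productivity, academic_age = prepare_researcher(
    [0, 12, 3], [2001, 2005, 2010]
)
```

Note that academic age defined in this way is zero for a researcher whose papers all appeared in the same year.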
4.2. Analytical Approach
All of the scripts needed to replicate the findings reported here can be found in a repository of the Open Science Framework (https://osf.io/bezsj/). We provide lavaan (Rosseel, 2012) as well as Mplus (Muthén & Muthén, 1998) code to estimate the Q model as an SEM. The data for estimating with Mplus were prepared using the R package MplusAutomation (Hallquist & Wiley, 2018). The maximum number of publications (and, hence, the number of observed impact indicators in the SEM; cf. Figure 1) was Nmax = 256. Impact and productivity were log-transformed prior to analysis. The SEM Q model was specified as described above and depicted in Figure 1. The Q model was estimated by full information maximum likelihood as implemented in lavaan and Mplus. Factor scores in lavaan were obtained by means of the lavPredict() function, while in Mplus the FSCORES option of the SAVEDATA command was used. Reliability was computed according to Eq. 14.
4.3. Results
We found nearly identical Q model estimates regardless of the software used for SEM estimation (see Table 3). In addition, all estimates were very similar to the estimates obtained with the GLMM (see Table 3 again). The variances of both latent variables were comparable in size, with a somewhat larger variance for productivity (confidence intervals overlap only slightly). The variance of the luck parameter was larger than the latent variable variances. The latent mean of productivity was larger than the latent mean of researcher capacity. Importantly, the correlation between both latent variables was small, which could be interpreted in favor of fit of these data to the Q model. Finally, the estimated reliability of the factor scores (i.e., the researcher capacity estimates) was good.
Table 3. Q model estimates
Parameter | SEM (lavaan): estimate | SEM (lavaan): 95% CI | SEM (Mplus): estimate | SEM (Mplus): 95% CI | GLMM (Forthmann, 2023): estimate (a) | GLMM (Forthmann, 2023): 95% CI |
---|---|---|---|---|---|---|
Variance of researcher capacity | 0.46 | [0.45, 0.47] | 0.46 | [0.45, 0.47] | 0.46 | [0.45, 0.48] |
Variance of productivity | 0.48 | [0.47, 0.49] | 0.48 | [0.47, 0.49] | – | – |
Variance of luck | 1.79 | [1.78, 1.79] | 1.79 | [1.78, 1.79] | 1.80 | [1.77, 1.80] |
Latent mean of researcher capacity | 2.25 | [2.24, 2.26] | 2.25 | [2.24, 2.26] | 2.26 | [2.25, 2.27] |
Latent mean of productivity | 3.76 | [3.75, 3.77] | 3.76 | [3.75, 3.77] | – | – |
Cor(capacity, productivity) | .14 | [.13, .15] | .14 | [.13, .15] | .13 | [.12, .15] |
Rel(capacity) | .91 | | .90 | | .90 | |
Notes. (a) Forthmann (2023) reported standard deviations, which we squared here; these variance estimates are therefore slightly affected by rounding error. The notation used in this work differs slightly from that used in previous work.
4.3.1. Assessing academic age as a covariate
lavaan and Mplus results were nearly identical for the extended Q model. Hence, we report only the lavaan results here; the Mplus results can be found in the OSF repository. Academic age was negatively related to researcher capacity (standardized coefficient = −0.15, p < .001, 95% CI: [−0.16, −0.13], R2 = .02), while it was positively related to productivity (standardized coefficient = 0.46, p < .001, 95% CI: [0.45, 0.47], R2 = .21). Thus, scientists with longer careers published significantly more works, yet their researcher capacity tended to be slightly lower compared with scientists with shorter careers. In addition, the correlation between the residuals of researcher capacity and productivity was r = .24, p < .001, 95% CI: [.22, .25]. The latter finding implies that statistically controlling for academic age yields a relationship between researcher capacity and productivity that is less in accordance with the tenets of the Q model. In other words, among scientists of the same academic age, those with higher researcher capacity tended to publish more papers.
5. DISCUSSION
The Q model is a concise model of creative scientific productivity and the impact of researchers’ scholarly work with strong theoretical and empirical support (cf. Sinatra et al., 2016). In this paper we have shown and empirically illustrated how the Q model can be conceptualized and estimated as an SEM. We extended the original Q model by showing how to incorporate person covariates into the model. This extended Q model was also empirically illustrated by regressing researcher capacity and productivity on academic age. Finally, our R and Mplus code to estimate the Q model as an SEM is openly available to facilitate future work on the Q model.
5.1. SEM Q Model Estimation vs. GLMM Q Model Estimation
One of the main advantages of the Q model as an SEM is the availability of well-established statistical software that can be used for more efficient Q model estimation. This advantage of the SEM approach is shared by the GLMM approach to estimate the Q model (Forthmann, 2023). However, while the overlap of both frameworks has been emphasized in the literature (Curran, 2003; Nestler & Humberg, 2024), we think that both frameworks should be in a researcher’s toolbox, as they focus on different levels of the Q model.
When formulating the Q model as an SEM, the focus is clearly more strongly on the person level. This is emphasized by the data structure (wide format with one row per researcher in the sample), the modeling of the main latent variables at the person level, and the option to extend the Q model by incorporating person covariates. Importantly, within the SEM, the resulting coefficients, such as regression coefficients or residual correlations, can be readily interpreted at the person level. While it is also possible to include person-level covariates in the Q model as a GLMM, the effects of such covariates are more indirectly related to the latent variable(s). For example, when adding academic age as a person covariate to the GLMM Q model, the regression coefficient would reflect the relationship between academic age and impact, while the effect on researcher capacity would be more indirectly assessed by a potential reduction of the researcher capacity variance (i.e., variability in researcher capacity could be partially explained by academic age).
By contrast, the Q model as a GLMM has an inherent focus on the level of individual papers. Thus, extending the GLMM Q model by paper-level covariates seems straightforward and has been discussed in previous work (Forthmann, 2023). It is possible to include paper-level covariates (such as the impact of the journal in which the paper is published) within the SEM, but this is slightly more complex compared to the GLMM. Specifically, for a single covariate, one would have to reshape the data of the covariate to the wide format (i.e., as with the impact data), predict each observed citation vector with the respective predictor vector, and fix the regression coefficient to the same value across all of these regressions. Thus, within an SEM, it is possible to model person level and paper level at the same time in a Q model analysis. However, when applied research focuses solely on paper-level covariates we recommend relying on the GLMM formulation of the Q model, while a sole focus on person-level covariates suggests using the SEM formulation introduced in the current work.
Beyond the question of level, we think that using the SEM has the further advantage that it is much easier to model multivariate data. For example, the SEM framework allows us to relate a latent person covariate (e.g., cognitive ability) to the Q model latent variables, which is not possible in standard implementations of the GLMM (e.g., in R). Similarly, when looking at the paper level rather than the person level as in the example before, an extended latent version of the Q model that incorporates observed indicators other than impact (e.g., whether the paper has been published together with international coauthors; Mutz & Daniel, 2018) could also be estimated within an SEM. Finally, one could empirically test whether the estimation of researcher capacity can be further informed by the number of third-party funded projects and their impact in a bivariate Q model. Altogether, this shows the potential of the SEM Q model to investigate a number of exciting new research questions.
Finally, we note that we estimated the full Q model in an SEM here. This is advantageous, because the covariance between the two latent variables (i.e., the covariance between capacity and productivity) is estimated in one step that takes the measurement error of researcher capacity into account. In previous work that relied on the GLMM, productivity was left out of the model, so that this covariance was obtained by a two-step approach (i.e., the capacity estimates were correlated with the productivity scores), making it susceptible to attenuation effects (Wang, Chen, & Cheng, 2004). Although the one-step and the two-step estimates were only negligibly different in our illustrative example (because of the high reliability of the researcher capacity estimates), this may be important in the analysis of other data sets.
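The size of the attenuation can be gauged with a rough calculation. The sketch below uses the classical correction for attenuation and assumes that productivity is measured without error and that the capacity estimates behave like observed scores with reliability Rel; the symbols ρ (one-step latent correlation) and r (two-step correlation) are introduced here for illustration only. The values are the rounded lavaan estimates from Table 3.

```latex
\begin{align*}
r_{\text{two-step}} &\approx \rho \, \sqrt{\mathrm{Rel}}, \\
0.14 \times \sqrt{0.91} &\approx 0.14 \times 0.954 \approx 0.13 .
\end{align*}
```

Under these assumptions, the expected two-step correlation is close to the value of .13 reported for the GLMM in Table 3, which is in line with the negligible difference noted above. With less reliable capacity estimates, the gap would be expected to widen.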
5.2. Limitations and Future Directions
When estimating the Q model as an SEM, one needs to rely on full information maximum likelihood estimation, because researchers differ in their productivity and an impact variable will have missing values for all researchers publishing fewer than Nmax papers. When the number of papers is highly heterogeneous, very low levels of covariance coverage are inevitable. This was also the case for the data used in the current work. The estimation routines issued warnings in this regard, yet given that the findings obtained from SEM estimation were nearly identical to previous estimates obtained from the GLMM, we did not see a reason to consider the findings suspicious. Nonetheless, we do not yet know the technical conditions under which SEM estimation of the Q model becomes infeasible. We recommend that future work take a closer look at related issues while keeping an eye on sample size (some assessment contexts in academia imply rather small samples), heterogeneity in productivity, and potential solutions such as randomly distributing the impact variables of each researcher across the columns of the data set to reduce heterogeneity of covariance coverage across variable pairs.
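Covariance coverage, that is, the proportion of researchers for whom both variables of a pair are observed, can be inspected before estimation. The following is a minimal sketch in plain Python; the function name `covariance_coverage` is hypothetical, and the rows are the three researchers from the wide-format example in Table 2, with None marking a missing value.

```python
from itertools import combinations


def covariance_coverage(rows):
    """Proportion of rows in which both variables of a pair are observed."""
    n_rows = len(rows)
    n_cols = len(rows[0])
    coverage = {}
    for j, k in combinations(range(n_cols), 2):
        # Count rows where neither of the two columns is missing.
        both_observed = sum(
            1 for row in rows if row[j] is not None and row[k] is not None
        )
        coverage[(j, k)] = both_observed / n_rows
    return coverage


# Columns 0 to 3: log citation counts; column 4: log number of papers.
rows = [
    [4.17, 3.22, 2.83, 0.69, 1.39],
    [3.00, 3.58, 2.40, None, 1.10],
    [4.66, 3.61, None, None, 0.69],
]
coverage = covariance_coverage(rows)
```

In this toy example, the pair of the third and fourth impact columns is jointly observed for only one of the three researchers. With Nmax = 256 and heterogeneous productivity, coverage for pairs of late columns can become very small, which is the situation the warnings refer to.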
In our empirical illustration of the extended Q model, we focused on academic age. One could imagine other relevant covariates, such as cognitive ability (Rodgers & Maranto, 1989), yet even looking only at a linear relationship between academic age and researcher capacity or productivity is somewhat limited. Researcher capacity as conceptualized in the Q model has been previously studied in research on creative productivity (also in domains other than science). According to this research, capability in a domain should vary across a career, and empirical findings suggest a negative quadratic effect of time on individual differences in the capability to produce quality works (Forthmann et al., 2021; Hass & Weisberg, 2015; Kozbelt, 2011). However, these works tested the relationships between fixed time intervals and capability for careers of roughly the same length. Thus, assessing academic age is somewhat different, and going into the details of various operationalizations of career length is clearly beyond the scope of this work. We think, however, that the SEM formulation of the Q model provides a sound architecture for evaluating such complex effects. Nevertheless, our illustration here has shown that evaluating the correlation between researcher capacity and productivity can be confounded by other variables. Hence, we strongly recommend taking person covariates into account when one wishes to evaluate the Q model.
6. CONCLUSION
The Q model is a parsimonious chance model of scientific productivity. The original work on the Q model used rather computationally intensive optimization routines. In this work we proposed an SEM formulation of the Q model which has the potential to pave the way for new empirical and methodological work on the model. Which person covariates best explain researcher capacity and productivity? How might organizational variables such as reputation of a researcher’s department (Helmreich, Spence et al., 1980; Rodgers & Maranto, 1989) factor in? What are the technical conditions under which Q model estimation within SEM works well? We hope that our work and openly available material will be helpful for the field to study these timely issues related to statistical modeling of scientific productivity.
ACKNOWLEDGMENTS
We acknowledge support from the Open Access Publication Fund of the University of Münster.
AUTHOR CONTRIBUTIONS
Boris Forthmann: Conceptualization, Formal analysis, Methodology, Software, Validation, Visualization, Writing—original draft, Writing—review & editing. Steffen Nestler: Conceptualization, Formal analysis, Methodology, Writing—original draft, Writing—review & editing.
COMPETING INTERESTS
The authors have no competing interests.
FUNDING INFORMATION
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
DATA AVAILABILITY
All analysis scripts are available at the Open Science Framework: https://osf.io/bezsj/.
Note
Note that the hat notation in the original Q model refers to the log-transformed model parameters and does not imply an interpretation as an estimate or estimator, as is common in statistics.
Author notes
Handling Editor: Vincent Larivière