Abstract
The diversity of analysis frameworks used in different fields of quantitative research is understudied. Using bibliometric data from the Web of Science (WoS), we conduct a large-scale, cross-disciplinary assessment of the proportion of articles that use linear models compared to other analysis frameworks from 1990 to 2022 and investigate the associated spatial and citation patterns. We find that, in absolute terms, linear models are widely used across all fields of science. In relative terms, three patterns suggest that linear-model-based research is a dominant analysis framework in the Social Sciences. First, almost two-thirds of research articles reporting a statistical analysis framework reported linear models. Second, research articles from countries underrepresented in the WoS data displayed the highest proportions of articles reporting linear models. Third, there was a citation premium for articles reporting linear models: they were more likely to be cited at least once throughout the entire period and received more citations on average until 2012. The confluence of these patterns may not be beneficial to the Social Sciences, as it could marginalize theories incompatible with the linear model framework. Our results have implications for quantitative research practices, including the teaching and training of the next generations of scholars.
1. INTRODUCTION
Diversity is an asset for the practice of scientific research. The benefits of diversity apply to how phenomena are represented in concepts and theories and how data are collected and analyzed. The diversity of quantitative analysis frameworks used in different fields of science is understudied (Koppman & Leahey, 2019; Leahey, 2005, 2008). A large-scale and cross-disciplinary assessment of the prevalence of linear models compared to other analysis frameworks is lacking. In the first part of this study, we measure the proportion of research articles that rely on linear models from 1990 to 2022. We refer to this measure as linear-model-based research prevalence. We present separate results for six macro fields of science (OECD, 2007) and 254 subdisciplines (Milojević, 2020; Pudovkin & Garfield, 2002). In the second part, we study the spatial patterns of the country-level prevalence of linear-model-based research in the Social Sciences. Finally, we compare the average number of citations between articles using linear models and articles using other statistical methods to assess the potential existence of a citation premium for linear-model-based research, namely, a larger average number of citations than for research based on other methods.
Although we remain neutral about the ideal prevalence of linear-model-based research, investigating any sustained high prevalence, spatial patterns, and citation premiums related to it is essential. Assessing the implications of these findings can drive actions and discussions to preserve methodological pluralism in research, ensuring a robust and diverse approach to research.
1.1. What Are Linear Models?
Linear models relate an outcome variable Y to a set of independent variables X through a linear combination of regression coefficients b, for example Y = b0 + b1X1 + … + bkXk + e, where e is an error term; generalized variants link a transformation of the expected outcome to the same linear combination. Given data and under some assumptions regarding the distribution of Y, the distribution and correlation of the errors, and the relationships among variables and across units of observation, the values of b (i.e., the regression coefficients) can be estimated; they measure the marginal association between the independent variables and the outcome variable. When some variables in X are randomly assigned or exogenous, their coefficients can be interpreted in causal terms. Furthermore, there are several techniques from Frequentist and Bayesian frameworks for assessing the statistical significance of differences between regression coefficients and constant values (e.g., b1 = 0) or among the coefficients themselves (e.g., b1 > b2). In addition, different combinations of independent variables can be evaluated and compared regarding their predictive accuracy for the outcome.
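To make these elements concrete, the following minimal sketch, written in Python with the statsmodels library on simulated data, illustrates estimating the coefficients b, testing hypotheses such as b1 = 0 or b1 = b2, and comparing specifications by fit; it is purely illustrative and not the analysis code used in this study.

```python
# Illustrative sketch with simulated data (not the authors' analysis code).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 500
X = rng.normal(size=(n, 2))                                   # two independent variables
y = 1.0 + 0.5 * X[:, 0] - 0.2 * X[:, 1] + rng.normal(size=n)  # outcome with known coefficients

X_design = sm.add_constant(X)            # adds the intercept column (b0)
fit = sm.OLS(y, X_design).fit()          # ordinary least squares estimation of b

print(fit.params)                        # estimated regression coefficients
print(fit.pvalues)                       # Frequentist tests of b = 0
print(fit.t_test("x1 - x2 = 0"))         # test of a difference between coefficients (b1 = b2)
print(fit.aic)                           # compare alternative specifications by fit
```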
1.2. What Are Other Analysis Frameworks?
Alternative analysis frameworks (other methods herein) have been developed, for instance, in France in the 1960s and 1970s (Armatte, 2008; Lebaron & Le Roux, 2015) and in the United States more recently (Abbott, 2016; Koppman & Leahey, 2019; Ragin, 2014). These are all statistical techniques that do not require the specification of a dependent variable and do not aim to measure conditional associations or effects between the dependent and independent variables. Examples of other methods include network analysis, correspondence analysis, principal component analysis, sequence analysis, agent-based simulation, and cluster analysis.
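To contrast with the regression framework above, the following minimal sketch, written in Python with scikit-learn on simulated data, applies two such methods (principal component analysis and k-means clustering) without specifying any dependent variable; it is purely illustrative and not tied to the analyses in this study.

```python
# Illustrative sketch with simulated data: "other methods" that describe the joint
# structure of the data without a dependent variable.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))   # 300 observations described by six variables, no outcome

components = PCA(n_components=2).fit_transform(X)            # low-dimensional summary of X
clusters = KMeans(n_clusters=3, n_init=10).fit_predict(X)    # groups of similar observations

print(components[:3])           # coordinates of the first observations on the two components
print(np.bincount(clusters))    # number of observations per cluster
```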
Measuring the prevalence of linear-model-based research in comparison to research using other methods is essential because the lack of diversity in methods, or methodological monotheism, as termed by Bourdieu and Wacquant (1992, pt. III), could translate into a lack of diversity in narratives, standardized questions, and a narrowing of the breadth of scientific research. Although the consolidation of mainstream narratives indicates increased specialization (Leahey, Keith, & Crockett, 2010), it can also signal the marginalization of niche areas (Erola, Reimer et al., 2015; Moody & Light, 2006). Just as there are publication biases against so-called null-result research (Fanelli, 2012), methodological monotheism (or the lack of pluralism) can potentially prevent or exclude atypical or disruptive research endeavors deemed not to conform to the known traditions of standard science (Buyalskaya, Gallo, & Camerer, 2021; Lamont & Swidler, 2014; Leahey, 2008; Leahey, Lee, & Funk, 2021; Lin, Evans, & Wu, 2022; Teplitskiy, 2016; Uzzi, Mukherjee et al., 2013).
1.3. Are Linear Models Hegemonic in the Social Sciences Compared to Other Methods?
In the Supplementary material, we have assembled seven telling quotations from books and review articles from the 1970s to the present that illustrate scholars’ concerns regarding the perceived high prevalence of linear-model-based research as an analysis framework in the Social Sciences. This list includes Hirschman’s assessment of fertility research, one of the most studied topics by the 1990s (Kohler, 2010; Mason, 1997):
The standard social science model is that society works pretty much like a regression equation: the task is to find a right set of predictors, solve the equation, and discover what factors are the most important in predicting social outcomes. This framework does lead to empirical generalizations, but there seem to be endless qualifications about the measurement of variables, the meaning and interpretation of variables, the substitutability of one variable for another, and complex interactions with historical settings. If science is to discover parsimonious principles that explain complex patterns, we do not seem to be making progress (Hirschman, 1994, p. 256)
In this context, our goal is to measure the prevalence of linear-model-based research and its trend over the last three decades (1990–2022). We present results for six major fields of science to gain comparative insights. Next, we document global patterns of linear-model-based research prevalence in the Social Sciences and discuss how they relate to existing inequalities in knowledge production and the Global South’s intellectual dependency (Krause, 2016; Quijano, 2000). These discussions are based on existing critiques of the untested hegemony of linear-model-based research (Gollac, 2004; Lebaron, 2003), its relations to the dominant analytical approach to studying the social world (Henrich, 2020; Nisbett & Masuda, 2003), and the Eurocentric historical nature of the Social Sciences (Bhambra & Holmwood, 2021; Zuberi & Bonilla-Silva, 2008).
2. DATA AND MEASURES
We use publication metadata from 40,603,923 articles from 1990 to 2022 indexed in Clarivate’s Web of Science (WoS), provided by the German Competence Network for Bibliometrics (German Competence Network for Bibliometrics, 2021) via the Max Planck Digital Library. We excluded all research articles with no abstract. WoS is one of the most exhaustive bibliometric databases, covering more than 21,000 journals (Clarivate, 2022; Visser, van Eck, & Waltman, 2021). English-language journals from Western countries such as the United States and the United Kingdom are overrepresented in WoS; together with China, these two countries comprise the largest shares of articles in the WoS database (Falagas, Pitsouni et al., 2008; Mongeon & Paul-Hus, 2016; Norris & Oppenheim, 2007). This bias implies that our results concern a specific area of existing research, mainly produced in the Global North and China (see Table 1 and Figure 1).
Table 1. Articles’ visibility measures, control variables, and first authors’ locations, by sample and period. The first four data columns refer to articles reporting quantitative methods or data in their abstract; the last four refer to articles reporting quantitative methods in their abstract.

| Visibility measures and control variables | 1990–2012 Value | S.D. | 2013–2018 Value | S.D. | 1990–2012 Value | S.D. | 2013–2018 Value | S.D. |
|---|---|---|---|---|---|---|---|---|
| Average number of citations three years after publication | 3.0 | 4.8 | 4.9 | 9.0 | 3.2 | 4.9 | 4.9 | 9.6 |
| Proportion of articles with no citation three years after publication | 0.28 | 0.45 | 0.17 | 0.38 | 0.26 | 0.44 | 0.16 | 0.37 |
| Proportion of articles mentioning linear models in abstracts | 0.34 | 0.48 | 0.37 | 0.48 | 0.64 | 0.48 | 0.67 | 0.47 |
| Proportion of articles from single countries | 0.64 | 0.48 | 0.55 | 0.50 | 0.63 | 0.48 | 0.54 | 0.50 |
| Articles’ year of publication | 2005 | 5.8 | 2016 | 1.7 | 2005 | 5.8 | 2016 | 1.7 |

| First author location | 1990–2012 Total | % | 2013–2018 Total | % | 1990–2012 Total | % | 2013–2018 Total | % |
|---|---|---|---|---|---|---|---|---|
| United States | 167,737 | 42.8 | 116,243 | 30.5 | 89,728 | 42.4 | 64,883 | 30.5 |
| United Kingdom | 35,997 | 9.2 | 29,988 | 7.9 | 17,263 | 8.1 | 15,387 | 7.2 |
| China | 11,565 | 3.0 | 24,704 | 6.5 | 6,985 | 3.3 | 15,293 | 7.2 |
| Germany | 14,519 | 3.7 | 18,554 | 4.9 | 8,250 | 3.9 | 10,784 | 5.1 |
| Australia | 14,692 | 3.7 | 17,944 | 4.7 | 7,208 | 3.4 | 8,892 | 4.2 |
| Canada | 20,435 | 5.2 | 15,986 | 4.2 | 11,564 | 5.5 | 9,157 | 4.3 |
| Spain | 9,208 | 2.3 | 12,492 | 3.3 | 5,704 | 2.7 | 7,450 | 3.5 |
| Netherlands | 13,399 | 3.4 | 12,609 | 3.3 | 7,743 | 3.7 | 6,965 | 3.3 |
| Italy | 6,930 | 1.8 | 9,954 | 2.6 | 4,147 | 2.0 | 6,195 | 2.9 |
| France | 7,494 | 1.9 | 8,125 | 2.1 | 4,253 | 2.0 | 4,737 | 2.2 |
| Central and Southern Asia | 4,238 | 1.1 | 6,774 | 1.8 | 2,345 | 1.1 | 3,972 | 1.9 |
| Eastern and Southeastern Asia | 22,867 | 5.8 | 26,067 | 6.8 | 13,699 | 6.5 | 14,885 | 7.0 |
| Europe and North America | 41,637 | 10.6 | 52,028 | 13.7 | 21,974 | 10.4 | 28,132 | 13.2 |
| Latin America and the Caribbean | 6,218 | 1.6 | 9,821 | 2.6 | 3,390 | 1.6 | 5,620 | 2.6 |
| Northern Africa and Western Asia | 9,975 | 2.5 | 12,301 | 3.2 | 5,738 | 2.7 | 6,824 | 3.2 |
| Sub-Saharan Africa | 5,001 | 1.3 | 7,228 | 1.9 | 1,857 | 0.9 | 3,511 | 1.7 |
| Total | 391,912 | 100 | 380,818 | 100 | 211,848 | 100 | 212,687 | 100 |
We limit our analysis to the Article document type, and we focus on abstracts to ensure a large temporal and spatial scope. Full-text analyses are impractical at this scale, and existing full-text databases are more limited than bibliometric databases. We are aware that authors may not report the methods they use in the abstract. For example, if a method is standardized and widely used, there may be no need to mention it in the article’s abstract (Blake, 2010; Cohen, Johnson et al., 2010; Westergaard, Stærfeldt et al., 2018). Different method-reporting patterns across disciplines and methods may affect our results in ways we cannot predict in advance. However, it seems reasonable to assume that abstracts from theoretically oriented disciplines (e.g., the Humanities and the Natural Sciences) and abstracts from articles using mainstream methods are less likely to report methods. In these hypothetical cases, the unreported methods are not central to the articles’ contribution, and they are not a novelty worth highlighting to potential readers. Such biases have been documented regarding the country of study, whereby articles about the United States and Western European countries are less likely to report the country of study—the default location in Eurocentric terms—in titles and abstracts than articles about other regions of the world (Castro Torres & Alburez-Gutierrez, 2022; Kahalon, Klein et al., 2021).
2.1. Prevalence Measures
To partially counter the potential biases implied by looking only at abstracts, we compute four different measures of the prevalence of linear-model-based research. These measures differ in the set of articles included in the denominator, ranging from a relatively small and specific set that allows us to approximate the prevalence of linear-model-based research closely, to a larger and more general analytical sample used to check the consistency of our results over time.
Our primary measure of interest is the proportion of articles that report linear models in their abstract among all articles reporting any statistical method. In a second prevalence measure, articles that report linear models and other methods simultaneously are excluded from the numerator of the first measure. Third, we compute the prevalence of linear models using the total number of articles reporting methods or quantitative data as the denominator. The complete list of terms referring to statistical methods and quantitative data is available in the Supplementary material. Using these terms, we select 7,164,784 articles. The terms referring to statistical methods are organized according to whether they refer to a linear model (e.g., regression analysis) or other methods (e.g., correspondence analysis). Each term is classified as general or specific depending on whether it uses a generic or specific name for the method (e.g., regression vs. logistic regression). We built an initial list of 57 terms. We circulated this list among six colleagues with diverse disciplinary backgrounds, including health sciences, natural and social sciences, and statistics, asking them to add missing methods or comment on the ones already included. After a few exchanges and clarifications with colleagues, we consolidated a list of 73 terms; this list was further extended to 163 while conducting the literature review and adding both hyphenated/nonhyphenated names and American and British spelling conventions. Of these 163 terms, 15 refer to quantitative data, 110 refer to linear models (48 general, 50 specific, 12 causality-related words), and 38 refer to other methods (18 general, 20 specific).
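As a rough illustration of how such term lists can be matched against abstracts, the sketch below, written in Python, uses a tiny hypothetical subset of the terms (the full 163-term list is in the Supplementary material) and simple whole-word matching; it is not the search implementation used in the study.

```python
# Illustrative sketch: classifying abstracts by matching curated term lists.
# The term lists below are a tiny, hypothetical subset of the full list.
import re

LINEAR_MODEL_TERMS = ["regression", "logistic regression", "multilevel model"]
OTHER_METHOD_TERMS = ["correspondence analysis", "cluster analysis", "sequence analysis"]

def mentions_any(abstract, terms):
    """Return True if the abstract contains any of the terms as whole words."""
    text = abstract.lower()
    return any(re.search(r"\b" + re.escape(term) + r"\b", text) for term in terms)

def classify(abstract):
    return {
        "linear_model": mentions_any(abstract, LINEAR_MODEL_TERMS),
        "other_method": mentions_any(abstract, OTHER_METHOD_TERMS),
    }

print(classify("We fit a multilevel model and perform a cluster analysis of survey data."))
# {'linear_model': True, 'other_method': True} -> reports both a linear model and another method
```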
Finally, we compute the prevalence of linear-model-based research among all articles whose abstract uses at least once one of the following seven words: model, data, evidence, empirical, results, method, or methods, together with at least one of the following five words: analysis, analyze, analyse, study, or investigate. We argue that these keywords allow us to identify potentially empirical papers dealing with processing, analyzing, or interpreting data. Using these terms, we enlarge our analytical sample to 13,720,556 articles. By enlarging the denominator, this last measure of prevalence helps us assess the consistency of our results over time from a more conservative and strict perspective.
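The following sketch shows one straightforward way to implement this keyword filter; it is illustrative only, and the exact matching rules used in the study may differ.

```python
# Illustrative sketch of the broad "potentially empirical" filter described above.
import re

GROUP_A = {"model", "data", "evidence", "empirical", "results", "method", "methods"}
GROUP_B = {"analysis", "analyze", "analyse", "study", "investigate"}

def is_potentially_empirical(abstract):
    """True if the abstract uses at least one word from each group."""
    words = set(re.findall(r"[a-z]+", abstract.lower()))
    return bool(words & GROUP_A) and bool(words & GROUP_B)

print(is_potentially_empirical("We study new data on internal migration."))   # True
print(is_potentially_empirical("An essay on the idea of modernity."))         # False
```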
2.2. (In)visibility Measures
We compute two measures of papers’ visibility: the total number of citations received in the 3 years following publication and whether the paper received at least one citation during the same period. The 3-year time window ensures comparability between older and more recent papers and is justified because articles’ citations generally mature by the third year after publication, allowing for disciplinary differences (Bornmann & Tekles, 2019; Wang, 2013). Considering citations in only 1 year could penalize articles published in the later months of that year (Donner, 2018).
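The sketch below shows how these two measures could be derived from a citation-level table; the column names and records are hypothetical and serve only to illustrate the 3-year window.

```python
# Illustrative sketch: computing the two visibility measures from citation records.
import pandas as pd

# Hypothetical input: one row per citation, with the cited article's publication year.
citations = pd.DataFrame({
    "article_id": [1, 1, 2, 3, 3, 3],
    "pub_year":   [2005, 2005, 2010, 2015, 2015, 2015],
    "cite_year":  [2006, 2012, 2011, 2016, 2017, 2021],
})

# Keep only citations received within 3 years of publication.
in_window = citations[citations["cite_year"] <= citations["pub_year"] + 3]
n_cites_3y = in_window.groupby("article_id").size()

articles = pd.DataFrame({"article_id": [1, 2, 3, 4]}).set_index("article_id")
articles["citations_3y"] = n_cites_3y.reindex(articles.index, fill_value=0)
articles["cited_at_least_once_3y"] = articles["citations_3y"] > 0
print(articles)
```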
2.3. Measuring Linear-Model-Based Research Relative Visibility
To examine linear-model-based research’s advantage in visibility outcomes, we fitted a series of negative binomial (link function = log) and binomial (link function = logit) models predicting our two visibility measures. The negative binomial distribution is suitable for modeling the number of citations in the first 3 years, given the strongly skewed distribution of this outcome. The binomial distribution is appropriate for modeling whether articles receive at least one citation. Our primary variable of interest in these models is whether or not the authors reported a linear model in their abstract. Hence, the regression coefficients of our variable of interest measure the difference in the average number of citations and in the odds of having at least one citation between linear-model-based research and quantitative research reporting other methods or quantitative data.
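As a schematic illustration of these two model families, the sketch below fits a negative binomial and a logit model with statsmodels on simulated data; the variable names are hypothetical, the controls mirror those described in Section 2.4, and this is not the estimation code used in the study.

```python
# Illustrative sketch with simulated data (not the authors' estimation code).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 2000
df = pd.DataFrame({
    "linear_model": rng.integers(0, 2, n),               # 1 if the abstract reports a linear model
    "single_country": rng.integers(0, 2, n),             # 1 if all authors are from one country
    "pub_year_c": rng.integers(1990, 2013, n) - 2001,    # publication year, centered
})
mu = np.exp(0.8 + 0.1 * df["linear_model"] - 0.2 * df["single_country"])
df["citations_3y"] = rng.poisson(rng.gamma(shape=1.0, scale=mu))   # overdispersed count outcome
df["cited_once"] = (df["citations_3y"] > 0).astype(int)

# Average number of citations: negative binomial regression (log link).
nb = smf.negativebinomial("citations_3y ~ linear_model + single_country + pub_year_c", data=df).fit(disp=0)
# At least one citation: binomial regression (logit link).
logit = smf.logit("cited_once ~ linear_model + single_country + pub_year_c", data=df).fit(disp=0)

print(nb.params["linear_model"])      # log difference in expected citations
print(logit.params["linear_model"])   # log-odds difference of being cited at least once
```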
2.4. Geographical and Temporal Differences
To account for geographical differences in citation patterns, we stratified our models by the top 10 countries in terms of the number of articles, grouping the remaining countries into the United Nations Sustainable Development regions (United Nations, 2017). To assess potential changes over time in the visibility advantage of linear-model-based research, we ran separate models for articles published before and after the median year of publication (i.e., 2012). This partition yields samples of similar size and, therefore, comparable uncertainty in the estimates for each sample. We control for the year of publication and for whether or not an article involves authors from different countries (i.e., whether it is a product of international collaboration). These control variables are essential because the number of citations increases and the proportion of papers without citations decreases over time (Nielsen & Andersen, 2021; Seglen, 1992; Waltman, 2016), and because articles involving authors from only one country receive, on average, fewer citations and are more likely to have zero citations (Gomez, Herman, & Parigi, 2022; Narin, Stevens, & Whitlow, 1991; Puuska, Muhonen, & Leino, 2014).
2.5. Summary of Visibility Measures and Control Variables
Table 1 shows descriptive statistics for the two visibility measures and the control variables along with articles’ distribution across the top 10 countries and world regions. These figures are shown according to whether articles report quantitative data and/or methods in their abstracts and for two periods (i.e., 1990–2012 and 2013–2018).
Table 1 confirms that citations increased over time in our sample: the average number of citations rose and the proportion of articles with no citations 3 years after publication fell. Likewise, the proportion of articles reporting linear models displays relatively small increases over time in both samples (i.e., 0.34 to 0.37 and 0.64 to 0.67). As for the control variables, Table 1 suggests increased geographical diversity according to the location of the first author. Although the United States comprises a large fraction of articles in all samples, there is a decrease of approximately 12 percentage points across the two periods in both samples, from roughly 42% to 30%. Likewise, the fraction of articles involving only one country has decreased over time.
To categorize publications into fields of science (OECD, 2007), we use a mapping of OECD macro fields of science to WoS subject classifications. Because the field of science categories are not mutually exclusive, articles can be included in more than one category at a time. We allow this multiple counting in the field-specific analyses (Figures 1 and 2). In contrast, we consider unique articles (single counting) in our country-level analysis and statistical models, which only deal with Social Sciences publications. In addition, we count only distinct countries per article (e.g., if an article has multiple authors from the same country and one author from another country, we consider it a product of two countries), using distinct country addresses per article to distinguish international from single-country publications.
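The distinction between multiple counting across fields and single counting with distinct country addresses can be illustrated with the following sketch on hypothetical records; it is not the processing code used in the study.

```python
# Illustrative sketch: multiple counting across fields vs. single counting of
# unique articles with distinct country addresses.
import pandas as pd

records = pd.DataFrame({
    "article_id": [101, 101, 102, 102, 102],
    "field":      ["Social Sciences", "Humanities", "Social Sciences", "Social Sciences", "Social Sciences"],
    "country":    ["US", "US", "DE", "DE", "FR"],
})

# Multiple counting: an article contributes to every field it is assigned to.
per_field = records.drop_duplicates(["article_id", "field"]).groupby("field").size()

# Distinct countries per article: article 102 involves two countries (DE, FR),
# so it counts once and is flagged as an international collaboration.
distinct = records.drop_duplicates(["article_id", "country"])
n_countries = distinct.groupby("article_id").size()
international = n_countries > 1

print(per_field)
print(international)
```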
3. RESULTS
3.1. Trends in Linear-Model-Based Research in Comparison to Other Methods Across Fields
Figure 1 shows time trends in the prevalence of linear-model-based research for the six macro fields of science using our four different denominators (shades of red and blue) and the distinction between general and specific terms (green lines). This figure also shows aggregated numbers for each macro field, including the fraction of papers reporting methods, the fraction reporting methods or quantitative data, and the total number of articles.
The Humanities display the lowest fraction of articles reporting any method, with only 9% doing so. For the other disciplines, these fractions range from 18% in Agricultural Sciences to 28% in Engineering and Technology. The similarity of the fraction of articles reporting methods between the Natural Sciences and the Social Sciences speaks to the broad scope of our list of terms. Slightly more than one-fifth of articles in both macro fields are selected using this list, and more than one-quarter in the case of Engineering and Technology. These fractions increase substantially when words referring to quantitative data are included, particularly in the Social Sciences, which further supports our assumption and the overall measurement logic.
In absolute terms, more than 7,000 articles reported linear models per year in all fields except for Agricultural Sciences (3,300) and the Humanities (260). Medical and Health Sciences rank first with more than 35,000 linear-model-based articles per year, followed by the Natural Sciences (>15,000), the Social Sciences (>9,000), and Engineering and Technology (>7,500). In relative terms, the sustained high prevalence of linear-model-based research (thick red lines) suggests that these methods are prevalent in four fields: Medical and Health Sciences, Agricultural Sciences, Social Sciences, and the Humanities. The twofold significance of linear-model-based research for the Social Sciences in absolute and relative terms speaks to the influential role of these analysis frameworks for quantitative research in the field.
The Agricultural Sciences, Humanities, and Social Sciences display above 50% prevalence of linear-model-based research over the 31 years of observation among articles reporting any methods. This pattern is robust to excluding articles reporting linear models and other methods simultaneously, particularly in the Social Sciences, where the prevalence of linear-model-based research hovers around 65% for both prevalence measures. These patterns mean that under strict and conservative estimates of linear-model-based research prevalence, approximately two-thirds of the research in Social Sciences has been conducted using the linear models analysis framework. The proportion of articles using general terms in this field (thicker green solid line) suggests that the high prevalence of linear-model-based research is driven mainly by articles reporting general terms instead of specific models. Despite a positive trend over time, the fraction of papers reporting specific models in the Social Sciences (e.g., hierarchical models) is slightly above 10% by the end of the study period. This pattern indicates that the high prevalence of linear-model-based research in Social Sciences has been largely unaffected by the increasing specialization of methods.
The consistency of the temporal trend for the proportion of papers reporting linear models as a fraction of articles reporting quantitative data (dotted red line) and of all articles presumably performing empirical analyses (dotted blue line) suggests that the high prevalence of linear-model-based research among papers reporting any statistical framework in the Social Sciences is unlikely to be driven by the selection of articles according to our list of statistical methods. The lower prevalence of linear-model-based research among articles mentioning any methods and quantitative data may be related to articles that use quantitative sources as secondary or supplementary data without performing a statistical analysis that would be mentioned in the abstract. Mentioning the use of secondary data without mentioning a statistical analysis framework may also explain the large difference between the red solid and dotted lines in the Humanities.
The patterns observed for Agricultural Sciences, Humanities, Medical and Health Sciences, and Social Sciences contrast with the low prevalence of linear-model-based research in Engineering and Technology and the Natural Sciences. Recall that the fraction of papers reporting methods and quantitative data is very similar or higher in these two fields compared to the Social Sciences, which partly accounts for potential field-specific method-reporting practices. In Medical and Health Sciences, more than 80% of the articles reporting methods reported linear models. We interpret this pattern as indicative of the appropriateness of our measurement strategy because medical and health subdisciplines often deal with randomized control trials and clinical trial data, and with research questions that involve identifying causal relationships between medical treatments and individuals’ health or between policy interventions and populations’ wellbeing (Mitra, Roy, & Small, 2022). Linear models are well suited for these types of standardized questions. The observed trends in Engineering and Technology and the Natural Sciences could be driven by what the literature considers “methodological pluralism” (Lamont & Swidler, 2014), as these fields have an inductive approach to science and use a more diverse set of observational and analytical methods than, for example, the Medical and Health Sciences.
Figure 2 shows the scatter plot of the prevalence of linear models as a fraction of papers reporting any methods against the annual average percentage point change in this prevalence from 1990 to 2022 for most subdisciplines across macro fields of science. The right panel zooms into the area of the plot where most Social Sciences subdisciplines cluster. According to this figure, Social Sciences subdisciplines such as Management, Educational Research, Psychology, and Economics cluster in the top right area of the plot, meaning a prevalence of linear-model-based research above 50% that grew from 1990 to 2022.
The clustering in high and growing prevalences of linear-model-based research among Social Sciences subdisciplines is only comparable to that of the Medical and Health Sciences. However, the latter displays a much higher prevalence of linear-model-based research and relatively lower yearly average increase due to ceiling effects. As shown in the right panel of Figure 2, 42 of the 47 Social Sciences subdisciplines, including several with more than 50,000 indexed articles (labeled), display a prevalence of linear-model-based research above 50%; 34 of them display positive temporal trends, meaning that linear-model-based research has grown over the past 31 years. For example, Education & Educational Research displays a linear model prevalence of 70% along with a 0.72 average percentage point yearly increase from 1990 to 2022 (i.e., more than 22 percentage point absolute increase in the prevalence of linear-model-based research during the analysis period).
This panel also shows that only five of the 47 subdisciplines in the Social Sciences display linear model prevalences below 50%: Asian Studies; Cultural Studies; Information Science & Library Science; Operations Research & Management Science; and Psychology, Mathematical. What is specific about these five subdisciplines is a question for future research.
3.2. The Center and Periphery Patterns in Prevalence of Linear-Model-Based Research in the Social Sciences
Figure 3 shows the distribution of Social Sciences articles across countries on the log scale in the top panel and the country-level prevalence of linear-model-based research in the bottom panel. The joint interpretation of these panels sheds light on the potential mechanisms driving the sustained high prevalence of linear-model-based research and allows us to speculate on potential causes for these patterns in light of well-documented geographic power relations in knowledge production (Bhambra & Holmwood, 2021; Castro Torres & Alburez-Gutierrez, 2022; Krause, 2016).
As seen in the top panel of Figure 3, the countries of the Global North (i.e., Northern America, Western and Northern Europe, and Australia) and China contribute the largest share of papers in the quantitative Social Sciences indexed in WoS. The United States is the largest contributor, with 38% of the total papers in our sample (exp(−0.96) on the log scale of Figure 3), followed by the United Kingdom (7.7%), China (5.7%), Germany (4.1%), and Canada (4.1%). The five remaining countries of the top 10 producers (Australia, Spain, the Netherlands, Italy, and France) contribute between roughly 2% and 4% each. Together, these top 10 producers account for almost three-fourths of all articles in our sample (74.7%). These global patterns suggest that our analytical sample is not unusually skewed regarding the geographical origin of articles, as it corresponds closely to the highest-producing countries in all fields as reported by other studies (Castro Torres & Alburez-Gutierrez, 2022; Muthukrishna, Bell et al., 2020).
The country-level pattern of linear-model-based research prevalence (bottom panel) is partially consistent with the center–periphery and global North–South separation of countries, with some exceptions. Center–periphery ties, where the center exerts dominance in setting up research agendas and research methods, are typically observed in the global dynamics of knowledge production (Adame, 2021; Boshoff, 2009; Habel, Eggermont et al., 2014; Haelewaters, Hofmann, & Romero-Olivares, 2021; Krause, 2016). For example, countries with the lowest share of papers in Sub-Saharan Africa (periphery) display the highest linear-model-based research prevalence. The same applies to underrepresented countries such as Bolivia and Cuba in Latin America and the Caribbean, and Uzbekistan, Nepal, and Bangladesh in Asia. These countries are followed by the most prominent producers (i.e., the Global North), which, given their share of articles, drive the overarching trend of linear-model-based research prevalence. The United States is the most prominent case, with 38% of the total papers and 74% reporting linear models among papers reporting any statistical framework. Australia and Scandinavian countries display similar levels of linear-model-based research prevalence to that of the United States, yet none of them is as high as the prevalence observed in Sub-Saharan Africa and some countries in Asia and Latin America. France displays a distinctive pattern among the top 10 producers and Western Europe: relatively large shares of Social Sciences articles (1.8%) and less than 50% prevalence of linear-model-based research (49%). This geographic exception may be related to the development of the French School of statistics and its connection to the Social Sciences produced in this country’s research institutions.
3.3. The Greater Visibility of Linear-Model-Based Research Versus Other Statistical Analysis Frameworks
Figure 4 displays the association between reporting linear models and the odds of receiving at least one citation (left panel) and the average number of citations (right panel) within the 3 years after publication for the top 10 countries separately and for the remaining countries grouped in regions. If positive, these associations indicate visibility and citation premiums to linear-model-based research, respectively. If negative, they indicate the opposite, namely, a visibility and citation penalty compared to research using other methods.
Receiving at least one citation in 3 years is a low-bar benchmark for articles’ visibility, while the average number of citations within the 3 years after publication helps us distinguish articles that received greater attention from the research community. The results are stratified for two periods: 1990–2012 and 2013–2018. The red and green markers show results for articles reporting any method and articles reporting any method or quantitative data, respectively. Our models control for whether the article is from a single country (vs. multiple countries) and for the year of publication.
The patterns in Figure 4 show partial citation premiums for linear-model-based research in the centers of academic production (i.e., countries of the Global North). According to the left panel in Figure 4, from 1990 to 2012 (empty circles), there was a positive association between reporting linear models and receiving at least one citation, particularly for articles from the top 10 producer countries. Exceptions to this pattern include some top countries where the association is positive but the confidence interval contains zero (the United Kingdom, Spain, and Italy) and some regions where the association is negative (Central and Southern Asia, Northern Africa and Western Asia). Notably, the only positive and significant coefficient across regional groups pertains to Europe and Northern America, which speaks to the differential value of analysis frameworks between the center and peripheral countries. From 2013 to 2018 (filled circles), this visibility premium held in half of the top 10 producer countries (the United States, China, Australia, the Netherlands, and France) and weakened elsewhere; only Germany and Italy displayed negative coefficients for 2013–2018, and their confidence intervals include zero. Across regions, virtually all coefficients became negative (except for Sub-Saharan Africa), further highlighting the differential value of methods between top producers (center) and the rest of the world (periphery).
The average number of citations mirrors these patterns. According to the right panel in Figure 4, during the first period of analysis (1990–2012), articles reporting linear models received, on average, more citations than articles reporting other methods (red markers). This is true for all top 10 producers except China and Spain. For example, on average, articles from the United States reporting linear models had 10.7% more citations (coefficient = 0.1) than articles reporting other methods. These patterns of greater visibility for linear-model-based research before 2012 are consistent when the reference group comprises articles reporting other methods or quantitative data (green markers). This consistency means linear-model-based research potentially held a citation premium within quantitative research broadly defined. From 2013 to 2018, the citation premium observed in the right panel reversed among the top 10 producer countries and became more negative among regional groups. All the filled red dots lie on the negative side of the plot in the right panel, and only the confidence intervals for Sub-Saharan Africa include zero. These negative and statistically significant associations mean that during the last 7 years of observation, linear-model-based research was associated with a citation penalty, which signals a potential reversal of its visibility advantage in both the center and the periphery.
To understand the potential drivers of this penalty, we replicated the right panel of Figure 4, excluding the top 10% and 5% of articles in terms of the number of citations within each country and region (see Figure S2 in the Supplementary material). These two replications show no citation penalty for any of the top 10 producer countries. The linear-model-based research citation premium is observed for the United States, China, and Australia when the top 5% and 10% of cited articles are excluded from the sample. The citation penalty held for all regions, although confidence intervals include zero for Europe and Northern America and for Sub-Saharan Africa. These patterns indicate that the reversal of the linear-model-based research premium may be driven by a few articles (produced in the top 10 producer countries) that use other methods and receive a disproportionately large number of citations.
4. DISCUSSION
The prevalence of linear-model-based research was above 50% for most Social Sciences subdisciplines from 1990 to 2022. This prevalence is higher in periphery countries than in the centers of academic production. Moreover, linear-model-based research displayed a citation premium in terms of receiving at least one citation from 1990 to 2018 and in terms of the average number of citations from 1990 to 2012. The citation premium regarding the average number of citations reversed in the last years of analysis (2013–2018). Here we discuss these empirical results further by embedding them in previous discussions in the literature. We provide further context on the potential underlying causes of our results and elaborate on possible consequences.
Linear models are very flexible, as they can accommodate variables of different kinds (e.g., nominal, ordinal, interval, and continuous) and account for grouped observations in the dependent variable (e.g., repeated measures models) and independent variables (e.g., hierarchical or multilevel models). Linear models can be used with multiple link functions and probability distributions (Dobson & Barnett, 2008). These models have grown in complexity and specialization to better accommodate specific data types (e.g., categorical variables, duration, and sequence data) and to account for data features such as nested structures, serial correlations, and complex survey sampling (Cornwell, 2015; Courgeau & Leliévre, 1997; Mood, 2010). Additionally, linear models’ outputs are easy to summarize, present, and interpret, notwithstanding issues of misuse and misinterpretation (Fanelli, 2012; Gelman & Loken, 2014; Leahey, 2005; Sterne, 2001; Vidgen & Yasseri, 2016; Yanai & Lercher, 2020). This flexibility makes them appealing for a wide variety of scientific endeavors across fields of science, from data description to causal analysis.
However, it is hard to argue that linear models are suitable for answering all research questions within a field, for example, the Social Sciences (Buyalskaya et al., 2021; Grigoropoulou & Small, 2022; Leahey, 2005, 2008; Schwemmer & Wieczorek, 2020). According to previous research, other methods have been less successful than linear models in the Social Sciences in terms of the number of publications (Koppman & Leahey, 2019). Research has also shown that the greater success of linear models in the Social Sciences cannot be fully explained by their flexibility and suitability (Leahey, 2005). Historical contingencies and power dynamics among individuals and institutions have played a role in the differential success of linear models versus other methods (Camic & Xie, 1994; Leahey, 2005; Lebaron, 2000; Porter, 1995). As a result, several contemporary assessments of social science research suggest that a potential hegemony of linear-model-based research could marginalize specific questions and worldviews in disciplines such as economics, sociology, and demography (Lebaron, 2003; Sigle, 2021; Zuberi & Bonilla-Silva, 2008).
According to these studies, there are not enough discussions about the assumptions underlying the representation of the social world in the form of a linear equation (Abbott, 1988; Leahey, 2005). These representations rely on assumptions regarding the nature of the relations among variables (Abbott, 1988; Bourdieu, 1996; Zuberi & Bonilla-Silva, 2008), the imperative to “control for,” standardize, or reduce data structure to identify pure effects (Gollac, 2004; Williams, 2019), and a particular understanding of causality as measurable only via quantitatively defined counterfactuals. These counterfactuals are typically obtained using technical procedures such as randomization, matching on observable characteristics, double differentiation, and instrumental variables (Smith, 2013).
These three essential features of linear-model-based research—centered on variables, concerned with pure effects or neat associations, and conceptualizing causal relations in terms of counterfactuals—align with a concrete understanding of theory as a set of testable propositions regarding the relations between outcomes and explanatory factors. According to Abend (2008), this is a valuable and legitimate definition of theory, which he terms “theory1,” but it is not the only one. There are at least six other uses of the word theory in the Social Sciences, labeled by him as “theory2” through “theory7.” Apart from Abend’s “theory1,” these other understandings of theory are largely incompatible with the linear model framework. This incompatibility suggests that the hegemony of linear-model-based research may marginalize specific theories, worldviews, and narratives about the social world, at least in the quantitative Social Sciences.
Explicit critiques in this direction have been put forward since the late 1970s in different subdisciplines, including sociology, demography, economics, philosophy, and political science (Abbott, 1988; Andler, Fagot-Largeault, & Saint-Sernin, 2002; Camic & Xie, 1994; Cornwell, 2015; Héran, 2006; Leahey, 2005; Lebaron, 2000; Porter, 1995; Sigle, 2021). Pierre Bourdieu expressed it this way:
The particular relations between a dependent variable (such as political opinion) and so-called independent variables such as sex, age and religion, or even educational level, income and occupation tend to mask the complete system of relationships which constitutes the true principle of the specific strength and form of the effects registered in any particular correlation. The most independent of ‘independent’ variables conceals a whole network of statistical relations which are present, implicitly, in its relationship with any given opinion or practice (Bourdieu, 1996, p. 103 [1979]).
5. CONCLUSION
The first part of our analysis validates our measurement approach. In all macro fields except the Agricultural Sciences and the Humanities, we detect at least 20% of articles reporting methods in their abstracts. Using these analytical samples, we document substantial cross-field heterogeneity in the prevalence of linear-model-based research. In absolute terms, the average number of articles reporting linear models ranged from a few hundred per year in the Humanities to more than 35,000 in Medical and Health Sciences. The Social Sciences ranked third, with approximately 9,000 articles per year. In relative terms, the prevalence of linear-model-based research ranged from low levels in Engineering and Technology (20%) to very high levels in Medical and Health Sciences (80%). The Social Sciences ranked second, with linear-model-based research prevalences hovering around 65% for the entire analysis period.
The second part of our analysis suggests that in the Social Sciences, linear-model-based research could be hegemonic. We conclude this based on three confluent patterns: the high, sustained, and growing prevalence of linear-model-based research over time in Social Sciences in general and across its subdisciplines; geographical patterns in linear-model-based research prevalence that are consistent with global inequalities in knowledge production; and the existence of a citation premium that has favored linear-model-based research for the entire period in terms of having at least one citation (i.e., avoiding invisibility) and at least until 2012 for the average number of citations.
Although our analysis cannot establish causal relationships among the processes underlying these patterns, the salience of our results allows us to discuss their potential drivers and implications. We rely on previous in-depth, single-discipline analyses (Koppman & Leahey, 2019; Leahey, 2005, 2008) and historical accounts of the development of scholarly traditions (Camic & Xie, 1994; Porter, 1995; Robson & Sanders, 2009, Chapter 2; Rouanet & Lépine, 1976) to inform our discussion and suggest future research areas.
The literature has discussed the limitations of linear model approaches and the need to consider their presumptions (Abbott, 1988; Johnson-Hanks, Bachrach et al., 2011; Leahey, 2005; Sigle, 2021; Zuberi & Bonilla-Silva, 2008). These limitations would be less of a concern if linear models were used no more than other methods. However, discipline-specific studies have shown how linear modeling and hypothesis testing became normative in journals and fields such as economics, demography, sociology, psychology, and political science (Fanelli, 2012; Hirschman, 1994; Leahey, 2005; Lebaron, 2000; Ollion, 2011; Porter, 1995; Sigle, 2021). Our study extends these results to four macro fields of science and, in more depth, to 47 subdisciplines in contemporary quantitative Social Sciences by documenting a sustained, high, and growing prevalence of linear-model-based research over the past 31 years. This prevalence suggests that we may be neglecting perspectives and approaches that do not conform to this analytical framework.
Studies on the use of quantitative methods in Sociology have examined the institutional and individual-level determinants of authors’ use of alternative quantitative methods (Koppman & Leahey, 2019). According to these studies, conforming with mainstream analysis methods may be more beneficial for individual careers because existing institutional mechanisms reward this type of conformism. Authors with academic authority, seniority, prestige, and institutional credentials can afford the risk of exploring a nonmainstream perspective. And yet, their success in spreading the use of a given method is not always guaranteed, as shown by the cases of correspondence analysis and qualitative comparative analysis (Koppman & Leahey, 2019). We cannot examine the institutional and individual-level determinants in all 47 subdisciplines we study; however, the similarity between our results and those of previous in-depth single-discipline studies suggests that analogous mechanisms may be at play across these Social Sciences subdisciplines.
The hegemony of linear models in terms of use, country-level distribution, and citation patterns could be consequential in several ways. First, hegemony could preclude the emergence, widespread use, and extension of potentially path-breaking, atypical, or disruptive approaches that could better address current societal problems (e.g., rising social inequality, climate change, increased vulnerability of minorities). Second, global inequalities in knowledge production may be reinforced by the lack of methodological pluralism, as the capacity to develop new methods and bring them to the forefront of research is not evenly distributed across countries and institutions. Just as individuals with greater credentials and prestige can afford the risk of path-breaking analysis, countries and regions in privileged positions (Zuckerman, 1970) (e.g., highly funded institutions) can lead research under alternative (risky) approaches (Guetzkow, Lamont, & Mallard, 2004; Hofstra, Kulkarni et al., 2020; Lamont & Swidler, 2014; Laudel & Gläser, 2014; Luukkonen, 2012) without strategic concerns for publication or research evaluation (Akbaritabar, Bravo, & Squazzoni, 2021; Rijcke, Wouters et al., 2016). Third, to the extent that methodological pluralism is not promoted in research training programs and institutions, it may take decades to emerge because, although the visibility premium has reversed, many generations of researchers and instructors were trained in a context that privileges linear-model-based research.
Our study has several limitations beyond the relatively short temporal scope of our data (1990 to 2022). First, we assume that methods reported in abstracts are the papers’ analysis framework. While this is likely the case for most empirical studies, our assumption may be incorrect if articles refer to methods for other reasons. Similarly, we miss articles that do not report methods in their abstract, which directly affects our prevalence measures.
Based on research on the reporting of study countries in titles and abstracts, we find it reasonable to assume that mainstream methods (like mainstream study countries) are less likely to be reported in abstracts because of their widespread use. Research has shown that when researchers study so-called default cases, such as the United States, they are less likely to report the country of study in titles and abstracts (Castro Torres & Alburez-Gutierrez, 2022; Kahalon et al., 2021). We assume the same could apply to linear-model-based research, as it appears to be the default analysis framework according to the cited literature. If this assumption holds, our analysis underestimates the prevalence of linear-model-based research and overestimates the use of other methods. To address this limitation at least partially, and for a scientific field familiar to the authors (i.e., demography), we used full-text information from 719 Open Access research articles published from 2011 to 2022 in Demography, the flagship journal of the Population Association of America. We limited these full-text articles to documents with more than 20 paragraphs and excluded comments, notes, and letters, leading to the inclusion of 708 articles. We found evidence that favors our assumption: 91% of these papers mention linear models in their full text, versus 47% that mention other methods. Further, we computed the weighted proportion of papers that report methods in their abstracts, using the number of times that methods are mentioned in the full text as weights and focusing on papers that mention methods at least five times (626 papers). These proportions were 36.6% for linear models and 41.2% for other methods, which indicates that, at least in Demography, as one illustrative example of quantitative social science research, linear models are less likely than other methods to be reported in abstracts. Our weighting and sample selection strategies give more importance to papers that mention methods several times, which likely correlates with the substantive relevance of the method for the paper. Future research using full text from larger samples should test the validity of this assumption for quantitative research in other disciplines.
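The weighted proportion used in this robustness check can be illustrated with the short sketch below; the data are simulated and the column names hypothetical, serving only to show the calculation.

```python
# Illustrative sketch: weighted share of papers reporting a method family in the
# abstract, weighting each paper by how often that family is mentioned in its full text.
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 626   # papers mentioning methods at least five times in the full text
papers = pd.DataFrame({
    "mentions_in_full_text": rng.integers(5, 40, n),    # full-text mention counts (weights)
    "reported_in_abstract": rng.integers(0, 2, n),       # 1 if the method family appears in the abstract
})

weighted_share = np.average(papers["reported_in_abstract"],
                            weights=papers["mentions_in_full_text"])
print(round(weighted_share, 3))
```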
Continuing with limitations, we cannot evaluate critical aspects such as potential biases in the methods taught in graduate training programs, editors’ and reviewers’ potential preferences for certain types of analysis over others, or the concrete reasons why researchers across different contexts decide to use a given method and influence others’ decisions (Lane, Teplitskiy et al., 2022; Merton, 1968). However, the magnitude of our results and the high prevalence of general terms for referring to analysis frameworks suggest that these factors are unlikely to overturn our conclusions.
Finally, more comprehensive text analysis methods could identify noun-phrase clauses or word combinations that our list might have overlooked (e.g., by selecting an anchor term and finding adjectives paired with it), and full-text analysis could yield more comprehensive analytical samples of articles and statistical methods. Nevertheless, we opted for the most straightforward and strict search method, aiming to estimate a lower bound of the prevalence of these analytical frameworks. Using more complex text analysis methods and full-text databases would likely yield even higher estimates of the prevalence of linear-model-based research; we encourage future research in this direction.
ACKNOWLEDGMENTS
We thank our colleagues Diego Alburez-Gutierrez, Enrique Acosta, Maarten Jacob Bijlsma, Beatriz Sofia Gil, Esther Dorothea Denecke, and Robert Gordon Rinderknecht, who reviewed and complemented the list of terms we have used here. We would also like to thank Diego Alburez-Gutierrez and Misha Teplitskiy for their helpful comments on the draft of our manuscript. We are thankful to the Nordic Network for the Science of Science for providing us with funding to present and receive feedback on our work at the Diversity in Science workshop at the University of Helsinki. We thank Monica Alexander for publicly sharing her workshop materials analyzing the full text of Open Access publications in Demography that we used for a robustness analysis comparing abstracts with full text.
AUTHOR CONTRIBUTIONS
Andres F. Castro Torres: Conceptualization, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing—original draft, Writing—review & editing. Aliakbar Akbaritabar: Conceptualization, Data curation, Investigation, Software, Validation, Writing—original draft, Writing—review & editing.
COMPETING INTERESTS
The authors have no competing interests.
FUNDING INFORMATION
This study has received access to the bibliometric data through the project “Kompetenznetzwerk” and we acknowledge their funder Bundesministerium für Bildung und Forschung (grant number 16WIK2101A). AF acknowledges funding from the European Research Council (ERC-2020-STG-948557-MINEQ).
DATA AVAILABILITY
We use data from the German Competence Network for Bibliometrics (grant number 16WIK2101A). Restrictions apply to the availability of these data, which were used under license for the current study and are not publicly available. The replication data and scripts necessary to recreate our analysis and results are available upon request from the corresponding author.
REFERENCES
Author notes
Handling Editor: Vincent Larivière