As “big and broad” social media data continues to expand and become a more prevalent source for research, much remains to be understood about its epistemological and methodological implications. Drawing on an original data set of 12,732 research articles using social media data, we employ a novel dictionary-based approach to map the use of methods. Specifically, our approach draws on a combination of manual coding and embedding-enhanced query expansion. We cluster journals in groups of densely connected research communities to investigate how heterogeneous these groups are in terms of the methods used. First, our results indicate that research in this domain is largely organized by methods. Some communities tend to have a monomethod culture, and others combine methods in novel ways. Comparing practices across communities, we observe that computational methods have penetrated many research areas but not the research space surrounding ethnography. Second, we identify two core axes of variation—social sciences vs. computer science and methodological individualism vs. relationalism—that organize the domain as a whole, suggesting new methodological divisions and debates.

Over the past decade, the advent of “big and broad” (Housley, Procter et al., 2014) social data from various digital sources has sparked discussions on how research is gradually being (re-)organized across a number of scientific disciplines, including the emergence of new interdisciplinary fields, such as computational social science and digital humanities (Kitchin, 2014; Lazer, Pentland et al., 2009). Although big social data writ large is thus the subject of richly textured epistemological debates, there have been only a few empirical investigations into research fields and domains affected by applications of such novel data sources (see Mohammadi and Karami (2020) for one exception). To date, much therefore remains to be understood about how large-scale digital social data and related computational methods rearrange contemporary research domains, as well as what this implies for methodological developments in the digital age.

This paper conducts an empirical study of an important research domain characterized by the influx of big social data: social media data-based research (i.e., research that uses social media-derived data sets to make knowledge claims). As social media data accommodates research interests and expertise from diverse disciplines, from media studies to computer science, it accordingly opens up new possibilities for research using various kinds of methods and methodologies (e.g., Agarwal, Xie et al., 2011; Bail, Argyle et al., 2018; Golbeck & Hansen, 2011; Iosifidis & Nicoli, 2020; Krieg, Berning, & Hardon, 2017; Tanzil, Hoiles, & Krishnamurthy, 2017; van Vliet, Törnberg, & Uitermark, 2020; Yuan, Feng, & Liu, 2019). In this light, investigating how—and the extent to which—the domain of social media data-based research is (re-)organized by methods, and according to what underlying principles of division and affinity, provides a suitably diverse case to understand big social data developments more broadly.

Prior work in science studies shows that methods play a pivotal role in scientific knowledge production and communication (Merton, 1973; Nielsen, 2013). To facilitate persuasion in scientific practices, researchers, as standard, have to demonstrate the correctness and soundness of their work based on commonly accepted methods. With the advent of the digital “data deluge,” some observers claim a revolutionary potential in methodological developments (Bartlett, Lewis et al., 2018). Against this backdrop, empirically studying the emerging patterns of method use, communication, and community formation in social media data-based research can help advance ongoing methodological discussions in ways fitting for quantitative science studies.

The present article seeks to lay the groundwork for such investigations, applying natural language processing (NLP) and network analysis tools to conduct novel, large-scale content and network analyses of bibliometric data. Our analysis builds on an original, carefully vetted data set containing 12,732 research articles published between 2005 and 2019 that use social media traces as core data sources for their knowledge claims. Relative to existing scientometric studies focused mainly on “big data” in specific disciplines (e.g., Becerra & Ratovicius, 2022; Karaboğa, Karaboğa, & Şehitoğlu, 2020; Liang & Liu, 2018; Soleimani-Roozbahani, Rajabzadeh Ghatari, & Radfar, 2019), this data set affords us a more encompassing, interdisciplinary, and granular analytical view into the restructuring of scientific work occurring alongside the advent of big social data.

In technical terms, investigating research methods in textual data requires identifying relevant terms in such a way that our dictionary is as comprehensive across disciplines and over time as possible. To do so, we start by manually labeling method terms from randomly selected samples across disciplines, on the one hand, and expanding the term list in the embedding space, on the other (Tulkens, Hilte et al., 2016). To shed light on the patterns underlying the divergence of research communities in the studied domain, we use clusters detected in a journal citation space (Carpenter & Narin, 1973; Wang & Bowers, 2016). By employing a custom-built method dictionary along with the journal citation clusters, we provide fresh insights into the patterns of methods used to analyze social media-derived data sets. In particular, our analysis reveals that cluster formation is driven to a large extent by method affinities, yet in two distinct ways: based respectively on sharing one type of method or on combining two or more in characteristic ways.

Moreover, to understand the overarching logic of method variation in the domain, we calculate the association between methods—word similarity in the embedding space—indicating that methods coshape three relatively distinct groups: social sciences, computer science, and statistics. Comparing these groups, we find a strikingly low association between the computer science group and a subset of the social science group characterized by ethnographic methods. Furthermore, via principal component analysis (PCA), we find two main axes of variation, which we label as social sciences vs. computer science and methodological individualism vs. relationalism. Overall, we argue, these analyses offer a more fine-grained mapping of the big social data method landscape than previous research, as well as suggesting ways in which this domain is gradually moving beyond a simple qualitative-quantitative dichotomy in the direction of novel method affinities (cf. Kang & Evans, 2020; Schwemmer & Wieczorek, 2020).

We review related work in two main directions, also drawing connections to science studies discussions and concepts. First, we introduce our studied domain and how it relates to the wider discussions of big social data. Second, we discuss some purported methodological implications of the advent of big social data, particularly with respect to established divides between qualitative and quantitative methods in social research.

2.1. Social Media Data and Big Social Data Developments

The domain of social media data-based research in many ways indexes the advent of big social data across multiple scientific disciplines, given that big social data is largely taken to mean social media data (Housley et al., 2014; Jussila, Menon et al., 2017). Although, importantly, not all social media data can be categorized as “big” or “broad” in scope, the analysis of large volumes of digital traces harvested from social media platforms (e.g., Twitter, Facebook, and Instagram) is by now a pertinent practice in many disciplines and for many purposes (e.g., Agarwal et al., 2011; Bail et al., 2018; Golbeck & Hansen, 2011; Iosifidis & Nicoli, 2020; Krieg et al., 2017; Tanzil et al., 2017; van Vliet et al., 2020; Yuan et al., 2019). Although these are not coextensive terms, it is fair to say that most discussions of big social data hinge on what happens with social media data in and across (inter-)disciplines and emerging communities of research practice.

In observing these developments, commentators widely claim that social media data has rich potential to extend social research in new directions (Edelmann, Wolff et al., 2020; Halford, Weal et al., 2017; Parks, 2014). In particular, such data provides a new source of social information about both individual-level human behavior and large-scale collective phenomena, thus being also “broad” in the sense of encompassing populations in ways not previously feasible (Housley et al., 2014). Accordingly, social media data supports important investigations of culture, society, and health, as well as the rapid growth of many areas of computing (Olteanu, Castillo et al., 2019). In providing access to large and unstructured data sets, social media data is thus central to new developments in computational methods, whereby scholars trained in computer sciences are entering into new relations to social research (Agarwal et al., 2011; Lazer et al., 2009; Rohani, Shayaa, & Babanejaddehaki, 2016).

In response, a range of researchers nowadays interrogate the way social media data is used to produce and justify knowledge, including by drawing on insights from science studies (Halford et al., 2017; Kitchin, 2014; Olteanu et al., 2019; Venturini, Bounegru et al., 2018). Here, consideration is given to methodological principles, praxis, and legitimate interpretation when using social media data, and big social data more generally, to make knowledge claims in the social sciences. Countering certain earlier and arguably overenthusiastic proclamations of a “data revolution,” more scholars nowadays argue that big social data studies require a concurrent revival of theories and qualitative methods to construct meaningful measures and produce reliable social research (Grigoropoulou & Small, 2022; Lazer, Hargittai et al., 2021).

As indexed by these still open-ended epistemological debates, our starting point in this article is that the data source of social media is itself flexible in ways that coshape but do not determine choices of method or epistemology for researchers (Kitchin, 2014). In the science studies literature, researchers have employed the concept of “theory-method packages” to highlight the interlocking of theories, methods, and empirics in research (Clarke & Star, 2008; Silvast & Virtanen, 2023). On this view, methods and theories jointly manifest and coconstitute actual research practices: They incorporate implicit assumptions and concrete techniques, and they frame empirical phenomena in ways that create “doable problems” for researchers, rather than methods simply serving theories. Extending the concept of “theory-method packages” to the unfolding big social data developments, we hypothesize that researchers hailing from different scientific disciplines and traditions will utilize social media data in particular ways that combine abstract and concrete techniques and methods, as they carve out interesting scientific problems to tackle. In the process, novel communities of researchers are likely to emerge around new “data-method packages.” So far, little is known empirically about how researchers do this: that is, how diverse methods are being mobilized and combined by researchers from different disciplines to analyze social media data. Moreover, we know little about how this potentially (re-)organizes relations between methods more generally, not least in the social sciences.

2.2. Methodological Implications Beyond the Qualitative-Quantitative Divide?

Discussions of the potentially revolutionary impact of the “data deluge” have been intense, particularly for social science research (Bartlett et al., 2018; Kitchin, 2014). On the one hand, it has been argued that we might witness the creation of a “new digital divide”—between the data-rich and the data-poor of our time—or even a “colonization” of the social sciences by engineers, as social research is transformed through improved inferences with large-scale data and computational methods (Boyd & Crawford, 2012; McFarland, Lewis, & Goldberg, 2016). Advocates and commentators (Anderson, 2008; Mayer-Schönberger & Cukier, 2013; Uprichard & Carrigan, 2015) provocatively declare the possibility of a methodological revolution, or even a “methodological genocide” that marginalizes and ultimately eliminates traditional research strategies under the prominence of big social data.

On the other hand, it has been claimed that such developments have fostered emerging interdisciplinary fields that conjoin domain and method expertise in new ways, such as computational social science and digital humanities, as well as mixed methodologies that seek to enhance methodological synthesis, synergy, and diversity (Cowls & Schroeder, 2015; Isfeldt, Enggaard et al., 2022; Kitchin, 2014). In this regard, researchers believe that big social data developments could weaken the longstanding qualitative-quantitative methodological divide in social research, which hinders cooperation and collaboration and limits the potential for discoveries and insights (Kang & Evans, 2020).

However, as recent scientometrically informed research reveals, the divide between qualitative and quantitative methodologies still manifests in many fields, in which different subfields maintain distinctive objects, approaches, evaluations, and publication outlets. A recent Quantitative Science Studies analysis of qualitative and quantitative science studies from 1988 to 2019 suggests that these two subfields not only represent different research objects and approaches—ontologies and epistemologies—but also manifest opposed evaluations of the same objects and approaches (Kang & Evans, 2020). Relatedly, based on 8,737 publications in sociology journals, researchers have identified an increasing divide between qualitative and quantitative methods, finding that publication outlets deepen and enforce such entrenchments between paradigms (Schwemmer & Wieczorek, 2020). Finally, based on a survey of 1,171 Danish researchers (Johansson, Grønvad, & Budtz Pedersen, 2020), scholars find a clear distinction between qualitative and quantitative research styles in most humanities disciplines, including psychology and linguistics.

Indicative as such studies may be, it is hard to draw conclusions on their basis when it comes to our core question: how big social data may be (re-)organizing relations between method communities. This is because, as commentators argue, big social data developments might open up another new methodological divide in our time—primarily between disciplines. For example, Evans and Aceves (2016) argue that much text-mining work has been produced not by sociologists but rather by computer and information scientists committed to building new tools. Likewise, Wallach (2018) maintains that social scientists and computer scientists have different goals—explanations and predictions, respectively—leading to different modeling approaches. Even so, she argues that methodological debates should ideally be the products of joint deliberation involving both sides.

In sum, our study is motivated by this somewhat unclear situation, in which there is little consensus between different propositions on how to interpret the implications of the advent of big social data, neither in principle nor from existing empirical research. Studying the domain of social media data-based research via bibliometric techniques, we suggest, will provide a firmer picture of how and the extent to which methods and new method combinations provide structure to this site of big social data. We further offer a new organizing principle and calculate the association between methods, based on a novel quantification—word embedding—of these method terms.

We describe our methodology in three parts: data collection, method dictionary, and network construction. In particular, we propose a novel framework for building a new method dictionary to detect the method-related terms from the texts. The basic workflow, as presented in Figure 1A, is that we use the Word2Vec model (Mikolov, Sutskever et al., 2013) to expand the manually validated method terms—a manually annotated method list (i.e., method list) and an additional list of method terms mined from online sources (i.e., additional list)—using similarity measures based on word embeddings (Tulkens et al., 2016).

Figure 1.

Workflow for building a method dictionary. (A) The basic workflow comprises manual annotation, data mining, Word2vec model training, and query expansion. (B) Examples of the query expansion step, where nodes in the center represent search seeds (i.e., “content-analysis” and “machine-learning”) and five tokens in the periphery are candidates presented to human coders. The edge width indicates the similarity between word embeddings.

3.1. Data Collection

Our strategy for investigating research methods relied on bibliometric data from titles, keywords, and abstracts. We obtained a data set of 12,732 articles from the Web of Science (WoS) published between 2005 and 2019 in 2,957 journals across 162 Web of Science Categories (WC) containing research using social media data. Although our data set has the evident limitation of excluding relevant conference proceedings, particularly those from computer science and information science, our analysis specifically focuses on what we defined as method-level patterns rather than the techniques or algorithms used (see Section 3.2 for details), and thus the main findings would most likely remain consistent if those conference proceedings were included. Additionally, we used the OECD Category1 to map the WCs to six broad field categories. Table 1 shows the number of articles retained in our analysis from each OECD Category. More details about our search strings are included in the Supplementary material.

Table 1.

Data involved in the analysis. All data collection ended in March 2020

OECD Category                  Record start from   With DOIs   With abstracts   With DOIs and abstracts
Engineering and Technology     2005                3,777       3,986            3,768
Natural sciences               2006                923         958              923
Social sciences                2005                5,398       5,795            5,380
Medical and Health Sciences    2007                1,332       1,354            1,325
Humanities                     2006                487         544              479
Agricultural sciences          2010                34          39               34
Total                          –                   11,951      12,675           11,909

3.2. Method Dictionary

As shown in Figure 1A, we first manually annotated an original method list and scraped an additional list from multiple online sources. We then trained an embedding model on our corpus, which consists of 11,909 articles with DOIs and abstracts, resulting in a vocabulary of approximately 60,000 words. See the Supplementary material for a more detailed description of our manual annotation, data mining, and Word2vec model training.

Using the trained word vectors, we expanded our lists of method terms based on the vector similarity in the embedding space. Specifically, we used the cosine similarity to compute the association between the terms on the method lists and all the terms in the corpus, which measures the cosine of the angle between two vectors (Schnabel, Labutov et al., 2015).
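As a concrete illustration, the similarity measure amounts to a few lines of NumPy; this is a generic sketch of cosine similarity, not the authors' code, and the toy vectors stand in for learned word embeddings:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy three-dimensional vectors standing in for learned word embeddings.
u = np.array([1.0, 0.0, 1.0])
v = np.array([1.0, 1.0, 0.0])
print(round(cosine_similarity(u, v), 3))  # 0.5
```

Identical directions yield a similarity of 1, orthogonal directions 0, so ranking the vocabulary by this score surfaces the terms used in the most similar contexts to a seed term.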

In the next step, we ranked all terms and presented the five instances with the largest cosine similarity to human coders, who qualitatively evaluated the suggested instances, accepting or rejecting them as members of the category (i.e., method-related terms). This procedure efficiently expands the queries by searching through the embedding space. It also greatly reduces the manual coding and validation effort, as annotators only need to accept or reject the new instances. Figure 1B displays the five best proposals for two terms, “content-analysis” and “machine-learning.” In this example, all terms except “netlytic” were added to the expanded queries, as “netlytic” is a social media analysis product, not a method. We repeated the same procedure for all terms on the list, resulting in a list of 281 method-related terms.
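The expansion step can be sketched as follows; `expand_query` and the toy vocabulary are illustrative stand-ins for the trained Word2Vec vocabulary, under the assumption that each term maps to a single embedding vector:

```python
import numpy as np

def expand_query(seed, vectors, k=5):
    """Rank all other vocabulary terms by cosine similarity to a seed
    term and return the top-k candidates for human review."""
    def cos(u, v):
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    seed_vec = vectors[seed]
    scores = {w: cos(seed_vec, vec) for w, vec in vectors.items() if w != seed}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy embedding space; in the study, vectors come from a Word2Vec
# model trained on titles, keywords, and abstracts.
rng = np.random.default_rng(42)
vectors = {w: rng.normal(size=8) for w in
           ["machine-learning", "deep-learning", "classification",
            "netlytic", "content-analysis", "interview"]}
candidates = expand_query("machine-learning", vectors, k=3)
# A human coder now accepts method terms and rejects non-methods
# (e.g., the tool name "netlytic").
```

In practice the same ranking is available directly from a trained gensim model via its `most_similar` query, which returns the nearest vocabulary terms by cosine similarity.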

After acquiring relevant terms, we performed a manual classification to identify the method-level terms. Here, we encountered two major challenges: (a) researchers tend to describe the same concept at different levels of granularity, such as quantitative analysis, topic modeling, and Latent Dirichlet Allocation; and (b) it is challenging to define research methods in a general way from a highly diverse and interdisciplinary data set. Machine learning (ML) is a case in point: It is a very common method-related term in our data set, yet many practitioners in computer science think of ML as engineering rather than a research agenda, and the contribution of an ML paper may itself be a new method or technique. Therefore, instead of seeking a universal structure for these terms, as presented in the research onion model (Saunders, Lewis, & Thornhill, 2019), we generated a domain-specific, hierarchical structure based on our method list.

We inductively classified the method-related terms into three levels: algorithm or technique at the lowest level, method in the middle, and methodology or approach at the top. Although words denoting these categories are open to many interpretations, to make the classification goal in this analysis as straightforward as possible, we defined them as follows: A technique refers to instructions for solving specific problems or performing computation; a method is defined as the procedures and guidelines used in conducting research and tackling the research problem; and a methodology is a structured set of guidelines, informed by broader epistemological commitments, that covers various methods (Jamshed, 2014; Mingers, 2001).

For instance, in the context of statistics-related terms, we consider “anova” (analysis of variance) as a type of parametric inferential statistical method, but “inferential-statistics” is an approach that includes methods such as “anova.” For ML-related terms, we consider “neural-networks” as a method, and “machine-learning” as a methodology that incorporates “neural-networks,” “regression,” “classification,” “clustering,” and other methods. In contrast, terms like “adaboost” and “admm” are algorithms that seek to solve some specific problems when using a method under “machine-learning” methodology. After the manual classification, we normalized the method list by selecting the most basic terms. For example, the terms “data-mining,” “text-mining,” and “mining” were grouped under the shortest term “mining.” This merging process resulted in a more concise and comparable method list, containing 106 method-level terms (see Table S1 in the Supplementary material for the complete list of methods).
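The merging step can be sketched as a simple substring-based normalization; this is an illustrative reconstruction of the rule described above, not the authors' exact procedure:

```python
def normalize_terms(terms):
    """Map each variant term to the most basic (shortest) term in the
    list that occurs inside it, e.g., "data-mining" -> "mining"."""
    bases = sorted(terms, key=len)  # shortest candidates first
    mapping = {}
    for term in terms:
        for base in bases:
            if base in term:  # first match is the shortest base
                mapping[term] = base
                break
    return mapping

mapping = normalize_terms(["data-mining", "text-mining", "mining"])
# Every variant collapses onto the shortest term, "mining".
```

Applying such a rule (with manual checks for false merges) yields the concise 106-term method list reported above.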

3.3. Network Construction

To identify the method patterns in the research communities, we constructed a relational space using the citation links between periodicals. As shown in Table 1, the number of publications across the predefined journal-based field categories—OECD fields—is far from evenly distributed. Therefore, we employed a relational space to partition the studied corpus at a higher resolution. We chose to cluster at the aggregated journal level rather than the individual paper level, as the Web of Science Categories (WC), along with the OECD fields, are also defined at the journal level. Additionally, previous research (Havey & Chang, 2022) has shown that journals have distinct methodological preferences, making them a suitable unit of analysis.

The partitioning of citation-based networks into communities has been a major area of interest in science mapping literature (Boyack & Klavans, 2010). In network science, communities are locally densely connected subgraphs in the networks (Newman, 2006). The presence of communities significantly impacts a variety of a system’s dynamics and structural properties (Lambiotte & Panzarasa, 2009). Hence, there has been a collective interdisciplinary effort to develop mathematical tools and computational algorithms to identify communities in the networks. In this article, we adopted a flow-based algorithm, Infomap (Rosvall, Axelsson, & Bergstrom, 2009), as we considered it to capture citation flows naturally and hence to be well-suited for bibliometric analysis (Bohlin, Edler et al., 2014). We also note that previous bibliometric research has shown the map equation algorithm to outperform other algorithms (e.g., the modularity maximization-based algorithms) in clustering citation networks, as assessed by expert evaluation, despite being undervalued by the bibliometric community (Šubelj, van Eck, & Waltman, 2016)2.

To construct the relational space, we built an undirected journal citation network by linking two journals if they contain publications that cite each other3. Specifically, we first identified the citation links between papers within our corpus, connecting two papers if one cited the other and both were included in our data set of research using social media data. We then used these paper citation links to connect journals. The link strength is based on the undirected within-set paper citation links—the number of times that papers from two journals cite each other. Next, we applied the disparity filter (Serrano, Boguñá, & Vespignani, 2009) to extract the backbone structure of the graph, reducing a link density that would otherwise be too high for community detection. After filtering, we applied Infomap to detect clusters. The resulting modularity score is around 0.51, indicating that the division of the network does form densely connected subgraphs (Newman, 2006). Detailed network topological quantities are shown in Table S2 in the Supplementary material. All analyses presented in this study were performed with Python packages and functions.
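The pipeline above can be sketched with networkx. Note that this disparity-filter implementation and the toy journal graph are illustrative, and greedy modularity clustering is used here as a self-contained stand-in for Infomap (which requires the separate infomap package):

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities, modularity

def disparity_backbone(G, alpha=0.5):
    """Disparity filter (Serrano et al., 2009): keep an edge if its
    normalized weight is significant for at least one endpoint."""
    keep = []
    for u, v, w in G.edges(data="weight"):
        for node in (u, v):
            k = G.degree(node)
            if k <= 1:  # always keep edges of degree-1 nodes
                keep.append((u, v))
                break
            p = w / G.degree(node, weight="weight")  # normalized weight
            if (1 - p) ** (k - 1) < alpha:           # significance test
                keep.append((u, v))
                break
    return G.edge_subgraph(keep).copy()

# Toy journal citation graph: two dense citation clusters joined by a
# single weak link (journal names are purely illustrative).
G = nx.Graph()
G.add_weighted_edges_from([
    ("J1", "J2", 10), ("J2", "J3", 12), ("J1", "J3", 9),
    ("J4", "J5", 11), ("J5", "J6", 12), ("J4", "J6", 12),
    ("J3", "J4", 1),  # weak cross-cluster citation link
])
backbone = disparity_backbone(G, alpha=0.5)
clusters = greedy_modularity_communities(backbone)
Q = modularity(backbone, clusters)
```

On this toy graph the weak bridge is filtered out and the two dense citation groups emerge as separate clusters with high modularity, mirroring the backbone-then-cluster sequence used in the study.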

4.1. The Most Prevalent Methods to Analyze Social-Media-Derived Data Sets

Before turning to the relational and embedding spaces, we present the methods identified in the studied domain for analyzing social media-derived data sets. Our detection finds method-related terms for 9,088 articles within all 11,909 texts, including method terms for 7,183 articles4. For each article, our algorithm may recognize more than one method-related term. Based on the frequency of terms, we show the distribution of the most common methods in Figure 2A. The term “experiment” has been used in more than 2,000 articles. Examining the co-occurrence of “experiment” and related tokens, we find that “experiment” is used in many scenarios, from natural experimental settings to experiments carried out on social media to artificial experiments with computer simulation. Moreover, researchers in this landscape frequently use terms such as “mining,” “classification,” “cluster,” “neural-network,” and “deep-learning.” This may reflect the uptake of computer science and, in particular, machine learning methods. Indeed, computer science techniques enable researchers to capture large-scale data sets, bringing along important scientific opportunities. Unsurprisingly, key method terms illustrate not only the computational practices around ML topics but also the disciplinary lens; for example, social science research is indicated by terms such as “case-study” and “content-analysis.”
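The frequency counts behind Figure 2A amount to tallying method terms over articles; a minimal sketch with an invented toy annotation set (the real counts come from the 11,909-article corpus):

```python
from collections import Counter

# Illustrative per-article method-term annotations (not the real data).
articles = [
    ["experiment", "regression"],
    ["mining", "classification"],
    ["experiment", "content-analysis"],
    ["mining", "neural-network", "deep-learning"],
]

# An article may contribute several terms; counts are per occurrence.
counts = Counter(term for terms in articles for term in terms)
print(counts.most_common(2))  # [('experiment', 2), ('mining', 2)]
```

Grouping the same counts by publication year yields the time series shown in Figure 2B.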

Figure 2.

The most frequent method terms identified in the corpus. The graph to the left with the colored bars shows the counts of the top 15 methods; the graph to the right shows the count of the top 10 methods over time, with a stacked area chart inserted showing the proportion over time. Note that we use the same colors to indicate the same methods.

Figure 2B further displays the use of the top 10 methods over time. This time-series graph is revealing in several ways. First, there is a clear trend of increasing use of all methods over time, corresponding to the overall expansion of interest in social media-derived data sets (Figure S1). Second, the variability of methods also increases as many method terms join in, including “classification,” “prediction,” “regression,” and “network-analysis.” Third, “content-analysis” was the first term used to analyze social media data, implying the initial interest in such data as digital traces to study social phenomena.

4.2. Delving into the Relational Space

4.2.1. Journal citation network

Next, we explore the method patterns of the (inter-)disciplinary research communities in the relational space based on the journal clusters. Figure 3 displays the journal citation network, showing how periodicals cite each other. Node size reflects degree—the number of periodicals a journal is connected to. The largest nodes—the most connected periodicals—in this space are PLOS ONE, Journal of Medical Internet Research, and Computers in Human Behavior, which originate in the Natural sciences, Medical and Health sciences, and Social sciences OECD fields, respectively. This suggests that these periodicals are the most active in research communication in this space. Node color indicates the cluster detected by the community detection algorithm. The graph illustrates the division of periodicals into distinct clusters in the journal citation space, as expected from the modularity analysis in Section 3.3.

Figure 3.

Journal citation network. Here nodes represent periodicals, and links are citation relations identified between these periodicals. Nodes are sized by degree and colored by clusters detected by the Infomap algorithm. We only show journals in the top 10 clusters (76% of the nodes) identified in the network.

The disciplinary composition of these clusters reflects this relatively sharp division. Figure 4 illustrates the intellectual patterns of the top eight clusters (74% of the periodicals) based on the proportion of periodicals in each OECD field category. Note that we normalize the plots to mitigate the statistical heterogeneity in field counts and cluster sizes5. As shown in Figure 4A, periodicals from each OECD field tend to converge in one central cluster, yet with important exceptions indicating cross-cluster dispersion. For example, Cluster 2 contains 74% of all periodicals from Engineering and Technology, and Cluster 1 includes 73% of Humanities periodicals. Patterns are more complex for the other fields, whose periodicals spread across many clusters: Approximately 60% of periodicals from the Natural sciences and Agricultural sciences are in Cluster 4, with the rest spread across other clusters; about 50% of Social sciences periodicals are in Cluster 1, with the remainder dispersed across many others, including Clusters 3, 6, and 8; and Medical and Health sciences journals are mainly distributed across Clusters 5, 3, and 7.

Figure 4.

Proportion of periodicals in each OECD field category across network clusters. The colored bars show the proportion of periodicals in OECD fields across the top eight clusters. Colors describe the clusters identified in the journal citation network in Figure 3. C1, C2, …C8 refer to Cluster 1, Cluster 2, …Cluster 8. Bars are normalized per OECD field in a and per cluster in b. From a, we see the proportion of clusters in each OECD field. For example, Cluster 2 contains over 74% of periodicals in Engineering and Technology, and the other seven clusters only include under 26% of Engineering and Technology periodicals. From b, we see each cluster’s OECD field composition. For example, more than 84% of periodicals in Cluster 1 are from Social sciences, and less than 16% of Cluster 1 are from the other five OECD fields.


Figure 4B further shows that clusters generally have a predominant field of study, although some are relatively discipline spanning. For example, Clusters 1, 2, and 5–8 primarily comprise work in Social sciences, Engineering and Technology, or Medical and Health sciences. In contrast, Clusters 3 and 4 connect many OECD-defined fields with different focal interests. Observing the disciplinary configuration of journal clusters, we see that periodicals tend to cluster and form distinctive intellectual subfields even across formally predefined field boundaries. In sum, this network yields a more nuanced organization than the OECD Category mapping, revealing that at the aggregate level, research using social media data tends to cluster around distinct subject areas, yet with notable cross-field interactions.
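The two normalizations behind panels A and B of Figure 4 can be sketched directly. This is a minimal numpy example with hypothetical counts (the field labels and numbers are illustrative, not the paper's data); rows are normalized for the per-field view (Figure 4A) and columns for the per-cluster view (Figure 4B):

```python
import numpy as np

# Hypothetical counts of periodicals: rows = OECD fields, columns = clusters C1..C3.
fields = ["Social sciences", "Engineering and Technology"]
counts = np.array([
    [50, 10, 40],   # Social sciences periodicals per cluster
    [ 5, 74, 21],   # Engineering and Technology periodicals per cluster
], dtype=float)

# Figure 4A: normalize per field (rows sum to 1) -> share of each field across clusters.
per_field = counts / counts.sum(axis=1, keepdims=True)

# Figure 4B: normalize per cluster (columns sum to 1) -> field composition of each cluster.
per_cluster = counts / counts.sum(axis=0, keepdims=True)
```

With these toy numbers, 74% of Engineering and Technology periodicals fall in the second cluster, matching the kind of statement read off Figure 4A.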

4.2.2. Patterns in method use across network clusters

The central part of our analysis, to which we now turn, concerns the method-level patterns of these (inter-)disciplinary journal clusters. We report the clusters’ method engagement, along with their OECD field composition.

We first investigate the proportion of clusters—the percentage of periodicals in each cluster—for each method, as shown in Figure 5A. This figure displays the most prominent cluster for the 10 most common methods6. Specifically, methods including “classification,” “prediction,” “mining,” “experiment,” “cluster,” “correlation,” and “regression” appear mostly in Cluster 2. This indicates that computer science-related methods are primarily used by researchers in Cluster 2—a research community characterized by the OECD field of Engineering and Technology. In contrast, “content-analysis,” “case-study,” and “network-analysis” are most prominent in Cluster 1 (characterized by the field of Social sciences). In a broader sense, this figure reflects the meeting points between (inter-)disciplinary research communities and method communities. For example, the method community that uses “experiment” to describe its work has the largest intersection with Cluster 2; the “content-analysis” community is mainly present in Clusters 1 and 3 (the latter characterized by Medical and Health sciences & Social sciences); and the “network-analysis” and “regression” method communities attract many research groups, including Clusters 1–3 and Cluster 4 (characterized by Natural sciences & Social sciences). Of course, unlike journal communities, these method communities are not mutually exclusive—reflecting that periodicals can engage with more than one method—so our mapping of intersections involves some degree of overlap. Nevertheless, we consider this to enrich our understanding and description of the studied domain with respect to method engagement across research communities.

Figure 5.

Proportion of periodicals using the most common methods across network clusters. The colored bars show the proportion of periodicals using the 10 most common method terms across clusters. Colors describe the clusters identified in the journal citation network in Figure 3. Bars are normalized per method in a and per cluster in b. C1, C2, …, C8 refer to Cluster 1, Cluster 2, …, Cluster 8. Similarly, M1, M2, …, M10 refer to Method1, Method2, …, Method10. From a, we see the proportion of clusters for each method. For example, for periodicals using “experiment,” Cluster 2 contains over 46% of them, and Cluster 1 only contains 21%. From b, we see each cluster’s method distribution. For example, more than 24% of periodicals in Cluster 1 use “content-analysis,” 19% use “case-study,” and only 3% use “classification.”


Figure 5B further shows the distribution of methods for each cluster, revealing which methods are most prominent in each and how method combinations structure our studied domain in the journal citation space. It is evident that these communities use a wide range of method terms. Some exhibit a relatively monomethod structure: Cluster 2, characterized by the OECD field of Engineering and Technology, uses “experiment,” “mining,” “classification,” “prediction,” and “cluster,” while Clusters 5 and 7 (characterized by Medical and Health sciences) connect “content-analysis” with “correlation” and “network-analysis.” Cluster 3, characterized by the OECD fields of Medical and Health sciences & Social sciences, uses “content-analysis” and “regression.” This cluster mainly contains interdisciplinary journals, such as Journal of Medical Internet Research and Health Communication. An interesting aspect is that the three health-related clusters exhibit different method combinations, corresponding to their different (sub-)disciplinary subject areas.

Moreover, some clusters present a multimethod structure—indicating contemporary, novel communication between disciplines and methodologies. For example, Cluster 1—characterized by the OECD field of Social sciences—mainly utilizes “content-analysis,” “case-study,” “experiment,” and “network-analysis,” reflecting the interdisciplinary research that has emerged in the digital age. Likewise, Clusters 6 and 8 (characterized by Social sciences) connect multiple methods, including “experiment,” “content-analysis,” “mining,” and “case-study.” Closer inspection reveals that these clusters primarily publish interdisciplinary research in computational social science and digital humanities, incorporating journals such as Computers in Human Behavior, New Media & Society, Journal of Interactive Marketing, and Computers & Education. Furthermore, Cluster 4 (characterized by Natural sciences & Social sciences, containing journals such as PLOS ONE and ISPRS International Journal of Geo-Information) displays the greatest variability in method terms—suggesting that periodicals in this cluster are more multidisciplinary and open to methodological exchange. Interestingly, the term “content-analysis” is prominent in most journal clusters, indicating the flexibility of content analysis as both a qualitative and quantitative method and its centrality in analyzing social media data to trace and understand the social world that produced such data. Overall, the patterns identified in the relational space demonstrate that method combinations largely structure research in this domain, as we observe an apparent entanglement between research communities and method preferences—largely correlated with the epistemic culture of (sub-)disciplinary subject areas.

4.3. Delving into the Embedding Space

To capture the nuanced semantic differences in how these methods are used, we examine the methods in the embedding space. We stress that the Word2Vec model simply treats these method terms as tokens. No relations are explicitly provided to the model; instead, the relational knowledge is captured through researchers’ usage of these terms. Figure 6 illustrates the principal component analysis (PCA) projection of these method vectors onto a two-dimensional space. PCA (Hotelling, 1933; Pearson, 1901) is a linear transformation technique for reducing the dimensionality of data while preserving most of its variability. Essentially, it transforms the data into a new coordinate system in which the principal components successively maximize variance and are uncorrelated with each other (Jollife & Cadima, 2016). Based on the cumulative variance, it is reasonable to interpret the first two components, because the proportion of variance explained by subsequent principal components drops off significantly. Together, these two principal components account for 28% of the total variance of the word vectors. Below, we provide a detailed interpretation of each component as an axis.
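The projection can be sketched with plain numpy via SVD of the mean-centered matrix (mathematically equivalent to PCA). The random matrix below merely stands in for the 50 most frequent method terms' 300-dimensional Word2Vec vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the 300-dimensional Word2Vec vectors of the top 50 method terms.
X = rng.normal(size=(50, 300))

# PCA via SVD of the mean-centered matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

coords = Xc @ Vt[:2].T            # 2-D projection plotted in Figure 6
explained = S**2 / (S**2).sum()   # proportion of variance per component
```

In the paper's data, `explained[0]` and `explained[1]` correspond to the 18.4% and 9.4% figures reported for the two axes.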

  • Axis 1 (18.4% of variance): Social sciences vs. Computer science

Figure 6.

PCA plot of the most frequent 50 method terms. This plot captures two important axes: Social sciences vs. Computer science and Methodological individualism vs. Relationalism. Each method is annotated and represented as a blue circle. We further provide an interactive version with all terms and color the method terms in red (see https://yangliuf95.github.io).


The first principal component differentiates computer science-based methods of examining social media data on the positive side of the axis from social science-based methods on the negative side. This component, which captures the largest share of variance in the high-dimensional (d = 300) word vectors, is thus associated with discipline. Specifically, the computer science end features method terms around ML-related topics such as “neural-network,” “svm,” “cnn,” and “deep-learning.” These methods can be applied in a wide range of settings where large-scale computation is of interest, from mining digital traces for social theory to producing new methods and techniques conditioned on the supply of social media data.

In contrast, the social science side highlights a more qualitative style of research, including “thematic-analysis,” “netnography,” “ethnography,” and “content-analysis.” Indeed, netnography is a newer, widely adopted method that enriches ethnography for the analysis of culture on social media, responding to the challenges of increasing mobility and digitalization in online ethnographic research (Kozinets, 2019). This suggests that methods at this end tend to analyze communicative content on social media platforms to address collective framing and attention patterns in society, as well as trace the diffusion and evolution of such patterns (Evans & Aceves, 2016).

  • Axis 2 (9.4% of variance): Methodological individualism vs. Relationalism

The second axis captures differences in methodological precept, particularly on the social science side. The positive side is largely based on methodological individualism, which aims to account for social phenomena in terms of individuals (Weber 1922, as cited in Miller, 1978). Researchers use method terms such as “t test,” “questionnaire,” and “regression” to describe how they use social media data to interpret social phenomena by analyzing individual thoughts and actions. The negative side, by contrast, generally holds a relational view concerned with contextual and relational characteristics (Abbott, 2007; Ritzer & Gindoff, 1992). Researchers employ methods such as “discourse-analysis,” “network-analysis,” and “natural-language-processing” to delve into the contextual, symbolic, and relational aspects of social life.

Taken as a whole, the PCA plot lets us inspect two features of each method—field (Social sciences vs. Computer science) and methodological precept (Methodological individualism vs. Relationalism)—by looking at the method’s position. Anchoring these two dimensions of the word embeddings reveals a detailed two-dimensional organizing principle for methods. For example, based on their positions, “discourse-analysis” can be interpreted as a social science-relationalist method, whereas “natural-language-processing” can be considered a computer science-relationalist method. Likewise, “survey” and “network-analysis” are both on the social science side but differ along the Methodological individualism vs. Relationalism axis, suggesting that the former is used to study individual actions and the latter relational patterns. This aligns precisely with the actual usage of these methods. Interestingly, the most common method term, “experiment,” lies in the middle of Axis 1 and on the positive side of Axis 2, indicating its extensive usage in both fields as an individualistic method. We note that together these two axes explain only around 30% of the variance in the data, suggesting that—although beyond the scope here—further investigation of additional principal components might yield interesting insights.

Next, we provide more insight into the associations among these methods, based on the cosine similarity between the original high-dimensional word vectors. A clustering of the methods is presented in Figure 7. From the dendrogram, we can see that the methods tend to form three distinctive groups, which we may broadly label social sciences (Group 1), computer science (Group 2), and statistics (Group 3). Intriguingly, the original high-dimensional space both provides a detailed mapping of associations between methods within the identified groups and differentiates the groups from each other. In particular, we observe generally low similarity between particular method pairs from Groups 1 and 2, as highlighted in Figure 7. From this dissimilarity between terms, one can infer that these methods have rarely been used in similar contexts; in turn, one can consider ways of combining and complementing them—such as ethnography and neural networks—and imagine a new treaty between them that benefits innovative communication and synergy of methods.
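The similarity-then-clustering pipeline behind Figure 7 can be sketched with numpy and scipy. The random vectors below are stand-ins for the method terms' Word2Vec vectors; the three-group cut mirrors the paper's grouping, not its actual result:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
vecs = rng.normal(size=(6, 300))   # stand-in for 300-d method vectors

# Cosine similarity matrix (heat map colors in Figure 7).
norm = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
sim = norm @ norm.T

# Hierarchical clustering on cosine distance (1 - similarity);
# scipy's linkage expects the condensed upper-triangular form.
dist = 1.0 - sim
iu = np.triu_indices(6, k=1)
Z = linkage(dist[iu], method="average")

# Cut the dendrogram into at most three groups.
groups = fcluster(Z, t=3, criterion="maxclust")
```

Average linkage is one common choice; the paper does not specify the linkage criterion, so treat it as an assumption here.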

Figure 7.

Heat map for the most common 40 method terms. Colors indicate the cosine similarity between terms, and the dendrogram indicates the grouping structure based on hierarchical clustering. We identify three groups of methods with high similarity, which we label social sciences (blue), computer science (green), and statistics (orange). We then use a red box to highlight the method pairs with low similarity.


The method patterns in the domain of social media data-based research can enrich our understanding of big social data developments, as well as inform the ongoing epistemological and methodological discussions on an empirical basis. By concentrating on the notions of method affinity and division, we have revealed how methods organize research in this domain. We believe that our observations, despite being mainly static, reflect broader patterns of (re-)composition taking place in science. As previous literature has noted, large-scale social data and computational methods are likely to reorganize research, fostering new interdisciplinary fields and methodological landscapes (Bartlett et al., 2018; Kang & Evans, 2020; Kitchin, 2014).

Informed by the nuances in our empirical analysis, we argue that distinctive communities of researchers use social media–derived data sets in ways that promote and combine particular methods, accordingly forming monomethod and multimethod communities. By delineating these method communities, we recognize new “data-method packages” in the digital age that highlight the diversity of data-method combinations, reworking the existing concept of “theory-method packages” in the science studies literature (Silvast & Virtanen, 2023). These packages suggest that, rather than methods simply serving the data—that is, only specific methods being used to analyze social media data or big social data—the choice of methods and epistemology is coshaped by the data yet remains flexible for research communities using social media data in actual practice.

In addition, the method affinities we observe in these communities provide counterevidence against radical claims about big social data developments, such as the redundancy of social science methods and the colonization of social research by computational methods (McFarland et al., 2016; Uprichard & Carrigan, 2015). Notably, the multimethod communities—mainly characterized by the social sciences—connect diverse methods and methodologies, including “content-analysis,” “case-study,” “experiment,” “mining,” and “network-analysis.” In observing such innovative communication between diverse scientific disciplines and traditions, our findings correspond to and underwrite an interdisciplinary trend under the prominence of big social data, not least in social research (Kitchin, 2014).

Looking in turn at the method divisions, we distill a two-dimensional organizing principle that designates the core axes of method variation in this domain, which we label Social sciences vs. Computer science and Methodological individualism vs. Relationalism. The first axis aligns epistemologically with disciplinary practices and their underlying paradigms. One side is strongly associated with the social sciences, including thematic analysis, ethnography, and content analysis; the other is mainly related to computer science, including neural networks and support vector machines. This finding is in line with previous claims about a new methodological divide between disciplines (Evans & Aceves, 2016; Wallach, 2018). Nevertheless, it is somewhat surprising that this separation remains so visible, given that interdisciplinary fields such as computational social science and digital humanities have coalesced in the digital age. The second axis, however, moves beyond the disciplinary division in a new direction that underlines the role of methodological precept—indicating differences in language usage between individualist and relationalist methods. Together, these two axes constitute a novel two-dimensional coordinate system, allowing us to locate each method’s unique position and draw both horizontal and vertical comparisons among methods. We therefore think that our mapping provides a more fine-grained organizing principle for methods used in this domain, cutting across the simple qualitative-quantitative dichotomy in existing research (Kang & Evans, 2020; Schwemmer & Wieczorek, 2020). Additionally, as our studied domain epitomizes the digital “data deluge,” we believe that our empirically grounded mapping of methods can be applied in broader contexts—other research fields and domains influenced by such data developments.

Moreover, it is interesting to note that the vector similarities indicate a generally low association between computational methods and a specific subset of social science methods, including ethnography, netnography, and discourse analysis. This finding suggests that, unlike other research spaces in the social sciences, engineering practices have not much penetrated research surrounding ethnography. This discrepancy within the social sciences could be attributed to the fact that ethnographic research, to a larger extent, involves researchers’ uncodified, person-embodied tacit knowledge (Wolfinger, 2002). Yet, to address the problems of our time, newer ethnographic practices have creatively and critically explored the possibilities of computational engagement, including “complementarity,” which attempts to combine quantitative width and qualitative depth by mixing computational and ethnographic methods, and “mutual attuning,” which goes beyond the complementary view through an iterative and mutual interweaving process (Blok & Pedersen, 2014; Isfeldt et al., 2022; Munk, Olesen, & Jacomy, 2022). This also suggests that word embedding models have the potential to support sound inferences for knowledge discovery in quantitative science studies.

This article sets out to present a holistic mapping of method patterns in an increasingly important research domain characterized by the deluge of big social data—social media data–based research. In particular, we aim to understand the emerging patterns of method use, communication, and community formation, as well as the underlying principles of division and affinity. We construct a novel embedding space for method detection, allowing us to extract knowledge content and relations in an unsupervised manner. Applying natural language processing (NLP) and network analysis tools to an original bibliometric data set containing 12,732 research articles using social media data, we reveal that method affinities to a large extent organize this domain, yet in two different ways: some communities embrace a culture of monomethod use, and others combine two or more methods in novel ways. Moreover, via vector similarity, we identify three distinctive method groups—social sciences, computer science, and statistics—and via PCA, we further distill a two-dimensional organizing principle for methods in this domain as a whole: Social sciences vs. Computer science and Methodological individualism vs. Relationalism. More generally, these analyses contribute to a more fine-grained mapping of the big social data method landscape that moves beyond the traditional qualitative-quantitative dichotomy—suggesting new methodological divisions and debates in the digital age.

We also point out a number of concerns and limitations of our analyses. First, it must be recognized that we do not claim that our three-level method classification scheme (i.e., technique, method, and methodology) is “correct” in an absolute sense. Indeed, it is extremely challenging to make such a distinction in a highly diverse data set. Even as ambiguities remain, we consider distinguishing method-level terms to be a basic requisite for using method detection in a meaningful and comparable way. Second, our analysis of how methods (re-)organize this domain is mostly static. Future longitudinal empirical work will contribute to a better understanding of how the organizational structure has changed over time. Third, due to space limitations, we only report the top-level patterns in method detection and network clustering, thereby risking overlooking the lower-level patterns. Another important limitation of this study is our data set and sampling strategy. As previously noted, the analyzed sample was confined to journal articles indexed in WoS. Although we considered the collected articles as a representative sample for investigating the method-level patterns in relevant research, future studies might benefit from expanding the data collection to other sources of knowledge (e.g., conference proceedings) and examining method patterns at other levels (e.g., techniques and algorithms). Additionally, future studies could extend the search queries, such as by adding more terms denoting international social networking sites or more variants denoting data from social media. Finally, future work can also include research conducted on nationally delineated social media sites and in other languages, whose inclusion would provide a more comprehensive picture of method patterns globally.

Despite these limitations and uncertainties, we consider our work to add to the quantitative science studies literature in terms of both empirical reach—extracting research methods from a highly diverse, large-scale data set and inspecting method patterns with an embedding model—and theoretical advancement—engendering hypotheses about new methodological divisions. Future work may extend our study to investigate method patterns in other fields and domains that index big (social) data developments, or other potential (co-)organizing principles, such as platform-, author-, or institution-based patterns.

We thank Snorre Ralund and Anna Rogers for their valuable suggestions and feedback on data analysis. We also thank Professor James Evans, Donghyun Kang, and the Knowledge Lab team for their insightful comments and discussions. We are grateful to the anonymous reviewers for their in-depth comments and wonderful suggestions.

Yangliu Fan: Conceptualization, Data curation, Investigation, Software, Visualization, Writing—original draft. Sune Lehmann: Conceptualization, Methodology, Software, Writing—review & editing. Anders Blok: Conceptualization, Investigation, Validation, Writing—review & editing.

The authors have no competing interests.

This research was supported by the DISTRACT Advanced Grant project, grant 834540 from the European Research Council (ERC).

The data set sourced from the WoS cannot be provided due to its proprietary nature; however, the search strings and the method list are provided in the Supplementary material. The trained phrase and embedding models, along with the code for analysis, have been published at https://github.com/YangliuF95/Method_embedding.

1. We used OECD Category to Web of Science Category Mapping 2012, in which all Web of Science categories (WC) are mapped into six major fields (https://help.prod-incites.com/inCites2Live/5305-TRS/version/default/part/AttachmentData/data/OECD%20Category%20Mapping.xlsx).

2. To ensure the robustness of our findings, we used the Leiden algorithm (Traag, Waltman, & van Eck, 2019) as an additional clustering algorithm and found it to produce results consistent with our main findings. Further details are provided in the Supplementary material, Leiden clusters.

3. We repeated our analysis using a directed journal citation network and found that our main results remained robust to this alternative network type. See Supplementary material, Directed journal citation network, for more details.

4. We manually sampled 10 articles from the unidentified articles and found that most are fuzzy about methods, except one mentioning the term “phenomenological approach,” which is not included in our list. We also tested the robustness of our algorithm by randomly sampling another 10 articles and confirmed that these articles mention the methods they actually apply.

5. We also used null model tests to establish statistical significance and confirmed that results from the normalized graphs are generally robust.

6. The 10 most common methods account for 67% of the method information of the nodes. We conducted an extended investigation of the top 20 methods; our interpretations of this space remained similar.

Abbott
,
A.
(
2007
).
Mechanisms and relations
.
Sociologica
,
2
,
1
22
.
Agarwal
,
A.
,
Xie
,
B.
,
Vovsha
,
I.
,
Rambow
,
O.
, &
Passonneau
,
R.
(
2011
).
Sentiment analysis of Twitter data
. In
Proceedings of the workshop on Language in Social Media (LSM 2011)
(pp.
30
38
).
Anderson
,
C
. (
2008
).
The end of theory: The data deluge makes the scientific method obsolete
. https://www.wired.com/2008/06/pb-theory/
Bail
,
C. A.
,
Argyle
,
L. P.
,
Brown
,
T. W.
,
Bumpus
,
J. P.
,
Chen
,
H.
, …
Volfovsky
,
A.
(
2018
).
Exposure to opposing views on social media can increase political polarization
.
Proceedings of the National Academy of Sciences of the United States of America
,
115
(
37
),
9216
9221
. ,
[PubMed]
Bartlett
,
A.
,
Lewis
,
J.
,
Reyes-Galindo
,
L.
, &
Stephens
,
N.
(
2018
).
The locus of legitimate interpretation in Big Data sciences: Lessons for computational social science from -omic biology and high-energy physics
.
Big Data & Society
,
5
(
1
).
Becerra
,
G.
, &
Ratovicius
,
C.
(
2022
).
Social sciences and humanities on big data: A bibliometric analysis
.
Journal of Information Systems and Technology Management
,
19
(
e202219011
). https://www.scielo.br/j/jistm/a/CMF74vhzSLZCbQKZ6pVnN4r
Blok
,
A.
, &
Pedersen
,
M. A.
(
2014
).
Complementary social science? Quali-quantitative experiments in a Big Data world
.
Big Data & Society
,
1
(
2
).
Bohlin
,
L.
,
Edler
,
D.
,
Lancichinetti
,
A.
, &
Rosvall
,
M.
(
2014
).
Community detection and visualization of networks with the map equation framework
. In
Y.
Ding
,
R.
Rousseau
, &
D.
Wolfram
(Eds.),
Measuring scholarly impact
(pp.
3
34
).
Cham
:
Springer
.
Boyack
,
K. W.
, &
Klavans
,
R.
(
2010
).
Co-citation analysis, bibliographic coupling, and direct citation: Which citation approach represents the research front most accurately?
Journal of the American Society for Information Science and Technology
,
61
(
12
),
2389
2404
.
Boyd
,
D.
, &
Crawford
,
K.
(
2012
).
Critical questions for Big Data
.
Information, Communication & Society
,
15
(
5
),
662
679
.
Carpenter
,
M. P.
, &
Narin
,
F.
(
1973
).
Clustering of scientific journals
.
Journal of the American Society for Information Science
,
24
(
6
),
425
436
.
Clarke
,
A. E.
, &
Star
,
S. L.
(
2008
).
The social worlds framework: A theory/methods package
. In
The handbook of science and technology studies
(pp.
113
137
).
Cowls
,
J.
, &
Schroeder
,
R.
(
2015
).
Causation, correlation, and big data in social science research
.
Policy and Internet
,
7
(
4
),
447
472
.
Edelmann, A., Wolff, T., Montagne, D., & Bail, C. A. (2020). Computational social science and sociology. Annual Review of Sociology, 46, 61–81.
Evans, J. A., & Aceves, P. (2016). Machine translation: Mining text for social theory. Annual Review of Sociology, 42, 21–50.
Golbeck, J., & Hansen, D. L. (2011). Computing political preference among Twitter followers. In Proceedings of the SIGCHI conference on human factors in computing systems (pp. 1105–1108).
Grigoropoulou, N., & Small, M. L. (2022). The data revolution in social science needs qualitative research. Nature Human Behaviour, 6(7), 904–906.
Halford, S., Weal, M., Tinati, R., Carr, L., & Pope, C. (2017). Understanding the production and circulation of social media data: Towards methodological principles and praxis. New Media & Society, 20(9), 3341–3358.
Havey, N., & Chang, M. J. (2022). Do journals have preferences? Insights from The Journal of Higher Education. Innovative Higher Education, 47(6), 915–926.
Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24(6), 417–441.
Housley, W., Procter, R., Edwards, A., Burnap, P., Williams, M., … Greenhill, A. (2014). Big and broad social data and the sociological imagination: A collaborative response. Big Data & Society, 1(2).
Iosifidis, P., & Nicoli, N. (2020). The battle to end fake news: A qualitative content analysis of Facebook announcements on how it combats disinformation. International Communication Gazette, 82(1), 60–81.
Isfeldt, A. S., Enggaard, T. R., Blok, A., & Pedersen, M. A. (2022). Grøn Genstart: A quali-quantitative micro-history of a political idea in real-time. Big Data & Society, 9(1).
Jamshed, S. (2014). Qualitative research method-interviewing and observation. Journal of Basic and Clinical Pharmacy, 5(4), 87–88.
Johansson, L. G., Grønvad, J. F., & Budtz Pedersen, D. (2020). A matter of style: Research production and communication across humanities disciplines in Denmark in the early-twenty-first century. Poetics, 83, 101473.
Jolliffe, I. T., & Cadima, J. (2016). Principal component analysis: A review and recent developments. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 374(2065), 20150202.
Jussila, J., Menon, K., Gupta, J., & Kärkkäinen, H. (2017). Who is who in big social data? A bibliographic network analysis study. Proceedings of the 4th European Conference on Social Media ECSM 2017, 4, 161–169.
Kang, D., & Evans, J. (2020). Against method: Exploding the boundary between qualitative and quantitative studies of science. Quantitative Science Studies, 1(3), 930–944.
Karaboğa, T., Karaboğa, H. A., & Şehitoğlu, Y. (2020). The rise of big data in communication sciences: A bibliometric mapping of the literature. Connectist: Istanbul University Journal of Communication Sciences, 58, 169–190.
Kitchin, R. (2014). Big Data, new epistemologies and paradigm shifts. Big Data & Society, 1(1).
Kozinets, R. V. (2019). Netnography: The essential guide to qualitative social media research (3rd ed.). Thousand Oaks, CA: Sage. https://us.sagepub.com/en-us/nam/netnography/book260905
Krieg, L. J., Berning, M., & Hardon, A. (2017). Anthropology with algorithms? Medicine Anthropology Theory, 4(3).
Lambiotte, R., & Panzarasa, P. (2009). Communities, knowledge creation, and information diffusion. Journal of Informetrics, 3(3), 180–190.
Lazer, D., Hargittai, E., Freelon, D., Gonzalez-Bailon, S., Munger, K., … Radford, J. (2021). Meaningful measures of human society in the twenty-first century. Nature, 595(7866), 189–196.
Lazer, D., Pentland, A., Adamic, L., Aral, S., Barabási, A. L., … Van Alstyne, M. (2009). Social science: Computational social science. Science, 323(5915), 721–723.
Liang, T. P., & Liu, Y. H. (2018). Research landscape of business intelligence and big data analytics: A bibliometrics study. Expert Systems with Applications, 111, 2–10.
Mayer-Schönberger, V., & Cukier, K. (2013). Big data: A revolution that will transform how we live, work, and think. American Journal of Epidemiology, 179(9), 1143–1144.
McFarland, D. A., Lewis, K., & Goldberg, A. (2016). Sociology in the era of Big Data: The ascent of forensic social science. American Sociologist, 47(1), 12–35.
Merton, R. K. (1973). The sociology of science (N. W. Storer, Ed.). Chicago, IL: University of Chicago Press.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th international conference on neural information processing systems (Vol. 2, pp. 3111–3119).
Miller, R. W. (1978). Methodological individualism and social explanation. Philosophy of Science, 45(3), 387–414.
Mingers, J. (2001). Combining IS research methods: Towards a pluralist methodology. Information Systems Research, 12(3), 240–259.
Mohammadi, E., & Karami, A. (2020). Exploring research trends in big data across disciplines: A text-mining analysis. Journal of Information Science, 48(1), 44–56.
Munk, A. K., Olesen, A. G., & Jacomy, M. (2022). The thick machine: Anthropological AI between explanation and explication. Big Data & Society, 9(1).
Newman, M. E. J. (2006). Modularity and community structure in networks. Proceedings of the National Academy of Sciences of the United States of America, 103(23), 8577–8582.
Nielsen, K. H. (2013). Scientific communication and the nature of science. Science & Education, 22(9), 2067–2086.
Olteanu, A., Castillo, C., Diaz, F., & Kıcıman, E. (2019). Social data: Biases, methodological pitfalls, and ethical boundaries. Frontiers in Big Data, 2, 13.
Parks, M. R. (2014). Big data in communication research: Its contents and discontents. Journal of Communication, 64(2), 355–360.
Pearson, K. (1901). LIII. On lines and planes of closest fit to systems of points in space. London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11), 559–572.
Ritzer, G., & Gindoff, P. (1992). Methodological relationism: Lessons for and from social psychology. Social Psychology Quarterly, 55(2), 128–140.
Rohani, V. A., Shayaa, S., & Babanejaddehaki, G. (2016). Topic modeling for social media content: A practical approach. In International Conference on Computer and Information Sciences (ICCOINS) (pp. 397–402).
Rosvall, M., Axelsson, D., & Bergstrom, C. T. (2009). The map equation. European Physical Journal: Special Topics, 178(1), 13–23.
Saunders, M., Lewis, P., & Thornhill, A. (2019). Research methods for business students (8th ed.). Pearson International. https://elibrary.pearson.de/book/99.150005/9781292208794
Schnabel, T., Labutov, I., Mimno, D., & Joachims, T. (2015). Evaluation methods for unsupervised word embeddings. In Proceedings of the 2015 conference on empirical methods in natural language processing (pp. 298–307).
Schwemmer, C., & Wieczorek, O. (2020). The methodological divide of sociology: Evidence from two decades of journal publications. Sociology, 54(1), 3–21.
Serrano, M. Á., Boguñá, M., & Vespignani, A. (2009). Extracting the multiscale backbone of complex weighted networks. Proceedings of the National Academy of Sciences of the United States of America, 106(16), 6483–6488.
Silvast, A., & Virtanen, M. J. (2023). On theory–methods packages in science and technology studies. Science, Technology, & Human Values, 48(1), 167–189.
Soleimani-Roozbahani, F., Rajabzadeh Ghatari, A., & Radfar, R. (2019). Knowledge discovery from a more than a decade studies on healthcare Big Data systems: A scientometrics study. Journal of Big Data, 6, 8.
Šubelj, L., van Eck, N. J., & Waltman, L. (2016). Clustering scientific publications based on citation relations: A systematic comparison of different methods. PLOS ONE, 11(4), e0154404.
Tanzil, S. M. S., Hoiles, W., & Krishnamurthy, V. (2017). Adaptive scheme for caching YouTube content in a cellular network: Machine learning approach. IEEE Access, 5, 5870–5881.
Traag, V. A., Waltman, L., & van Eck, N. J. (2019). From Louvain to Leiden: Guaranteeing well-connected communities. Scientific Reports, 9(1), 5233.
Tulkens, S., Hilte, L., Lodewyckx, E., Verhoeven, B., & Daelemans, W. (2016). A dictionary-based approach to racism detection in Dutch social media. arXiv:1608.08738.
Uprichard, E., & Carrigan, M. (2015). Emma Uprichard: Big data and “methodological genocide”—Methodspace. https://www.methodspace.com/blog/emma-uprichard-big-data-methodological-genocide
van Vliet, L., Törnberg, P., & Uitermark, J. (2020). The Twitter parliamentarian database: Analyzing Twitter politics across 26 countries. PLOS ONE, 15(9), e0237073.
Venturini, T., Bounegru, L., Gray, J., & Rogers, R. (2018). A reality check(list) for digital methods. New Media & Society, 20(11), 4195–4217.
Wallach, H. (2018). Viewpoint: Computational social science ≠ computer science + social data. Communications of the ACM, 61(3), 42–44.
Wang, Y., & Bowers, A. J. (2016). Mapping the field of educational administration research: A journal citation network analysis. Journal of Educational Administration, 54(3), 242–269.
Wolfinger, N. H. (2002). On writing fieldnotes: Collection strategies and background expectancies. Qualitative Research, 2(1), 85–93.
Yuan, E., Feng, M., & Liu, X. (2019). The R/evolution of civic engagement: An exploratory network analysis of the Facebook groups of occupy Chicago. Information, Communication & Society, 22(2), 267–285.

Author notes

Handling Editor: Vincent Larivière

This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. For a full description of the license, please visit https://creativecommons.org/licenses/by/4.0/legalcode.