Tracing data: A survey investigating disciplinary differences in data citation

Abstract Data citations, or citations in reference lists to data, are increasingly seen as an important means to trace data reuse and incentivize data sharing. Although disciplinary differences in data citation practices have been well documented via scientometric approaches, we do not yet know how representative these practices are within disciplines. Nor do we yet have insight into researchers’ motivations for citing—or not citing—data in their academic work. Here, we present the results of the largest known survey (n = 2,492) to explicitly investigate data citation practices, preferences, and motivations, using a representative sample of academic authors by discipline, as represented in the Web of Science (WoS). We present findings about researchers’ current practices and motivations for reusing and citing data and also examine their preferences for how they would like their own data to be cited. We conclude by discussing disciplinary patterns in two broad clusters, focusing on patterns in the social sciences and humanities, and consider the implications of our results for tracing and rewarding data sharing and reuse.


INTRODUCTION
Data sharing and reuse are pillars of open science. Sharing data can enable transparency in research, while reusing data created by other people offers the potential to validate existing findings and improve scientific efficiency (Baker, 2016; National Institutes of Health, 2023; Pasquetto, Randles, & Borgman, 2017). Although (open) data objects, such as databases, data collections, and data sets, are reused (Federer, 2019), such reuse is often invisible or is not easy to trace (Lane, Mulvany, & Nathan, 2020; van de Sandt, Nielsen et al., 2019).
Data citations (i.e., citations in reference lists to data) are considered to be key to tracing data reuse and incentivizing data sharing (Lowenberg, Chodacki et al., 2019). Despite numerous advocacy efforts to encourage and standardize data citation (Data Citation Synthesis Group, 2014; Make Data Count, n.d.), such citations are rare in the academic literature (Ninkov, Gregory et al., 2021; Peters, Kraker et al., 2016). If data reuse is acknowledged in publications, data are usually mentioned in a footnote or within the full text of publications (Park, You, & Wolfram, 2018; van de Sandt, 2021).
Data reuse has been conceptualized as a practice that exists on a spectrum (Gregory, 2021; Pasquetto et al., 2019). Pasquetto and colleagues (2019) in particular propose a continuum of data reuse spanning from more-frequent comparative activities to less-frequent integrative uses (i.e., bringing together data for new analysis or to identify new patterns).
Although it is commonly accepted that data citations indicate some type of use, not all uses of data will be captured in a publication or in a citation (Borgman, 2016; Federer, 2019). Using data in teaching, to calibrate instruments, or to verify results, for example, may not typically be recognized or cited in an academic publication (Gregory, 2021).

Practices of Citing and Mentioning Data
The terminology surrounding data citation practices varies in the literature. Here, we differentiate between data citations, which we define as referring to data objects (i.e., data sets, repositories, collections of data) in reference lists, and data mentions, which refer to data objects throughout other sections of a publication, including in footnotes, supplementary material, figures, and acknowledgments. Building on the work of van de Sandt et al. (2019), we further define indirect data citations as citations to publications related to data (i.e., to papers analyzing data or to data papers). These definitions are based primarily on the location of a reference. Unlike other proposed definitions, we do not use the terms formal or informal to differentiate between types of citations, as "formality" is defined differently across communities. We also differentiate between how these methods can be used to trace signs of data reuse. Table 1 summarizes these definitions and relates them to other terms used in the existing literature.
Many studies of data citation practices employ scientometric approaches: analyzing bibliographies, entire publications, or bibliometric databases to detect mentions of data objects and traces of data reuse. Most scientometric studies draw on similar bibliometric sources, such as Clarivate's Data Citation Index (DCI), a database of data records from selected repositories with related citation information (Clarivate, 2022), or DataCite, a nonprofit organization providing persistent identifiers (PIDs) and services for research data and other research outputs (DataCite, n.d.). Other studies draw on data from curated data repositories, particularly the Inter-university Consortium for Political and Social Research (ICPSR) (e.g., Banaeefar, Burchart et al., 2022; Lafia, Fan et al., 2022; van de Sandt, 2021). Studies across these sources document differences regarding data supplies and a variation in methods for both citing and mentioning data.

Data citations
In an analysis using the DCI, Robinson-García et al. (2016) find that 88% of indexed data remained uncited. The majority of data objects with citations in the DCI are from repositories in crystallography and biomedicine, perhaps reflecting more established infrastructures and data sharing norms in these fields (Robinson-García et al., 2016). Other studies confirm the uncitedness of most data indexed in the DCI and document differences in broadly defined disciplinary fields, particularly in the natural and life sciences (e.g., Peters et al., 2016). Park and Wolfram (2017) also confirm the greater number of data citations in the biomedical and physical sciences observed by Robinson-García et al. (2016) and further suggest that self-citation and citation of coauthors are common.
In a study of DataCite, Robinson-García et al. (2017) observe a variety of skewed distributions, where 2% of data centers account for 80% of data objects and a handful of repositories contain DOIs for related scientific publications. Such skewed distributions have also been observed in ocean science data in DataCite, where data reuse can be primarily attributed to data from a few organizations or by the data creators (Dudek, Mongeon et al., 2019).
A more recent analysis of DataCite documents an overall lack of citation relations between data and publications; approximately 1% of data sets in the corpus of nearly 8.5 million records contain citation information (Ninkov et al., 2021). The authors also identify a dearth of disciplinary metadata in the corpus; of the data sets that have both disciplinary and citation information, the majority come from the natural sciences, specifically earth and environmental or biological sciences.
Robinson-García et al. (2016) also find that different disciplines cite different types of data objects. The majority of data citations in the DCI in the social sciences and humanities are to data studies, defined as studies and experiments with associated data (i.e., census data (Clarivate, 2022)). Nearly all citations in engineering and technology and "science" are to data sets (e.g., single data files). Even if data objects are cited in reference lists, data citations vary in their formats, completeness, use of persistent identifiers (PIDs), and overall adherence to citation guidelines (Mayo et al., 2016; Mooney & Newton, 2012; van de Sandt et al., 2019). PIDs are particularly important, as they provide a sustainable mechanism for identifying and locating individual data objects (Peters et al., 2016).
In 2012, data citations in the social sciences and humanities lacked PIDs, publisher information, and electronic retrieval location (Mooney & Newton, 2012). Roughly 10 years later, data citations to social science data in ICPSR included many traditional metadata elements (e.g., title, author, and publication date), although the majority still lacked PIDs (van de Sandt, 2021). Moss and Lyle (2018, 2019) further identify a spectrum of data citations and data mentions that do not include PIDs (i.e., citations which are "almost complete" to data mentions that are "barely there," consisting primarily of a data set's title). In many cases, this lack of identifiers and other citation elements stands in opposition to explicitly stated data citation guidelines from both data repositories and journals (Mayo et al., 2016; van de Sandt, 2021), which are designed to facilitate long-term data identification.

Data mentions and indirect data citations
A lack of data citations does not mean that data are not acknowledged in publications. Researchers refer to data throughout sections of traditional academic papers, as well as in figures, tables, and captions (Mooney & Newton, 2012; Park et al., 2018; Pepe, Goodman et al., 2014). Such indications of data use may not be directly traceable by automated indexes (Table 1) but may rather remain hidden, camouflaged in data mentions or indirect data citations, particularly to publications in which data have been previously analyzed (van de Sandt et al., 2019). Other indirect data citations reference data papers, papers dedicated to describing data and their contexts of creation (Callaghan, Donegan et al., 2012). Although data papers are increasingly cited within scholarly communication, initial evidence suggests that the number of citations to data papers varies by discipline and may not indicate actual data reuse (Jiao & Darch, 2020).
Further disciplinary patterns of data mentions and indirect data citations have also been observed. In an analysis of genetics and heredity data in the DCI, Park et al. (2018) demonstrate a strong tendency in biomedical fields to mention data within the main text of articles; fewer data mentions occur in other areas of a publication (e.g., in acknowledgements or supplementary material). This pattern was also observed in a study of three openly available oceanographic data sets (Belter, 2014) and an analysis of life science data published in Dryad (Mayo et al., 2016).

Quantitative Science Studies 626
van de Sandt (2021) analyzed data and software citation in the social sciences, using data from ICPSR, and in high-energy physics (HEP), using data from CERN. Mentions to data from ICPSR occur most frequently in the methodology or in a dedicated "data" section of a publication and often consist of the data title and year but do not have other identifying or descriptive elements, such as the study acronym or version number. Data mentions in HEP are more heterogeneous and their exact location more difficult to classify, reflecting the variety of publication structures in the subdisciplines of HEP (van de Sandt, 2021).
When discussing bibliometric studies of disciplinary differences, it is important to note the role of classification systems when interpreting the results. Each data source and methodological approach uses a particular disciplinary or subject classification, complicating comparisons. For example, data sets and data studies within the DCI receive the subject classification of the repository in which they are published (Force & Robinson, 2014), whereas DataCite relies heavily on disciplinary metadata provided by data repositories, which can then be enhanced or mapped to other classifications (Garza, Strecker et al., 2021).
Repository-based analyses (e.g., at ICPSR or CERN) subsume many disciplinary subgroups within broad categories, such as "social sciences" or "high-energy physics." We also make use of broad disciplinary categories to facilitate comparison in this study, but we recognize that such comparisons are challenging and that examining disciplinary (data) practices at a high level can potentially obscure differences in subfields and research communities (Ninkov et al., 2022).

Motivations for Citing and Mentioning Data
Motivations for citing academic literature have long been studied and theorized in scientometrics and related fields (see Bornmann and Daniel (2008) and Tahamtan and Bornmann (2019) for reviews). Although citations can be used to acknowledge intellectual and cognitive influences (Merton, 1973, 1988), citation motivations and practices are also socially situated and constructed (Collins, 2004; Knorr-Cetina, 1981). Citations are therefore made for a variety of reasons, including persuasion (Gilbert, 1977); authority claims (Moed & Garfield, 2004); paying homage to pioneers and colleagues; or correcting and criticizing earlier work (Garfield, 1965).
Although data citations and mentions are largely taken as a sign of data use, there is a paucity of empirical evidence and conceptual development about motivations for citing or mentioning data. Existing literature synthesizes arguments made by those working to encourage data citation, rather than examining actual citation motivations of researchers themselves. Such arguments focus on motivating researchers to cite data as a way to connect data and literature, to facilitate data discovery and reproducibility, to understand the use and impact of data, and to recognize and reward data management work (Mayernik, 2012; Silvello, 2018).
Work on data citation undertaken from the perspective of research infrastructures often focuses on the practical uses of citations and metrics to demonstrate the value of the infrastructures themselves (Mayernik, Hart et al., 2017). In this context, data citations may be made to persuade others of the quality of data used in a particular study or to credit and reward data providers (Mayernik et al., 2017). Although it is debatable if these motivations provide direct incentives for researchers to cite data in reference lists (Mayernik, 2012), recent surveys show that the vast majority of respondents believe that data citations would provide an important credit mechanism for sharing research data (Digital Science, Goodey et al., 2022; Tenopir, Allard et al., 2020).

This belief reflects the current academic reward system, where citations to scholarly literature are traditionally viewed as the primary currency, what Merton calls "pellets of peer recognition" (1988, p. 621). Literature citations and making data (openly) available have also been shown to be linked. Piwowar and colleagues were among the first to demonstrate a citation advantage for articles that have openly available data within cancer research (Piwowar, Day, & Fridsma, 2007) and genetics (Piwowar & Vision, 2013), findings corroborated in an analysis of papers with data availability statements in publications in PLOS and BiomedCentral (Colavizza, Hrynaszkiewicz et al., 2020). These findings suggest that an increase in literature citations could be a means of incentivizing researchers to share their data, and to cite the data of others. It remains unclear, however, if accruing additional literature citations is in fact a motivating factor for sharing, citing, or mentioning data in practice.
Data citation motivations are not often explored through a disciplinary lens. In a move towards studies in this direction, Banaeefar et al. (2022) classify the context and types of citations to data in ICPSR. They report that data citations are typically made in order to refer to findings from another study; provide a brief data point as background information; and acknowledge the use of a survey instrument, experimental measure, or comparison of methodological approaches.

METHODS AND DATA
Asking researchers directly about their practices and motivations can add context to the literature reviewed in Section 2. Surveys have been increasingly used as a way of measuring data sharing and reuse practices within disciplines. Tenopir and colleagues conducted a series of survey studies (Tenopir et al., 2011; Tenopir, Christian et al., 2018; Tenopir, Dalton et al., 2015; Tenopir, Rice et al., 2020), documenting that perceptions of data sharing vary significantly by discipline. Schmidt, Gemeinholzer, and Treloar (2016) controlled for disciplinary differences using a two-sample comparison approach in their survey. Annual surveys about disciplinary data practices are also conducted by academic publishers and private companies; for example, Digital Science, owner of figshare (Digital Science, Hahnel et al., 2020; Digital Science, Simons et al., 2021; Digital Science et al., 2022).
Unlike our approach, these surveys relied on convenience samples and did not aim for representativity according to academic disciplines. The majority also focus on data sharing and reuse, rather than explicitly investigating data citation.

Questionnaire
The questionnaire (Gregory, Ninkov, Peters et al., 2022) was designed and scripted in SurveyMonkey. It employed a branching design with two primary branches: one for researchers who reuse data and one for those who do not, whom we term nonreusers. Researchers reusing data were asked a maximum of 28 questions; nonreusers were asked up to 22 questions. The questionnaire consisted of three sections: (1) Reusing and Citing Data, where participants were asked about their practices, preferences regarding their own data, and their citation motivations; (2) Rewarding Data Management; and (3) Demographics.
Questions were designed based on past research in data reuse (e.g., Gregory et al., 2020; Pasquetto et al., 2019), data citation (e.g., Robinson-García et al., 2016; Silvello, 2018; van de Sandt, 2021), citation motivations (Garfield, 1965; Mayernik, 2012; Mayernik et al., 2017), and academic reward (National Information Standards Organization, 2016). Question types included binary, multiple choice, five-point Likert scale, multiple response, and open-ended questions; the exact number of each question type varied by survey branch. This paper reports the results from questions in the first section of the questionnaire, Reusing and Citing Data, and excludes open-ended questions from the analysis.
We improved the understandability and accuracy of the questionnaire in two rounds of review. We first distributed the questionnaire to experts in scientometrics, research data management, and survey research to test the content, phrasing, and overall design. We then conducted a pilot study with a stratified random sample of 1,000 researchers using the recruitment and sampling methodology described in Section 3.2. The pilot study yielded a 1.2% response rate; responses from the pilot study are not reported in our results.

Sampling and Recruitment
Our population of interest consisted of researchers across disciplines who have published a paper indexed in WoS between 2016 and 2020. We aimed to create a representative sample of this population according to disciplinary domain. To do this, we used a two-step approach, incorporating the subject classification of journals in which authors have published and researchers' own disciplinary identification, a process detailed in our earlier work (Gregory, Ninkov, Ripp et al., 2022) and outlined below.
In the first step, we determined the percentage of researchers by discipline according to journal subject classification. We queried the Observatoire des Sciences et des Technologies (OST) local WoS database for articles published between 2016 and 2020. The retrieved articles had both an associated email address for the corresponding author and a journal-level subject classification assignment, according to the National Science Foundation (NSF) journal-level classification. The result of this query was 5.8 million unique email addresses associated with 8.2 million articles. To avoid underrepresentation of humanities researchers in the email distribution, we used the distribution of articles in subsequent steps.
To facilitate comparison with past work, we mapped the NSF classification scheme for retrieved articles to the Organisation for Economic Cooperation and Development's (OECD) revised Field of Science and Technology (FOS) classification (Ninkov et al., 2022). The FOS schema, with six high-level categories and 42 subcategories, provides a balance between breadth and specificity (Organisation for Economic Cooperation and Development, 2007). Using this distribution, we determined the needed number of respondents from each discipline to achieve a confidence interval of 0.025 in our statistical analysis. We then randomly sampled unique emails accordingly. We sent 158,600 recruitment emails between January 28, 2022 and March 4, 2022 via SurveyMonkey. One reminder email was sent after 3 weeks to encourage participation.
Classifying researchers via journal-level subject classifications can be problematic. Researchers may publish in journals in multiple fields, and journal-level classifications may not accurately reflect the subject of individual articles. Participants therefore also selected their own FOS subdisciplines in the questionnaire. We mapped participants' selected subdisciplines to the six main FOS disciplines and compared this to our desired sampling distributions, as responses were received. We used the participants' classification to determine if our desired disciplinary distributions had been met. We sent an additional round of 5,000 recruitment emails to researchers in medical and health to match our desired number of respondents. Data collection stopped once the desired minimum number of respondents in all fields was met. Table 2 summarizes the results of our sampling and mapping methodology.
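The text does not state which formula was used to derive the needed number of respondents. As a rough sketch only, assuming a conventional margin-of-error calculation for a proportion (Cochran's formula with z = 1.96 and p = 0.5, both assumptions) followed by proportional allocation across disciplines, the calculation might look like this. The discipline shares are hypothetical placeholders, not the values in Table 2.

```python
import math


def required_sample_size(margin_of_error, z=1.96, p=0.5):
    """Cochran's sample size for a proportion (assumed formula, not from the source)."""
    return math.ceil(z ** 2 * p * (1 - p) / margin_of_error ** 2)


def proportional_allocation(total, shares):
    """Split a total sample across strata in proportion to each stratum's share."""
    return {name: math.ceil(total * share) for name, share in shares.items()}


# Hypothetical article shares per FOS discipline (illustrative only).
hypothetical_shares = {
    "natural_sciences": 0.40,
    "medical_and_health": 0.30,
    "engineering_and_technology": 0.15,
    "social_sciences": 0.08,
    "agricultural_sciences": 0.04,
    "humanities": 0.03,
}

total_needed = required_sample_size(0.025)
allocation = proportional_allocation(total_needed, hypothetical_shares)
```

Rounding each stratum up means the allocated total can slightly exceed the overall target, which errs on the side of a larger sample.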


Survey Response, Data Preparation, and Data Analysis
We received 3,632 responses, 2,509 of which were complete, yielding a survey completion rate of 68.6%. Of those who did not complete the survey, 65.2% of nonreusers dropped out after the third question and 74.6% dropped out after the fourth question. Of reusers with incomplete responses, 63.8% stopped responding after the fourth question. Incomplete responses were excluded from this analysis. During data cleaning, we identified and removed 17 respondents whose responses had been incorrectly recorded in the survey system, potentially because participants used the browser back button. This yielded 2,492 complete responses and an uncorrected response rate of 1.57%. Controlling for invalid emails, bounced emails, and opt-outs (n = 5,201) produced a response rate of 1.62%, similar to a survey using comparable recruitment methods (Gregory et al., 2020). We recoded ordinal variables and multiple-choice responses to account for the branching design of the survey. Codes, variables, and data cleaning steps are further explained in the data dictionary and documentation published with the anonymized survey data (Ninkov et al., 2023).
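The two response rates can be reproduced from the counts reported in the text. The short sketch below uses only those figures (158,600 recruitment emails, 5,201 invalid, bounced, or opted-out addresses, 2,509 complete responses, and 17 removed in cleaning); the function and variable names are our own.

```python
def percentage(numerator, denominator):
    """Return numerator / denominator as a percentage rounded to two decimals."""
    return round(100 * numerator / denominator, 2)


emails_sent = 158_600          # recruitment emails reported in the sampling section
undeliverable_or_opt_out = 5_201
complete_responses = 2_509
removed_in_cleaning = 17

usable_responses = complete_responses - removed_in_cleaning
uncorrected_rate = percentage(usable_responses, emails_sent)
corrected_rate = percentage(usable_responses, emails_sent - undeliverable_or_opt_out)

print(usable_responses, uncorrected_rate, corrected_rate)
```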
Data were analyzed using Excel and SPSS. Normality testing indicated the use of nonparametric tests for significance. Table 3 summarizes the statistical tests used for each question, namely a Kruskal-Wallis H test along with a Chi-squared test coupled with Cramér's V to measure the substantive significance. Questions that were the same from both branches were combined for analysis (e.g., questions 7 and 15 or 8 and 16); only questions reported in this paper are included in Table 3. To analyze multiple response questions, we treated each possible variable as a single question and performed the appropriate statistical test for each variable.
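To illustrate how Cramér's V relates to the Chi-squared statistic, the following sketch computes both in plain Python. It is not the SPSS procedure used in the study; the helper names are our own, and the table shapes used to check the reported values (six disciplines by two or four response categories) are inferred from the reported degrees of freedom rather than stated in the text.

```python
import math


def chi_squared(table):
    """Pearson chi-squared statistic and total count for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    return statistic, n


def cramers_v(chi2, n, n_rows, n_cols):
    """Cramér's V: sqrt(chi2 / (n * (min(rows, cols) - 1)))."""
    return math.sqrt(chi2 / (n * (min(n_rows, n_cols) - 1)))


# Check against two results reported in the findings (table shapes are assumptions).
v_reuse = round(cramers_v(27.18, 2492, 6, 2), 3)        # reported as V = .104
v_data_type = round(cramers_v(155.04, 2026, 6, 4), 3)   # reported as V = .160
```

Because V rescales the Chi-squared statistic by sample size and table dimensions, it allows the strength of association to be compared across questions with different numbers of respondents.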
1 Percentages are used in reporting to enhance readability. Chi-squared tests were conducted using observed counts. Observed counts as well as percentages are provided for questions with significant differences in Supplementary material A.
As seen in Table 2, we received relatively more responses in some disciplines than others, particularly in the social sciences. We therefore weighted the number of responses to match our desired distribution when reporting descriptive statistics for the entire population.
We report our results using visualizations in combination with descriptive and inferential statistics. To aid comparisons between disciplines, we begin each section of the findings with a figure visually summarizing results with significant differences between disciplines. We then provide figures summarizing our data at the level of the entire population in addition to narrative descriptions of overall trends and significant disciplinary differences. A synthesis figure with all statistically significant results is in Supplementary material A.
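The weighting formula is not spelled out in the text. As a minimal sketch, assuming simple post-stratification weights (target share divided by observed share for each discipline), a weighted population-level percentage could be computed as follows. All counts and target shares here are hypothetical and are not taken from Table 2.

```python
def poststratification_weights(observed_counts, target_shares):
    """Weight per stratum = target share / observed share (assumed scheme)."""
    total = sum(observed_counts.values())
    return {
        stratum: target_shares[stratum] / (count / total)
        for stratum, count in observed_counts.items()
    }


def weighted_percentage(positive_counts, observed_counts, weights):
    """Percentage of positive answers after applying stratum weights."""
    weighted_positive = sum(positive_counts[s] * weights[s] for s in positive_counts)
    weighted_total = sum(observed_counts[s] * weights[s] for s in observed_counts)
    return 100 * weighted_positive / weighted_total


# Hypothetical example: one discipline is overrepresented among respondents.
observed = {"social_sciences": 60, "natural_sciences": 40}
target = {"social_sciences": 0.5, "natural_sciences": 0.5}
positive = {"social_sciences": 30, "natural_sciences": 30}

weights = poststratification_weights(observed, target)
overall = weighted_percentage(positive, observed, weights)
```

In this toy case the unweighted figure would be 60%, while weighting toward the target distribution shifts the estimate because the underrepresented group answered positively more often.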

Limitations
This study has limitations regarding a potential sampling bias, the questionnaire design, and our chosen analysis methods. Although we used random sampling techniques to recruit a variety of researchers, respondents interested in the topics of data citation and reuse who are confident in their ability to complete an English-language survey would be more likely to respond. Our sample consists of researchers who have published in journals indexed in WoS, a database which has its own biases. Certain disciplinary domains, such as the humanities, are underrepresented in WoS, as are researchers from the Global South and those who do not publish in English (Mongeon & Paul-Hus, 2016; Petr, Engels et al., 2021; Sugimoto & Larivière, 2018). Although these limitations are a source of sampling bias, drawing from this population also allowed us to target our desired population of researchers across domains. A further limitation in our analysis could be due to the lack of granularity in the FOS classification system, which we use to report our results.
Responses indicate self-reported behaviors and attitudes, which could be affected by a desire to give socially acceptable answers. Responses were also influenced by the options to questions which we provided. To counter this, we designed our questions based on past research and provided open-ended response options for questions. Responses could also have been impacted by individual interpretations and the ordering of the questions. Our two-phase review of the questionnaire helped to address some of these limitations. Additionally, a list of definitions for terms was provided at the beginning of the questionnaire and was linked to on every page of the survey. Terms included data reuse, defined as "using data which others have created, for any purpose," and data sharing, defined as "making your data available to others, e.g., in a data repository." The full list of terms is provided in the survey questionnaire (Gregory, Ninkov, Peters et al., 2022).

Ethics and Data Availability
We received ethical approval from the University of Ottawa for the study under number S-08-21-7283. The anonymized data from this survey are available under a CC-BY-4.0 license (Ninkov et al., 2023).

FINDINGS
We begin by contextualizing our results with a description of the demographics of respondents and their reported data reuse practices. We then present our findings regarding data citation and mentioning practices, citation motivations, and respondents' preferences for their own data. To facilitate understanding our narrative results, we begin sections with tables summarizing statistically significant responses by discipline.

Reflecting our sampling and recruitment strategy, the majority of respondents are from the natural sciences and are in middle to senior career stages (Figure 1). Respondents primarily work in universities, followed by research institutions; most work in North America or Europe/Central Asia. Roughly two-thirds of respondents self-identify as men (66.2%) and one-third as women (31.5%).

Data Reuse Practices
Figure 2 summarizes statistically significant results in questions related to data reuse practices.
The majority of respondents report reusing data (81.3%) and sharing their own data (81.0%). This indicates a potential self-selection bias in our sample towards people who share and reuse data. Roughly three-quarters of respondents (71.9%) also reported reusing their own data multiple times. There is a significant difference but small association in data reuse according to academic discipline (χ²(5, n = 2,492) = 27.18, p < .001, V = .104), with researchers in engineering and technology reusing data more and those in agricultural sciences less than expected, compared with other disciplines.

Types of data and types of data reuse
Across disciplines, there is a tendency for respondents to reuse both quantitative and qualitative data more than either data type alone (Figure 3). A significant difference with a medium association between disciplines was also detected (χ²(15, n = 2,026) = 155.04, p < .001, V = .160), where social scientists reuse quantitative data more than expected, and researchers in the humanities use qualitative data more than expected.
We also asked how frequently researchers reuse data for the different purposes proposed in Gregory et al. (2020). Significant differences were detected for this question (Figure 2, question 4), where researchers in engineering and technology more frequently reuse data as model, algorithm, or system inputs; to calibrate instruments; or for verification purposes than do other disciplinary groups. Using data to identify trends and make comparisons or predictions is more frequently done by researchers in the humanities, natural sciences, and engineering and technology. Researchers in the natural sciences and, to a lesser extent, the humanities, more frequently integrate data to create new data sets than do researchers in other domains. These results suggest that data integration is influenced by a researcher's disciplinary domain and highlight that not every discipline engages in data integration at the same frequency.
There was no significant difference between disciplines for two types of data reuse: using data as the basis for a new study (H(5) = 7.115, p = .212) and using data in teaching (H(5) = 7.657, p = .176). This indicates that these types of data reuse are done with the same frequency levels (sometimes or often) across disciplines, which supports the preliminary findings of Gregory et al. (2020).

Nonreusers of research data
Roughly one-fifth of survey respondents (n = 466) do not reuse data in their work. We specifically asked these respondents to indicate their reasons for not reusing data (Figure 4). Across disciplines, the most frequently selected option was that reusing data was not relevant to respondents' research methods, although significant differences with a medium association between disciplinary groups for this option were identified (χ²(5, n = 466) = 20.989, p < .001, V = .212). A total of 81.3% of nonreusers in humanities state that reusing data is not relevant to their research methods; this was not a reason selected as often by researchers in the social sciences, agricultural sciences, or medical and health.
Another significant difference between disciplines is tied to a lack of available relevant data for nonreusers. A medium association was detected for respondents in the social sciences, who selected this option more than other disciplines (χ²(5, n = 466) = 11.768, p = .038, V = .159). Difficulties finding data are more of a barrier to reusing data in the social sciences and engineering and technology.
We did not detect significant disciplinary differences for many of the options to this question. One of the most common reasons (31.6%) to not reuse data across disciplines was that it is not normal practice in respondents' communities. Similarly, only slightly more than one quarter of respondents to this question indicated that they get more credit for creating their own data than for using other people's data. A lack of trust in data also does not appear to be a reason for many researchers not to reuse data, with only 11.6% selecting this option, nor does a lack of awareness about how to credit the data of others.

Citing and Mentioning Data: Practices, Motivations, and Preferences
This section presents findings related to respondents' reported practices for citing and mentioning data; their data citation motivations; and their preferences for how others can acknowledge their own data.

Citing and mentioning practices
We asked respondents which data objects they refer to in publications, as well as to describe the methods with which they do so (i.e., by including a data citation, a mention in the footnote or body of text, or an indirect citation (Figure 5)). We also asked respondents about their awareness and use of data citation standards. Significant differences for this question are reported in Figure 8.
A total of 77.7% of data reusers indicated that they often or always cite or mention another publication in which the data have been analyzed (Figure 6). Respondents also frequently selected that they often or always refer to the source of the data (70.7%); referring to the data themselves was the third most frequently selected, with 58.3% of respondents across disciplines reporting that they either often or always cite or mention data. Significant disciplinary differences were detected for how frequently respondents refer to two types of data objects: publications analyzing data and data papers. Social scientists cite or mention publications in which data have been previously analyzed less frequently than other disciplinary groups. Both social scientists and humanities researchers refer to data papers less frequently than other disciplines, particularly those in engineering and technology and natural sciences. No significant disciplinary differences were detected for how often respondents cite or mention the data themselves; referring to the data source is also a common practice for respondents across disciplinary groups.
Respondents indicated the frequency with which they employ various methods to refer to data (Figure 7). Across the sample, respondents report often including a citation to related papers, although a significant disciplinary difference was identified for this option (H(5) = 61.877, p < .001). Researchers in engineering and technology more frequently cite or mention related papers, and social scientists engage in this practice the least, compared to other disciplinary groups. This tendency is supported by our previous finding regarding the types of data objects cited by social scientists (Figure 6).
Another oft-reported practice is to include citations to data in reference lists (Figure 7). We did not detect a significant disciplinary difference for this option. This finding is also supported by the results of a separate question, in which 69.0% of data reusers across disciplines stated that they cite data in a reference list and 24.0% stated that they sometimes do so.
Significant differences were identified for referring to data in footnotes (H(5) = 116.581, p < .001) and for referring to data in the body of a publication (H(5) = 16.980, p = .005). Perhaps reflecting common practices of citing academic literature, humanities researchers more frequently refer to data using footnotes than other disciplines. Social scientists most frequently refer to data throughout the body of a publication, which supports the findings from van de Sandt (2021).
All respondents are generally unaware of and do not use many citation standards that have been developed specifically for data (e.g., those developed by DataCite or scientific societies (Figure 8)). Respondents report being most aware of data citation standards created by journals and publishers or those included in long-standing citation guidelines (e.g., APA or MLA). If respondents are aware of guidelines, they tend to use them. Significant disciplinary differences were identified for respondents' awareness and use of data citation standards. Social scientists, for example, were less aware of all citation standards, with the exception of standards from citation style guides, which they are aware of and use more than expected. Other disciplinary groups have greater awareness and use of other recommendations for data citation, particularly natural sciences and agricultural sciences, who are aware of and use recommendations issued by DataCite, repositories, and scientific societies more than expected.
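The H statistics reported in this subsection come from Kruskal-Wallis tests, which compare mean ranks of ordinal responses across groups; with six disciplinary groups the test has five degrees of freedom. The sketch below is a minimal pure-Python illustration of the tie-corrected H statistic, not the analysis code used for the survey. The function name and the frequency ratings (coded 1 = never to 5 = always) are hypothetical.

```python
from collections import Counter


def kruskal_wallis_h(groups):
    """Return the tie-corrected Kruskal-Wallis H statistic for a list of samples."""
    pooled = [x for g in groups for x in g]
    n_total = len(pooled)
    counts = Counter(pooled)

    # Assign each distinct value the average of the ranks it occupies,
    # so that tied answers (common with rating scales) share one rank.
    mid_rank = {}
    position = 0
    for value in sorted(counts):
        t = counts[value]
        mid_rank[value] = position + (t + 1) / 2
        position += t

    # Sum over groups of (rank sum squared / group size).
    term = sum(sum(mid_rank[x] for x in g) ** 2 / len(g) for g in groups)
    h = 12 / (n_total * (n_total + 1)) * term - 3 * (n_total + 1)

    # Tie correction: divide by 1 - sum(t^3 - t) / (N^3 - N).
    ties = sum(t ** 3 - t for t in counts.values())
    correction = 1 - ties / (n_total ** 3 - n_total)
    if correction == 0:
        raise ValueError("all observations are identical; H is undefined")
    return h / correction


# Hypothetical ratings for three groups (1 = never ... 5 = always).
humanities = [5, 4, 5, 3, 4]
social_sciences = [3, 2, 3, 4, 2]
engineering = [1, 2, 1, 2, 3]
print(kruskal_wallis_h([humanities, social_sciences, engineering]))
```

The statistic is compared against a χ² distribution with (number of groups − 1) degrees of freedom to obtain a p value; that step is omitted here to keep the sketch free of external dependencies.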

Motivations for citing data
We asked respondents who explicitly said that they cite data in a reference list about their motivations for doing so. Figure 9 summarizes statistically significant results for the relevant questions.
Overall, motivations that reflect ideal scientific best practices (i.e., to show intellectual debt, to assist others in locating data, or to support the validity of research claims) were selected more frequently than external reasons (Figure 10). Some 8.4% of respondents to this question stated that they cite data because they were advised to (e.g., by journals or publishers).

Significant disciplinary differences, although with small associations, were found for three motivations for data citation. Citing data as a way of demonstrating intellectual debt (χ²(5, n = 1,876) = 25.497, p < .001, V = .117) was selected more frequently than expected by social scientists and humanities respondents. Facilitating data discovery (χ²(5, n = 1,876) = 15.803, p = .007, V = .092) was selected more often than expected by researchers in the social sciences, medical and health sciences, and humanities. Using data citations to indicate data usage (χ²(5, n = 1,876) = 22.062, p < .001, V = .108) was particularly important for researchers in the humanities and medical and health sciences. No significant disciplinary difference was detected for respondents who cite data to reward data providers; respondents across disciplines were roughly evenly split between those who selected this option and those who did not.
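Statements that an option was selected "more than expected" compare observed counts with the counts expected if discipline and response were independent. The sketch below shows one common way to make this comparison, using adjusted standardized residuals. The text does not state which residual was used in the analysis, so this choice, the function name, and the counts are all assumptions for illustration.

```python
import math


def adjusted_residuals(table):
    """Return (expected counts, adjusted standardized residuals) for a contingency table.

    `table` is a list of rows of observed counts, e.g. one row per discipline
    and one column per outcome (selected / not selected).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    expected, residuals = [], []
    for i, row in enumerate(table):
        exp_row, res_row = [], []
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            e = row_totals[i] * col_totals[j] / n
            # Standard error used for the adjusted residual.
            se = math.sqrt(e * (1 - row_totals[i] / n) * (1 - col_totals[j] / n))
            exp_row.append(e)
            res_row.append((observed - e) / se)
        expected.append(exp_row)
        residuals.append(res_row)
    return expected, residuals


# Hypothetical counts: rows are three disciplinary groups,
# columns are "selected" and "not selected" for one motivation.
observed = [[60, 40], [45, 55], [30, 70]]
expected, residuals = adjusted_residuals(observed)
for exp_row, res_row in zip(expected, residuals):
    print([round(e, 1) for e in exp_row], [round(r, 2) for r in res_row])
```

A positive residual in the "selected" column means that a group chose the option more often than independence would predict; values beyond roughly ±2 are conventionally read as notable deviations.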
In a separate question, more than half of respondents across disciplines report citing their own data when they use data again. Respondents do not commonly cite data when criticizing or correcting the data of others (26.7%) or when correcting errors in their own data (21.2%). One notable exception to this is in the humanities, where respondents cite data in order to criticize the work of others much more than expected.

Preferences for respondents' own data
We asked all respondents a series of questions regarding their preferences for how they would like their own data to be cited or mentioned. Figure 11 summarizes statistically significant differences between disciplines for these questions. The overwhelming majority of all respondents (98.5%) would like other people to refer to their data in some way. Mirroring the question design in Section 4.2.1, we asked respondents about their preferences for both types of data objects and referencing methods.
Across the sample, respondents prefer that others cite or mention a publication analyzing the data (84.3%) compared to other options, such as referring to the source of the data (55.3%) or the data themselves (46.3%) (Figure 12). Significant differences between disciplines were detected for the types of data objects that respondents prefer others to cite or mention (Figure 12). Social sciences and humanities are the only disciplines preferring that others cite or mention the data themselves more than expected. Respondents in all disciplines would like others to refer to a publication analyzing their own data; no significant difference was detected for this option.
A total of 72.5% of respondents chose more than one option for this question. Across disciplines, respondents frequently selected related publications and data sources together (Figure 13). In the medical and health, natural, and social sciences, related publications and data were also often chosen in conjunction. This suggests that respondents prefer that others cite multiple data objects to indicate the reuse of their data.
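Co-selection patterns of this kind can be derived from multiple-response answers by counting how often each pair of options appears in the same response. The following is a minimal sketch of that counting step; the option labels and responses are hypothetical and do not reproduce the survey data or its weighting.

```python
from collections import Counter
from itertools import combinations


def co_selection_counts(responses):
    """Count how often each unordered pair of options is selected together."""
    pairs = Counter()
    for selected in responses:
        # Sort so that each pair has one canonical (alphabetical) order.
        for a, b in combinations(sorted(set(selected)), 2):
            pairs[(a, b)] += 1
    return pairs


def share_multiple(responses):
    """Share of responses that include more than one option."""
    return sum(len(set(r)) > 1 for r in responses) / len(responses)


# Hypothetical multiple-response answers, one set of options per respondent.
responses = [
    {"publication", "source"},
    {"publication", "source", "data"},
    {"publication"},
    {"data", "publication"},
]
print(co_selection_counts(responses).most_common(3))
print(share_multiple(responses))
```

Each pair count could then be normalized by the number of respondents to obtain the proportions shown in a co-occurrence matrix.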
There is a preference among all respondents for others to include a citation of some sort in a reference list, be that a citation to the data themselves (71.3%) or to a related publication (69.5%) (Figure 14). This seems to stand in contrast to our findings about data objects. Although respondents do not strongly prefer that others cite/mention the data themselves (Figure 12), they do want others to use a data citation (Figure 14). One explanation could be that respondents consider citations to other data objects (i.e., data sources) to constitute data citations. The findings in both Figure 12 and Figure 14 demonstrate that respondents across disciplines prefer others to cite related publications.
Significant disciplinary differences were identified for nearly every option to this question. Researchers in engineering and technology prefer data mentions in figures, captions, and tables and to have indirect citations to related papers more frequently than expected. Researchers in the humanities do not. Humanities and social science respondents prefer the use of data mentions in footnotes (χ²(5, n = 2,455) = 131.739, p < .001, V = .232); this is the strongest association that we detected for this question. Social scientists also prefer that their data be mentioned in the body of publications more than other disciplinary groups.

DISCUSSION
This paper presents findings from a survey explicitly investigating data reuse and citation practices using a carefully constructed representative sample of researchers by discipline, as represented in WoS. We explored questions about the frequency of types of data reuse across disciplines and reasons why researchers do not reuse data. We examined researchers' reported practices and motivations for citing and mentioning data and also investigated respondents' preferences for how they would like their own data to be cited.
Although we found many disciplinary differences, our results particularly highlight differences in the social sciences and humanities (SSH). We therefore discuss our findings in two broad clusters, beginning with SSH researchers and then addressing other disciplinary groups.

Commonalities between SSH researchers
Social sciences and humanities researchers share some practices and preferences regarding data reuse and citation. Compared to other surveyed disciplines, SSH respondents are slightly more likely to reuse their own data than to share them with others. The reuse of one's own data or "material" is common practice in the humanities and some areas of qualitative social sciences, where a particular object, corpus, or ethnographic study can be used as data throughout a researcher's career (Borgman, 2015). Our results also indicate that it may be common for social scientists to reuse their own quantitative data (Figure 2). SSH are also the only disciplinary groups who prefer that others cite or mention their own data, as opposed to other data objects. This preference contrasts with the citation practices of SSH researchers documented in scientometric work, where SSH scholars cite "data studies," rather than individual data files (Robinson-García et al., 2016), likely representing the varied ways in which researchers define data (Borgman, 2015; Leonelli, 2015).
Both disciplines also cite or mention data papers less frequently. This reflects the slower emergence of data papers and journals in SSH (Candela, Castelli et al., 2015) and possibly a history of using data from governmental sources, where data papers may not be as relevant. There is some evidence that the landscape of data papers in SSH may be changing, and that data papers may have an effect on metrics of associated papers and data (McGillivray, Marongiu et al., 2022).
Both disciplinary groups cite data using footnotes and prefer to have their own data mentioned in footnotes more than other disciplines, although this practice is stronger in the humanities. This reflects the long-standing practice of using footnotes as a way of referencing, particularly in the humanities (Hammarfelt, 2012; Ochsner, Hug et al., 2016), and the tendency among social scientists in our results to mention data throughout a publication, which has also been documented in previous studies (Moss & Lyle, 2019; van de Sandt, 2021).
Respondents across the sample indicated that they cite data to acknowledge intellectual debt; our results suggest that this is a particularly important motivation for SSH researchers. Referring to data as "intellectual building blocks" may be a factor of the purposes for which SSH respondents reuse data (i.e., as the basis for a new study or to integrate (literature) sources to build an argument). Acknowledging intellectual debt via citation is an established motivation for citing literature (Garfield, 1965; Merton, 1973); it could be that when a researcher's data is literature, as is the case in some areas of humanities research, literature and data citation motivations are also intertwined.
SSH researchers, as well as those in medical and health sciences, cite data as a way to help others to locate and access data. Although this may be a motivation for data citation, the actual practice of many social science researchers may impede this goal. Mentioning data throughout the body of a publication or using incomplete data references (Banaeefar et al., 2022) may hinder automated forms of data discovery, which rely on or recommend the use of data citations with PIDs in reference lists (Data Citation Synthesis Group, 2014; Wilkinson, Dumontier et al., 2016). Recent efforts exploring alternative methods for automatically linking and discovering data from within the body of publications (see Lane et al., 2020) are more in line with the practices of social science researchers.

Social sciences: Unique practices
Although our findings demonstrate similarities among SSH researchers, we also find differences between social scientists and those in the humanities. Social science researchers report most often reusing quantitative data. This supports the findings of Fear (2013), documenting the prevalence of reusing numerical or statistical data created through social research methods, and harkens to the long-standing debate about the reuse of qualitative data within the social sciences (Bishop & Kuula-Luumi, 2017; Curty, 2016). Compared to other respondents who do not reuse data, social scientists indicated that challenges with data discoverability and a lack of available relevant data on certain research topics may inhibit data reuse.
In contrast to the other disciplines, social scientists cite or mention publications in which data have been previously analyzed less frequently. Instead, social science researchers tend to cite data objects throughout a publication, as found in studies at ICPSR (Banaeefar et al., 2022; Moss & Lyle, 2019; van de Sandt, 2021), and prefer that others do this as well. Social scientists are also most aware of and use data citation standards issued by long-standing citation style guides (e.g., APA). This could indicate a tendency for researchers to use standards with which they are already familiar, or it could signal a conflation among respondents between data and literature citation standards, given the recency of APA data-specific guidelines (American Psychological Association, 2022).

Humanities: Unique practices
Although the majority of humanities respondents reuse both quantitative and qualitative data, humanities researchers reuse qualitative data much more than other disciplinary groups. We also see that in many cases, humanities respondents indicated doing the opposite of other disciplines, as also noted by Cannon, Grant, and McKellar (2022), engaging in practices that may be rooted in specific research methodologies.
Along with the natural sciences, humanities researchers reuse data to integrate different data sources, identify trends, and make comparisons more frequently than other disciplinary groups. This could indicate the use of digital methods among humanities respondents, but it could also be a sign of a tradition of bringing together and comparing different sources, both digital and analog, to make research claims.
Scientometric studies have suggested that self-citation may be common in existing data citations (Park & Wolfram, 2017). More than any other disciplinary group, humanities researchers cite their own data in order to make corrections; along with natural sciences researchers, humanities scholars also cite their own data in order to build on their past work. Again, perhaps reflecting critical research and discourse methods, respondents from the humanities cite data to both criticize and correct the data of others more than other disciplines.

Agricultural Science, Natural Sciences and Engineering and Medical and Health Science
We have discussed many of the disciplinary differences identified in our results from the standpoint of SSH researchers. The practices, preferences, and motivations of researchers in other disciplinary groups also share commonalities and have some differences. Agricultural science, natural sciences, engineering and technology, and medical and health sciences are similar when it comes to the type of data they reuse, all reusing both quantitative and qualitative data. Although the majority of all survey respondents report sharing their own data, those in natural sciences do so more compared to other disciplines, supporting scientometric work in this area (Ninkov et al., 2022; Robinson-Garcia et al., 2017).
Building on the results of earlier work (Gregory et al., 2020), we also see strong reflections of disciplinary methodologies in the frequency of reusing data. In addition to the differences discussed in Section 5.1.1, our results show that natural sciences and engineering and technology researchers most frequently reuse data to calibrate instruments, to verify their own data, or as model, algorithm, or system inputs.
Across our sample, respondents report most frequently citing or mentioning an article analyzing the data, compared to other data objects (Figure 6). Researchers in agricultural sciences, natural sciences, and engineering and technology engage in this practice more than other disciplines. Engineering and technology researchers also report citing data in figures, tables, and graphs, a practice that mirrors how these researchers would prefer other people to refer to their own data. This reflects a link to how researchers discover data from the literature (Pepe et al., 2014), where they also draw data for reuse from figures or captions.
Although we found disciplinary differences in motivations for citing data, the strength of the detected associations was small. Engineering and technology are often situated on the opposite side of the spectrum from humanities in terms of citation motivations. Engineering and technology researchers, as well as those in agricultural sciences, do not cite data as often to acknowledge intellectual debt, to help others locate data, or as a sign of data use. We hypothesize that these differences in motivations could be linked to different reasons for reusing data and to associated research methods. Common data uses in these disciplinary groups (as model, algorithm, or system inputs; to calibrate instruments; or for verification purposes) may not be seen as meriting an acknowledgement of "intellectual debt," but may rather be so standard that they are seamlessly integrated into research workflows.

CONCLUSION: CONSIDERATIONS FOR TRACING DATA REUSE
This study sheds light on relationships between data citation and data reuse, while also providing insight into why researchers cite, or do not cite, data in their academic work. Our results contextualize the broader development of research data services and have implications for efforts to trace signals of data reuse (e.g., in the development of data metrics). We conclude by highlighting three points for consideration when tracing data.
6.1. Data "Citation" Is Varied and Differently Interpreted
Our results show that respondents from all disciplines reuse data for various purposes in research and teaching. However, the survey also reveals that this reuse of data is reflected via a variety of mechanisms in publications, including data mentions and indirect citations to related literature. At the same time, the vast majority of data reusers responding to our survey state that they cite data in reference lists, suggesting that researchers may have different interpretations of what it means to "cite" data, and that they may construe these different mechanisms as valid and appropriate forms of data citation. We also see signs that researchers prefer that others reference a combination of different data objects to indicate data reuse (Figure 13).
These types of variations in practice and preference contrast with efforts that have gained momentum in the scholarly infrastructure space, such as those of data repositories and organizations such as DataCite, which encourage the standardized citation of data in one location (reference lists) and the use of PIDs for individual data sets. Relying solely on data citations to trace signs of data reuse potentially disadvantages researchers who are engaging in what they see as best practice, particularly if such signals are incorporated into systems of academic recognition and reward.
As seen in our findings, citing and mentioning data are shaped by discipline-specific practices, standards, and research cultures. These practices seem to be rooted in long-standing traditions of indicating use in certain ways (e.g., via footnotes) and of referring to certain objects, particularly academic publications. The power of disciplinary and academic norms, including those of reward systems based on literature citation metrics, may impede citing data in reference lists. As literature citations are the primary currency in academic reward systems, researchers may be loath to cite data rather than publications. Researchers may also find that their current practices meet the needs and expectations of their disciplinary communities.
At the same time, survey responses from individuals who do not reuse data suggest that data citations could help to counter some barriers to data reuse, such as in facilitating data discovery. This juxtaposition raises a central question. When should research practice be adapted to current technical requirements and recommendations for data citation, and when should requirements and recommendations be adapted to reflect actual practice? Addressing this question requires long-term engagement with research communities and disciplinary debate.

Acknowledging Data Reuse Is Complex
Another key insight derived from our results is that acknowledging data reuse in academic work may be more complex than acknowledging the reuse of ideas, methods, and other knowledge present in academic publications. Research data are extremely diverse and exist at different levels of granularity in different formats (Peters, Kraker et al., 2017). Although different formats for communicating scholarly knowledge have been developed (Priem, 2013), such knowledge is often transmitted via standardized formats (i.e., journal articles, book chapters, conference proceedings), perhaps facilitating more homogeneous methods of citation.
Our results suggest that researchers cite or mention data for reasons related to ideal good research practices and that they are not motivated by external recommendations (e.g., from journal publishers). We also see that if they are aware of such citation guidelines, they tend to use them. As suggested by Banaeefar et al. (2022), this could indicate that although researchers are willing to cite data, they are still developing norms about when it is appropriate to acknowledge data reuse via a citation and when it is not.
Taken together, these points for consideration highlight that data citation is complex, local to different disciplines and communities, and tied to existing research practices and systems of recognition and reward. Although we have explored data citation practices, preferences, and motivations across disciplines, there is still much work to be done. Future work is needed to examine data citation practices according to other characteristics (e.g., by academic career stage) and to conduct more in-depth qualitative studies. Additionally, it is important to consider how to advance and adapt the development of metrics, policies, and recommendations that incorporate data citations, particularly those related to rewarding individual data sharing and reuse.

Figure 1. Percentage of respondents' A: disciplinary domains (n = 2,492), B: geographic regions of employment, C: years of experience, and D: employment institution. Percentages (except for discipline) have been weighted.

Figure 2. Summary of statistically significant results by discipline for questions related to data reuse. Blue indicates a result greater than the average of the reporting statistic for each question; red indicates a result less than the average. Darker shades indicate larger deviations from the average. Reported mean ranks for Question 4 support the results of the Kruskal-Wallis tests.

Figure 3. Types of data reused by participants, showing percentages of respondents in each discipline.

Figure 4. Reasons for not reusing data (multiple responses possible). Options with significant differences are indicated in blue. Percentages are based on weighted number of respondents answering this question (n = 466).

Figure 6. Data objects. Frequency of citing or mentioning various data objects. Options with significant differences are indicated with an asterisk. Percentages are based on weighted number of respondents per discipline answering this question (n = 2,026). Bars are arranged around the middle (50% mark) of the "sometimes" category. N/A responses are not shown.

Figure 5. Summary of statistically significant results by discipline for questions related to data objects and citation/mentioning methods. Blue indicates a result greater than the average of the reporting statistic; red indicates a result less than the average. Darker shades indicate larger deviations from the average. Reported mean ranks for Question 6 support the results of the Kruskal-Wallis tests.

Figure 8. Awareness and use of data citation standards. Options with significant differences are indicated with an asterisk. Percentages are based on weighted number of all respondents (n = 2,492).

Figure 7. Methods. Frequency of methods used to cite or mention various data objects. Options with significant differences are indicated with an asterisk. Percentages are based on weighted number of respondents answering this question (n = 2,026). Bars are arranged around the middle (50% mark) of the "sometimes" category.

Figure 9. Summary of statistically significant results by discipline for questions related to citation motivations. Blue indicates a result greater than the average of the reporting statistic; red indicates a result less than the average. Darker shades indicate larger deviations from the average.

Figure 11. Summary of statistically significant results by discipline for questions related to citation preferences for respondents' own data. Blue indicates a result greater than the average of the reporting statistic; red indicates a result less than the average. Darker shades indicate larger deviations from the average.

Figure 10. Motivations for citing data (multiple responses possible). Options with significant differences are indicated in blue. Percentages are based on weighted number of respondents answering this question (n = 1,884).

Figure 12. Data objects. Preferences for how respondents would like others to refer to their own data (multiple responses possible). Options with significant differences are indicated in blue. Percentages are based on weighted number of respondents (n = 2,454).

Figure 14. Methods. Preferences for how respondents would like others to refer to their own data (multiple responses possible). Options with significant differences are indicated in blue. Percentages are based on weighted number of respondents answering this question (n = 2,454).

Figure 13. Multiple data objects. Preferences for how respondents would like others to refer to their own data. Dark blue indicates objects most often selected together. Dark red indicates those least frequently selected together.

6.2. Data Citation Is Rooted in Other Practices. When Do We Meet Researchers Where They Are?

Table 1. Definition and explanation of data citation terms

Table 3. Statistical tests of significance used in the analysis

Table 2. Summary of sampling researchers by disciplinary classification