Data availability statements can provide useful information about how researchers actually share research data. We used unsupervised machine learning to analyze 124,000 data availability statements submitted by research authors to 176 Wiley journals between 2013 and 2019. We categorized the data availability statements, and looked at trends over time. We found expected increases in the number of data availability statements submitted over time, and marked increases that correlate with policy changes made by journals. Our open data challenge becomes to use what we have learned to present researchers with relevant and easy options that help them to share and make an impact with new research data.
This study looks at data availability statements submitted to Wiley journals for useful information about how researchers actually share research data. This is important because researchers, particularly those with grant funding, are increasingly required to share the data they create [1, 2]. Our challenge becomes to present researchers with relevant and easy options that help them to share and make an impact with new research data.
2. RELATED WORK
More than a decade ago the US National Institutes of Health (NIH) said “data sharing is essential for expedited translation of research results into knowledge, products, and procedures” and began requiring data sharing plans in all grants greater than US$500,000. Since 2018, the General Office of the State Council in China has required that “all scientific data derived from the science and technology plans … shall be archived in the relevant science data centers”. The State Council governs the funding made available by the Chinese Academy of Sciences and also the Ministry of Science and Technology, under which sits the National Natural Science Foundation of China. The EUR100 billion Horizon Europe funding program from the European Commission will challenge funded researchers to deliver “open access to publications, data, and to research data management plans” starting in 2021. Private not-for-profit funders, like the Wellcome Trust and the Bill & Melinda Gates Foundation, are among those with the most progressive requirements. Funders argue that sharing data creates more value and impact from every grant they award and enhances trustworthiness and potential reproducibility. This is the open data challenge: sharing data first, and then realizing the value from doing so.
While the arguments for data sharing are compelling and the rhetoric often exciting (“researchers are creating, gathering and using data in hitherto unimagined-volumes”), reservations have been expressed that reflect researchers' concerns, including the absence of infrastructure and incentives, and the presence of disincentives such as the fear of getting scooped and concerns about misinterpretation or misuse of shared data [7,8]. Whatever your position in those arguments, it is fair to say that the open data challenge is a big challenge, and it is important to recognize that communities of researchers are ready to meet it in different ways and to different degrees. For example, life scientists have been reported as the most ready to share the data they create [10,11], and early career researchers may be particularly well-prepared to do so [12, 13]. In fact, researchers in most data-oriented disciplines have embraced the challenge to an extent, and even where research data are associated with complex ethical issues like consent and privacy, the obligation to share data is recognized. It helps that researchers can look forward to greater impact [15, 16]. It also helps that understanding of what “open” data really mean has become quite sophisticated, aided by the FAIR Data Principles and promoted with the much-used soundbite that research data should be “as open as possible, as closed as necessary”.
Looking at this through a publishing-focused lens, when researchers are ready to share data, publishers and journals can play a useful role in enabling and realizing the benefits. They help communicate and explain standards and expectations [19, 20]. They help researchers meet data sharing requirements, including those set by their funders. They increase the discoverability of shared data, perhaps 1000-fold (although that number may reflect correlation more than causation). They prompt researchers to “plan for the longevity, reusability, and stability of the data”.
Opinions about data sharing among researchers continue to be widely surveyed [24, 25]. Actual data sharing practices have been investigated by looking at data availability statements published in journal articles [26, 27, 28]. Data availability statements describe whether and how researchers' newly analyzed research data have been made available, and the conditions under which they can be accessed. When research authors have shared data in appropriate research data repositories, their data availability statements can include permanent identifiers to link between the journal article and the data. Studies of data availability statements have been used to assess the impact and increase the effectiveness of data sharing policies at journals [26, 27, 28], and to study how data sharing practices are changing. Some conclude that practices fall short of the study authors' expectations [26, 29, 30]. This reminds us how important it is for publishers and journals to set reasonable expectations, and to support those expectations with robust policy and process. It also speaks to the pace of change in different communities as they become familiar with and interested in sharing research data, as well as able and required to do so. Publishers and journals need to match that pace, and they can also lead change. Measuring, interpreting, and acting on data sharing trends ensures that publishers and journals continue to serve researchers well.
We used topic modeling, an unsupervised machine learning technique, to identify topics from 124,000 data availability statements submitted by research authors to 176 Wiley journals between 2013 and 2019. The complete workflow is available on GitHub. The workflow is managed with Snakemake.
Wiley's electronic editorial office systems allow for the inclusion of custom questions on a journal-by-journal basis. We first extracted all of the records that contained the term “data” in either the question or the answer, but then limited the selection to questions that mentioned “data availability” or “data accessibility”.
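The two-step selection described above can be sketched as follows. This is a minimal illustration and not the workflow's actual code: the record layout (dictionaries with hypothetical `question` and `answer` keys) and the case-insensitive matching are assumptions.

```python
# Hypothetical record layout: each record is a dict with "question" and
# "answer" keys. The real editorial-system export may look different.
PHRASES = ("data availability", "data accessibility")


def select_statements(records):
    """Keep answers to questions about data availability or accessibility."""
    selected = []
    for record in records:
        question = record["question"].lower()
        answer = record["answer"].lower()
        # Step 1: the term "data" must appear in the question or the answer.
        if "data" not in question and "data" not in answer:
            continue
        # Step 2: the question itself must mention one of the two phrases.
        if any(phrase in question for phrase in PHRASES):
            selected.append(record)
    return selected
```

Because both phrases contain the word "data", the second step alone would select the same records; the first step is kept only to mirror the order of operations described in the text.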
We then used spaCy to tokenize the answers, limiting the tokens to nouns, proper nouns, and adjectives. We also added custom stop words to ignore, such as “Wiley”, “url”, “et”, and “al”.
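The token filtering can be sketched as below. This is an illustrative sketch under stated assumptions rather than the workflow's code: it assumes lemmas are kept in lowercase, that spaCy's built-in stop words are removed alongside the custom ones, and the function name `filter_tokens` is hypothetical. It relies only on the `pos_`, `lemma_`, and `is_stop` attributes of spaCy tokens.

```python
# Universal part-of-speech tags for nouns, proper nouns, and adjectives.
KEPT_POS = {"NOUN", "PROPN", "ADJ"}
# The custom stop words named in the text (the full list is not given).
CUSTOM_STOP_WORDS = {"wiley", "url", "et", "al"}


def filter_tokens(doc, custom_stop_words=CUSTOM_STOP_WORDS):
    """Return lowercased lemmas of the nouns, proper nouns, and adjectives in doc.

    `doc` is any iterable of tokens exposing `pos_`, `lemma_`, and `is_stop`,
    such as a spaCy Doc.
    """
    kept = []
    for token in doc:
        lemma = token.lemma_.lower()
        if token.pos_ not in KEPT_POS:
            continue  # drop verbs, determiners, punctuation, and so on
        if token.is_stop or lemma in custom_stop_words:
            continue  # drop built-in and custom stop words
        kept.append(lemma)
    return kept


# Usage with spaCy (assumes an English model such as en_core_web_sm is installed):
#   import spacy
#   nlp = spacy.load("en_core_web_sm")
#   tokens = filter_tokens(nlp("Data are available from the Dryad repository."))
```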
Then, we used scikit-learn to create a term frequency-inverse document frequency (TF-IDF) matrix of the tokenized answers, followed by Latent Dirichlet Allocation (LDA). We initially used 20 topics to cluster the documents. We used pyLDAvis to visualize the topics estimated by the model.
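A minimal sketch of the TF-IDF and LDA steps with scikit-learn follows. The function name `fit_topic_model`, the fixed random seed, and the use of default hyperparameters are assumptions made for illustration; the settings used in the actual workflow are not stated here.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer


def fit_topic_model(token_lists, n_topics=20, random_state=0):
    """Fit TF-IDF followed by LDA on documents that are already tokenized.

    `token_lists` is a list of token lists, one per statement.
    Returns the fitted vectorizer, the fitted LDA model, and the
    document-topic weight matrix (one row per document).
    """
    # The documents are already tokenized, so the analyzer passes the
    # tokens through unchanged instead of re-tokenizing raw text.
    vectorizer = TfidfVectorizer(analyzer=lambda tokens: tokens)
    tfidf = vectorizer.fit_transform(token_lists)

    lda = LatentDirichletAllocation(n_components=n_topics, random_state=random_state)
    doc_topic = lda.fit_transform(tfidf)  # each row sums to 1
    return vectorizer, lda, doc_topic
```

Each row of `doc_topic` gives one statement's weights across the topics, and `lda.components_` holds the topic-term weights that a tool such as pyLDAvis can visualize.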
Finally, we labeled the topics where possible for further analysis and discussion, using Wiley's Data Sharing Policy Author Templates as a starting guide.
4. RESULTS AND ANALYSIS
Simply counting the number of answers to the custom questions that contain the term “data availability” or “data accessibility” shows a dramatic uptick in volume starting in early 2019 (Figure 1). This coincides with the rollout of Wiley's Expects Data policy, which added data availability statement requirements to more than 100 journals starting in December 2018.
Visualization of the topics estimated by the LDA (Figure 2) shows that the 20 topics initially chosen are reasonably spread out. As described in the original paper on LDAvis: “In this view, we plot the topics as circles in the two-dimensional plane whose centers are determined by computing the distance between topics, and then by using multidimensional scaling to project the intertopic distances onto two dimensions, as is done in . We encode each topic's overall prevalence using the areas of the circles, where we sort the topics in decreasing order of prevalence”. Lambda is a weighting parameter in the relevance metric that can be adjusted to alter the rankings of terms in order to aid topic interpretation. The keyword frequencies are ranked in the right panel for the complete topic model, and hovering over an individual topic shows how that topic compares to the complete model.
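To illustrate how lambda changes term rankings, the sketch below assumes the relevance definition from the LDAvis paper: relevance(w | t) = λ log p(w | t) + (1 − λ) log(p(w | t) / p(w)). The function names and the example probabilities are hypothetical.

```python
import math


def relevance(p_term_given_topic, p_term, lam):
    """Relevance of a term to a topic for a weight lam between 0 and 1.

    lam = 1 ranks terms by their probability within the topic;
    lam = 0 ranks terms by lift, the ratio of the within-topic
    probability to the overall probability of the term.
    """
    lift = p_term_given_topic / p_term
    return lam * math.log(p_term_given_topic) + (1 - lam) * math.log(lift)


def rank_terms(topic_probs, corpus_probs, lam):
    """Return the terms of one topic sorted by decreasing relevance."""
    scores = {
        term: relevance(p, corpus_probs[term], lam)
        for term, p in topic_probs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)
```

With lambda near 1, a term that is common everywhere (such as "datum") ranks highly in many topics; lowering lambda favors terms that are distinctive to one topic (such as "dryad").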
Selecting the number of topics in an LDA analysis is an iterative process, and there is no formula for predicting what will be the “best” number of topics. Too few topics and they will be too general; too many, and they will be too specific. In other projects, we have used the pyLDAvis tool shown in Figure 2 to evaluate how well a topic model probably covers the topic space, looking at how much overlap there is between topics. As shown in Figure 2, there are some clusters, but only a couple of topics with significant overlap (5 and 15, which we labeled as “Third-party restrictions” and “Genetics databases”). Based on our experience with other projects, this is a good starting point for a topic model.
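Overlap between topics can also be checked numerically. The sketch below measures the distance between two topics' term distributions with the Jensen-Shannon divergence; treating this as the intertopic distance is an assumption here (we believe it is the default in LDAvis, but the text does not say which distance was used). The function names and toy distributions are hypothetical.

```python
import math
from itertools import combinations


def kl_divergence(p, q):
    """Kullback-Leibler divergence between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon(p, q):
    """Jensen-Shannon divergence: 0 for identical distributions, log(2) at most."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def closest_topics(topic_term):
    """Return the pair of topic labels whose term distributions overlap most.

    `topic_term` maps a topic label to its term distribution, with all
    distributions defined over the same terms in the same order.
    """
    return min(
        combinations(topic_term, 2),
        key=lambda pair: jensen_shannon(topic_term[pair[0]], topic_term[pair[1]]),
    )
```

Pairs with a small divergence are candidates for merging or for a second look at the number of topics.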
After checking that the number of topics seemed reasonable, we labeled the individual topics by hand by reading the top 20 statements for each topic and making our best guess at what the cluster was about.
Some of the topics correspond well with one of the standard templates we encourage authors to use on our Data Sharing Policy page and were easy to label, such as “#5: Third-party restrictions”, which matched with “Data subject to third-party restrictions”.
Other labels were more problematic. “#6: Uncharacterizable” was a cluster that included experimental sections and actual data that the authors had copied and pasted into the Data Availability statement, perhaps highlighting the need for better author instructions. “#7: Mixed” had many different kinds of statements that the LDA algorithm with the given parameters had combined. Tuning the text preprocessing parameters or the LDA parameters (number of topics and other hyperparameters) might resolve this mixed topic.
Some labels are also repeated. Topics 8 and 9 are both examples of “Available on reasonable request”, although the LDA algorithm has resolved them into two separate topics based on the words contained in the statements themselves.
The percentages of each topic are shown in Figure 3.
We can visualize which topics correspond to which document, as shown in Figure 4. In Figure 4, each row corresponds to a single document, and each column to a topic. The intensity of the color corresponds to the topic weighting for that document as calculated by the model. We would expect that a document would correspond strongly to one or two topics. For example, Document 1 corresponds most strongly to Topic 11, while Document 2 corresponds most strongly to Topic 15, but also to Topic 16.
We can also visualize which words correspond most strongly to each topic, as shown in Figure 5. As expected, words like “datum” (the tokenized version of “data”), “available”, “reasonable”, and “request” correspond strongly to Topic 8, “Available on reasonable request”. Topic 19, “Available at repository with DOI”, is strongly associated with terms like “dryad” and “repository”.
Finally, we can visualize the trends in topic growth over time (Figure 6) by assigning each document to the highest-weighted topic. A significant number of statements (approximately 46,000) were not predicted to belong to any topic category, so we filtered these out. This could be investigated in future work – it might be noise spread out over the entire data set that could be reduced by tuning the model. We can see that after December 2018 the number of authors making data available on reasonable request (Topic 8) increased sharply.
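The assignment and counting step can be sketched as follows. How statements were judged to belong to no topic is not described, so the `min_weight` threshold below is a hypothetical stand-in for that rule, and the month strings and function names are likewise illustrative.

```python
from collections import Counter


def dominant_topic(weights, min_weight):
    """Return the 1-based number of the highest-weighted topic, or None.

    `min_weight` is a hypothetical cut-off: a document whose best topic
    weight falls below it is treated as belonging to no topic.
    """
    best = max(range(len(weights)), key=weights.__getitem__)
    if weights[best] < min_weight:
        return None
    return best + 1


def monthly_topic_counts(documents, min_weight):
    """Count documents per (month, topic) pair.

    `documents` is an iterable of (month, topic_weights) pairs, where
    month is a label such as "2019-01".
    """
    counts = Counter()
    for month, weights in documents:
        topic = dominant_topic(weights, min_weight)
        if topic is not None:  # filter out documents with no clear topic
            counts[(month, topic)] += 1
    return counts
```

Plotting these counts by month for each topic gives trend lines of the kind shown in Figure 6.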
5. DISCUSSION AND CONCLUSIONS
We see growth across all the topics we used to categorize data availability statements submitted to 176 journals between 2013 and 2019 (Figure 1).
Figure 3 and Figure 4 illustrate how our methods delivered results rather than offering analysis: respectively, they show the overall percentage of topics identified and the relationship of topics to the individual documents analyzed. Figure 5 shows the relationship of topics to the words used by authors, and offers readers a simple visual validation of our methods and results; for example, the intense yellow area above Topic 8 “Available on reasonable request” indicates strong relationships with the words reasonable, request, corresponding author, finding, study, available, and datum. Figure 6 shows trends in topics over time, and these are discussed in more detail below.
We see a particularly sharp increase in growth in early 2019 after launching the Expects Data policy which, for journals that adopt it, requires a data availability statement in every article. Implementing this requirement for data availability statements correlates with many more submitted data availability statements (as is to be expected of a successful implementation). It also correlates with many more declarations made by researchers that data are available on request (for example, Topic 8 in Figure 6, and the related Topics 4, 9, 10, 12, 14, and 20). This is an improvement over the absence of any statement about data. It seems reasonable to anticipate that as researchers become familiar with and interested in sharing research data, and able and required to do so, the high proportion of data availability statements categorized as Topic 8 (and the related topics listed above) will gradually be replaced by data availability statements that describe shared data (like Topics 1, 2, 3, 15, 17, 18, and 19).
For data that have been shared, Topic 19 is a good standard to aspire to. It indicates that data are shared in a repository with a permanent digital object identifier (DOI). The number of data availability statements categorized as Topic 19 shows steady growth over six years. This is reassuring, but Topic 19 does not show the sharp increase in 2019 that we might expect to correlate with the launch of our Expects Data policy. Several related topics that also describe data having been shared online (Topics 1, 2, 3, 15, 17, and 18) do show the expected sharp increase in early 2019. For Topic 19, consistent growth may be real and based on author behavior, or it may be an artefact of the analysis that we could investigate in future work.
Data that are available in genetics databases, per Topic 15, also show an interesting trend: steeper growth between 2014 and 2016; a distinct flat period between 2016 and 2018; and then steep growth in 2019, correlating with the launch of our Expects Data policy. This could be an area for future analysis. It is also interesting to note the continuing presence of and moderate growth in Topics 13 and 16, which indicate that data have been shared in journal supporting information.
To conclude, if our goal is simply to enable research authors to describe in their journal articles whether or not they have shared the new data they have created then this can be achieved using a policy that requires data availability statements. If our goal is to increase data sharing, then launching a policy and studying the data collected from it may also be valuable: it creates insights into how to enable and support better experiences for researchers, more data sharing, and higher-quality data sharing. For example, data from this study could help identify which kinds of articles without shared data are similar to those with shared data, and to which journals both are submitted. With that information we could design and launch supportive policies and services where they are more likely to be welcomed by researchers, and therefore where they are most likely to have a positive impact.
All authors, C. Graf (email@example.com), D. Flanagan (firstname.lastname@example.org), L. Wylie (email@example.com) and D. Silver (firstname.lastname@example.org) made substantial contributions to the design of this paper. D. Flanagan and L. Wylie contributed to the methodology part of this paper and all authors participated in the investigation and data collection. C. Graf and D. Flanagan wrote the first draft. All authors approved the version to be published and are accountable for the paper.
Thanks to Elisha Morris at Wiley for the literature search and analysis we used to write our introduction. Thanks to Yan Wu at Wiley for insights into data sharing requirements in China. Thanks to Gary Spencer at Wiley for useful discussions about author behavior and manuscript submission processes. Thanks to Alex Moscrop at Wiley for providing our data. Written collaboratively and preprinted using Authorea; thanks to Alberto Pepe and the Authorea team.