Abstract
Currently, there is limited research investigating the phenomenon of research data repositories being shut down, and the impact this has on the long-term availability of data. This paper takes an infrastructure perspective on the preservation of research data by using a registry to identify 191 research data repositories that have been closed and presenting information on the shutdown process. The results show that 6.2% of research data repositories indexed in the registry were shut down. The risks resulting in repository shutdown are varied. The median age of a repository when shutting down is 12 years. Strategies to prevent data loss at the infrastructure level are pursued to varying extent. Of the repositories in the sample, 44% migrated data to another repository and 12% maintain limited access to their data collection. However, neither strategy is a permanent solution. Finally, the general lack of information on repository shutdown events as well as the effect on the findability of data and the permanence of the scholarly record are discussed.
1. INTRODUCTION
With the amount of published research data steadily increasing (Benjelloun, Chen, & Noy, 2020), the long-term preservation of data sets is gaining importance, especially if research data are to be regarded as self-contained components of the scholarly record (Manghi, Mannocci et al., 2021). For this idea and data citation to succeed, continuous access to data sets is required: data sets can only become citable units if they are permanently available (Buneman, Dosso et al., 2021). Concerns about perpetual access to digital scholarly texts have resulted in the establishment of a distributed network of preservation services that is maintained jointly by various stakeholders (Mering, 2015). However, the adoption of these preservation services is slow compared to the growth in the number of academic journals, and some journals have been shut down and disappeared (Laakso, Matthias, & Jahn, 2021). Research data might be even more vulnerable, as the burden of long-term preservation rests predominantly on dedicated repositories; preservation systems comparable to those for scholarly texts are currently not widespread and can be difficult to realize (Kiefer, 2015).
Long-term preservation of research data requires continuous care of not only data sets but also of the repositories that hold them (Eschenfelder & Shankar, 2017). The TRUST Principles, a set of guiding principles for research data repositories formulated by a multistakeholder group, ask repositories to “ensure uninterrupted access to [their] valuable data holdings for current and future user communities” (Lin, Crabtree et al., 2020, p. 3). To meet these expectations, repository operators need to find solutions for infrastructure maintenance, governance, securing of funding, and continuity planning. The long-term operation of research data repositories presents a challenge, and sometimes, for varying reasons and despite best efforts, research data repositories are shut down.
Beyond case studies of individual repositories or selected disciplines, there is currently little research investigating the phenomenon of research data repositories being shut down and the impact this has on the long-term availability of data (Boyd, 2021). To address this issue, this paper identifies research data repositories of all types and from all disciplines that have been closed and presents information on the shutdown process.
Based on metadata from a registry of research data repositories and information collected from repository websites, this paper takes an infrastructure perspective on the long-term preservation of research data. The following questions are addressed:
How common is the phenomenon of research data repositories being shut down?
What are the risks that lead to repositories being shut down?
What measures are taken to prevent data loss when repositories are shut down?
The paper concludes with reflections on how repositories and registries can contribute to better documentation of repository shutdown and data migration.
2. BACKGROUND
2.1. Research Data Repositories
Infrastructures are “pervasive enabling resources in network form” (Bowker, Baker et al., 2010, p. 98). They permeate specific areas of life and support certain practices (van Laak, 2023). In the area of knowledge work, information infrastructures form “robust networks of people, artifacts, and institutions that generate, share, and maintain specific knowledge about the human and natural worlds” (Edwards, 2013, p. 17). Research data repositories are specialized information infrastructures that focus on the curation, preservation, and dissemination of research data (Boyd, 2021; Edwards, 2013; Johnston, Carlson et al., 2018). They facilitate data journeys—the movements of data from the sites of production to the sites of (re)use—and are central components for realizing the vision of habitual data publication, a fundamental open science practice (Austin, Bloom et al., 2017; Bates, Lin, & Goodale, 2016).
Although studies on researchers’ attitudes towards publishing their data in repositories are still inconclusive (Thoegersen & Borlund, 2021), there is evidence that the use of research data repositories to make data available has increased in recent years (Jiao, Li, & Fang, 2022; Khan, Thelwall, & Kousha, 2023). Research data repositories shape and are shaped by the communities they serve. They can become sites of scientific collaboration by enabling researchers to gather around data sets they jointly use (Costa, Qin, & Wang, 2014; Lafia, Fan et al., 2022; Qin, Hemsley, & Bratt, 2022), and might have to adapt their services if the designated community they serve shifts over time (Donaldson, Zegler-Poleska, & Yarmey, 2020). At a global level, the landscape of research data repositories has continually evolved. Driven by the ideal of promoting open access to scholarly publications, the number of repositories increased significantly between 2005 and 2012 (Pinfield, Salter et al., 2014), initially focusing on text publications. Later, infrastructures specializing in the management of research data gradually emerged, forming a global network of heterogeneous research data repositories (Kindling, Pampel et al., 2017). As of 2023, the global registry re3data lists more than 3,000 research data repositories. Different types of research data repositories have evolved to serve specific needs, for example institutional repositories that support members of research and higher education organizations (Arlitsch & Grant, 2018), discipline-specific repositories that address researchers from a particular research area (Banzi, Canham et al., 2019), or generalist repositories, which allow the storage of data irrespective of discipline (Stall, Martone et al., 2020).
Overall, the publication of research data appears to be characterized by concentration tendencies. For example, a recent study of a data discovery service found that only 20 repositories accounted for almost 80% of the total data sets indexed (Benjelloun et al., 2020). Most research data repositories are also operated by institutions located in Europe and North America (Kindling et al., 2017), and in the last few years, European research data repositories have grown substantially in size (DANS, DCC et al., 2022). The number of repositories in Africa, Asia, and BRICS countries has increased in recent years, but operators of these repositories sometimes find it challenging to fully realize their visions, for example if funding or institutional support are lacking (Academy of Science of South Africa, 2019; Cho, 2019; Misgar, Bhat, & Wani, 2020; Nishikawa, 2020).
2.2. Long-Term Operation of Research Data Repositories
2.2.1. Time scales of research data repositories
Time is a challenge for all types of infrastructures, because they operate at two different time scales at once: They must both be usable now and remain usable in the long term. As Karasti, Baker, and Millerand (2010, p. 400) put it, “an infrastructure occurs when here-and-now practices are afforded by temporally extended technology.” Both timescales can threaten the existence of a research data repository—it might be shut down if it is unable to serve current needs and practices, or if it cannot offer reliable services over a long period. Repository operators are aware of these timescales, as many have expressed concerns both about the long-term maintenance of their repository and the ability to develop new functionalities (Khan, Thelwall, & Kousha, 2021).
A research data repository also has to bridge another time gap: the varying life spans of its technical components and its data collection. The life span of a data collection is likely longer than that of the technical components required for its preservation: “data collections accrue slowly and steadily, yet software and hardware can change relatively rapidly and are beyond the control of collection staff responsible for data collection” (Thomer & Rayburn, 2023, p. 4). Therefore, in order to preserve data collections, the research data repositories storing them must also be maintained.
2.2.2. Risks to the long-term operation of research data repositories
Planning for the long-term preservation of research data is challenging, because various factors can put both the data and the repository that holds them at risk (Thomer, Weber, & Twidale, 2018). Research data are at risk of being lost if the research data repository is threatened, for example if it is facing loss of funding (Mayernik, Breseman et al., 2020). In interviews, developers and auditors of repository standards as well as repository staff identified five potential sources of risk: finance, legal, organizational governance, repository processes, and technical infrastructure (Frank, 2022). Barateiro, Antunes et al. (2010, pp. 8ff) developed a comprehensive typology of risks to digital long-term preservation systems, such as research data repositories. The typology lists vulnerabilities (“weaknesses […] in the environment”) and threats (“events that affect normal behaviour”) that can adversely impact components of these systems (see Table 1). Vulnerabilities can be introduced to preservation systems by software (process), by characteristics of the information objects being preserved (data), or by the infrastructure (infrastructure). Preservation systems can be threatened by nondeliberate actions (disasters), deliberate actions (attacks), managerial decisions (management), or changes in laws (legislation).
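Table 1. Typology of risks to digital preservation systems (after Barateiro et al., 2010)
Class | Category | Risk |
---|---|---|
Vulnerabilities | Process | Software faults |
Vulnerabilities | Process | Software obsolescence |
Vulnerabilities | Data | Media faults |
Vulnerabilities | Data | Media obsolescence |
Vulnerabilities | Infrastructure | Hardware faults |
Vulnerabilities | Infrastructure | Hardware obsolescence |
Vulnerabilities | Infrastructure | Communication faults |
Vulnerabilities | Infrastructure | Network service failures |
Threats | Disasters | Natural disasters |
Threats | Disasters | Human operational errors |
Threats | Attacks | Internal attacks |
Threats | Attacks | External attacks |
Threats | Management | Economic failures |
Threats | Management | Organizational failures |
Threats | Legislation | Legislative changes |
Threats | Legislation | Legal requirements |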
Risks to the long-term operation of research data repositories have long been discussed in the literature, in particular the risk of economic failure (Chowdhury, 2013). Concerns about funding cuts that threaten databases in the life sciences have been raised for years (Baker, 2012; Merali & Giles, 2005). If funding for databases is discontinued, entire research communities in the life sciences can be impacted, as the cases of the Biological Magnetic Resonance Bank (BMRB) (Nature Structural & Molecular Biology, 2012), EcoCyc (SRI International, 2014), and Online Mendelian Inheritance in Man (OMIM) (Kaiser, 2016) exemplify. In these cases, researchers who used the repositories publicly called to save the databases, as they were considered essential resources. These initiatives were successful, as the databases were still operational at the time this paper was written. Economic sustainability remains a major concern today (Ficarra, Fosci et al., 2020), although the funding of research data repositories has become more reliable in part (Burns, Lana, & Budd, 2013). Revenue streams of research data repositories can change over time (Eschenfelder, Shankar, & Downey, 2022), but still depend predominantly on publicly funded organizations (Imker, 2020).
Repositories also try to reduce technical risks (process and infrastructure vulnerabilities) (Eschenfelder & Shankar, 2017). Generally, infrastructures such as research data repositories can be adapted and reconfigured, but the “installed base,” components they build on, might limit this flexibility and therefore pose a risk (Hirsch, Ribes, & Inman, 2022). These limits can contribute to tensions between building something new and maintenance work (Ribes & Finholt, 2007).
Ensuring the sustainability of a research data repository is a continuous process (Eschenfelder & Shankar, 2017). Despite this fact, and the importance of infrastructure maintenance for the long-term preservation of data, it is not explicitly included in widely used models of digital curation, for example lifecycle models or the OAIS reference model (Thomer, Rayburn, & Tyler, 2020).
2.3. Research Data Repositories Being Shut Down
One characteristic of infrastructures in general is that they are often taken for granted and only noticed once they break down (Star & Ruhleder, 1996, p. 113). This is also true for information infrastructures such as research data repositories, but the point when they “shut down” is not always clear cut; for example, the information infrastructure can be closed entirely, scaled down, or disassembled into components that are repurposed later (Steinhardt, 2016).
Despite careful planning, a research data repository might be closed eventually (Dean, 2016). Currently, there is little evidence of how often this phenomenon occurs. A study of biological databases found that after a period of 18 years, 75% were either closed entirely or the content was no longer updated (Attwood, Agit, & Ellis, 2015), but this high rate of shutdowns might not translate to other repository types or disciplines. In another case, a consortial repository was closed after former consortium members gradually reorganized research data management and moved their data to institutional repositories (Dean, 2016). A recent study evaluated link rot in four registries of scholarly infrastructures but only provided limited evidence of de facto repository availability (Mannocci, Baglioni, & Manghi, 2022). Generally, shutting down an information infrastructure should not be considered a failure, because it is “a highly reflective and introspective set of practices” (Steinhardt, 2016, p. 2205) that generates new insights, which can be very useful for future undertakings.
If a repository is integral to the mission of the hosting institution, it is more likely to receive long-term institutional and financial support (Attwood et al., 2015). An institution’s mission might often outlast the lifetime of a repository, and, therefore, the institution should make arrangements for the event that a repository has to be closed (Dean, 2016). If anticipated, the shutdown of a repository can be planned and the stored data sets can be migrated to another repository, but sudden closure might result in permanent data loss (Boyd, 2021). Therefore, to ensure continuity of access to data, the repository certification organization CoreTrustSeal asks applicants to consider succession planning, the “preparations for handover of digital objects and services to another repository” (CoreTrustSeal Standards and Certification Board, 2022, p. 13).
Migration of data sources is a demanding task, as discrepancies between the legacy and the new system must be overcome (Nyitray & Reijerkerk, 2021). Although migration is likely common among long-lived research data repositories (Thomer et al., 2020), it is unclear how prevalent the strategy is. The study of biological databases referred to above found that after 18 years, 7% of the databases had been migrated to new sites (Attwood et al., 2015), but it only covers infrastructures of one discipline. Migration is often initiated in response to changing needs, of both repository operators and users, with the intention to improve repository content, service and management (Thomer et al., 2018; Thomer, Starks et al., 2022).
2.4. Closed Repositories in Registries
In a time frame similar to the emergence of repositories, specialized registries collecting information about them have evolved. Presently, there are several of these services, which differ in the types of repositories they cover, the metadata schemas they use to describe them, as well as the stakeholders and use cases they serve. The records that registries maintain can facilitate repository use, as well as monitoring activities. Because repository registries collect information on information infrastructures, they can also be valuable resources for research.
Depending on their mission, registries approach repository shutdowns differently. For example, the registry OpenDOAR focuses on active open access repositories. The inclusion criteria of the service state that a repository “must be currently and reliably accessible to any web user around the world” to be indexed. If editors become aware of a repository shutdown, the record of that repository can be made invisible to the public.
In contrast, the registry FAIRsharing, which originally focused on databases in the life sciences, does keep a record of repositories that have been shut down, because one objective of the service is to monitor the evolution of the objects it describes (Sansone, McQuilton et al., 2019). The metadata schema therefore reflects the status of repositories, which can be one of ready, in development, uncertain, or deprecated. If a repository is deprecated, a date and a reason can be specified.
The registry re3data also aims to describe the developments of the global landscape of research data repositories, including the emergence and shutdown of these infrastructures (Pampel, Vierkant et al., 2013). As a result, re3data documents the trajectory of the global repository landscape over time by representing information on the life-span of a repository via the elements startDate and endDate in the re3data Metadata Schema (Strecker et al., 2023a, p. 10):
startDate: The date the research data repository was released.
endDate: The date the research data repository ended its service of ingesting new research data and/or providing it.
3. METHODS
3.1. Extracting Repository Descriptions from re3data
Due to their overlapping missions, there are considerable intersections between collections of repository registries (Baglioni, Mannocci et al., 2023). Data collection for this analysis is based on the registry of research data repositories re3data, because it is currently the most comprehensive source of information on repositories with a clear focus on research data. It covers more than 3,000 repositories of all types and from all disciplines. The service was launched in 2012 and therefore has recorded information on the repository landscape for more than 10 years. Repositories are indexed by an international editorial team based on a comprehensive metadata schema. re3data metadata is available under an open license (Creative Commons CC0 1.0 Universal) through an open API.
3.2. Defining Inclusion Criteria
It can be difficult to determine whether a repository is permanently closed—for example, a repository might reopen after being shut down for maintenance, or the repository homepage might have been moved to a new URL (Attwood et al., 2015). In addition, research data repositories have multiple temporal properties, such as a life-span and temporal coverage, and they can be difficult to differentiate when attempting to determine what constitutes a closed repository.
In this paper, a repository is considered shut down if data is no longer accessible under the original or a new URL, or if the repository website clearly states that the service has ceased operations (while sometimes maintaining very limited access to the data). This definition refers to the life-span of a repository and does not relate to the temporal coverage of its collection, unless the website explicitly states that the data are deprecated.
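The definition above can be read as a simple decision rule. The following is a minimal sketch of that rule, not the authors' procedure: the record type `RepositoryObservation` and its field names are hypothetical, standing in for what a reviewer might note after visiting a repository website.

```python
from dataclasses import dataclass


@dataclass
class RepositoryObservation:
    """Notes from a website visit (hypothetical fields, for illustration only)."""
    name: str
    data_accessible_at_original_url: bool
    data_accessible_at_new_url: bool
    website_states_service_ceased: bool


def is_shut_down(obs: RepositoryObservation) -> bool:
    """Apply the inclusion criteria: data unreachable, or service declared ended."""
    data_unreachable = not (
        obs.data_accessible_at_original_url or obs.data_accessible_at_new_url
    )
    # A stated end of service counts even if very limited data access remains.
    return data_unreachable or obs.website_states_service_ceased


# Three hypothetical cases: moved to a new URL, limited access after closure, gone.
moved = RepositoryObservation("Repo A", False, True, False)
limited_access = RepositoryObservation("Repo B", True, False, True)
gone = RepositoryObservation("Repo C", False, False, False)

for obs in (moved, limited_access, gone):
    print(obs.name, is_shut_down(obs))
```

A repository that merely moved to a new URL is not counted as closed, whereas one that announces the end of its service is counted even when a basic download option survives.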
It is important to recognize the ambiguity of the term closed in the context of research data repositories. In this paper, closed does not refer to restrictions placed on access to data or repository services, but only to the status and life-span of repositories.
3.3. Collecting Data
The analysis draws on re3data repository descriptions and supplements them with information collected from repository websites.
To identify candidates for closed repositories, information on the end date of the 3,069 repositories indexed in re3data was retrieved via the API on January 2, 2023. As shown in Figure 1, this list was then restricted to repositories with an end date; this produced a list of 223 candidate repositories. The list of candidates was reviewed by the authors. After visiting the website of each repository on the list, duplicate entries (seven) were removed, as well as repositories that did not meet the inclusion criteria for defining closed repositories given above (25).
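The selection steps described above can be sketched as a small filter. This is an illustrative reconstruction under stated assumptions: the records are plain dictionaries with hypothetical `id` values, `endDate` mirrors the schema element of the same name, and the sets of duplicates and exclusions stand in for the authors' manual review. No call to the re3data API is shown.

```python
def select_closed(records, duplicate_ids, excluded_ids):
    """Keep records with an end date, then drop duplicates and manual exclusions."""
    candidates = [r for r in records if r.get("endDate")]
    return [
        r for r in candidates
        if r["id"] not in duplicate_ids and r["id"] not in excluded_ids
    ]


# Hypothetical records, for illustration only.
records = [
    {"id": "r1", "endDate": "2018"},
    {"id": "r2", "endDate": None},    # still active, no end date
    {"id": "r3", "endDate": "2020"},  # flagged as a duplicate in review
    {"id": "r4", "endDate": "2021"},  # fails the inclusion criteria in review
    {"id": "r5"},                     # end date not recorded
]
closed = select_closed(records, duplicate_ids={"r3"}, excluded_ids={"r4"})
closed_ids = [r["id"] for r in closed]
print(closed_ids)

# The counts reported in the text follow the same arithmetic.
reported_candidates, reported_duplicates, reported_excluded = 223, 7, 25
final_sample = reported_candidates - reported_duplicates - reported_excluded
print(final_sample)  # 191
```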
For the remaining 191 repositories, information on the start and end date (year) was retrieved from re3data on January 2, 2023. Given that these properties are optional in the re3data Metadata Schema, the values were verified manually and completed where possible. Between January 2 and 30, 2023, repository websites, both the current version and versions archived by the Internet Archive, as well as additional resources such as data papers describing the repositories were searched for information on the shutdown process. The content analysis of these materials focused specifically on the reason for shutting down and information on the repository taking over custody of the data. This information was collected for each repository in the sample, and, in addition, the availability of data on the current repository websites was checked. The statements providing reasons for shutting down the repository were generalized and summarized into categories based on the typology of risks to preservation systems developed by Barateiro et al. (2010) in Table 1. A similar approach was used previously to study risks to research data in laboratory settings (Kowalczyk, 2015).
On January 30, 2023, the type and subject of the closed repositories were retrieved via the re3data API to contextualize the findings (Strecker et al., 2023a, pp. 10–11):
type: The type of the research data repository (for example: disciplinary, institutional).
subject: The disciplinary focus of the research data repository (based on subject areas defined by the German Research Foundation, DFG).
The final data set is openly available (Strecker et al., 2023b).
4. RESULTS
4.1. Prevalence of the Repository Shutdown and Time Series Analysis
An analysis of the end dates in the sample indicates that re3data has recorded 191 repositories that were closed, with the years of shutdown spanning a period of 25 years. At the time of data collection, this constitutes 6.2% of all repositories indexed in re3data (3,069).
The first repository in the sample was shut down in 1999; the next closure occurred, after a pause of several years, in 2005 (see Figure 2). This is probably linked to the history of re3data—the service officially took up operations in 2012, and coverage of repositories being shut down before that date is likely sparse. The growth in the number of closed repositories has been fairly consistent from 2012 onward. As of the end of the data collection phase, one repository was closed in 2023.
Both a start and end date could be identified for 158 closed research data repositories. Those repositories had been operational for between 1 and 57 years before being shut down. The median age of a repository when shut down was 12 years.
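As a minimal sketch of how such an age statistic can be derived, the snippet below computes the age at shutdown as the difference between end year and start year and takes the median over repositories where both values are known. The subtraction rule and the sample values are assumptions made for illustration; they are not the study's data.

```python
from statistics import median


def age_at_shutdown(start_year, end_year):
    """Return the age in years, or None if either date is missing."""
    if start_year is None or end_year is None:
        return None
    return end_year - start_year


# Hypothetical (start year, end year) pairs; None marks a missing start date.
repositories = [(2008, 2018), (1999, 2012), (None, 2019), (2010, 2015), (2001, 2020)]

ages = [
    age
    for start, end in repositories
    if (age := age_at_shutdown(start, end)) is not None
]
median_age = median(ages)
print(sorted(ages), median_age)
```

Repositories without a known start date are excluded from the statistic, which mirrors the restriction to the 158 repositories with both dates.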
4.2. Characteristics of Closed Repositories
As Figure 3 shows, most repositories that were shut down were disciplinary and specialized in data from the life sciences and natural sciences. Compared to all repositories indexed in re3data at the time of data collection, repositories with these characteristics are also overrepresented in the sample. It is important to note that both properties (type and subject) can be repeated in re3data, so a repository might be assigned more than one type or subject; these combinations are reflected in the Venn diagrams.
A comparison of the 25th and 75th percentiles of the age distribution of closed repositories shows that compared to long-lived closed repositories, repositories with a shorter life-span (25th percentile) were more likely institutional and had a focus on humanities and social sciences (see Table 2). In contrast, long-lived closed repositories (75th percentile) were more likely disciplinary with a focus on life sciences.
Table 2. Type and subject of closed repositories in the 25th and 75th percentiles of the age distribution
Property | Value | 25th percentile | 75th percentile |
---|---|---|---|
Type | Disciplinary | 81.6% | 95% |
Type | Institutional | 23.7% | 7.5% |
Type | Other | 13.2% | 12.5% |
Subject | Humanities and social sciences | 26.3% | 5% |
Subject | Life sciences | 47.4% | 57.5% |
Subject | Natural sciences | 47.4% | 47.5% |
Subject | Engineering sciences | 10.5% | 7.5% |
4.3. Risks Resulting in Repository Shutdown
The reasons given for the closure of repositories were matched to the typology of risks to preservation systems by Barateiro et al. (2010). As Table 3 shows, for the majority of closed repositories in the sample (62.8%; 120), a reason for the shutdown could not be determined.
Table 3. Risks resulting in repository shutdown
Risk | Description | Number of repositories |
---|---|---|
N/A | No information available | 120 |
Threats–management–organizational failures | Repository was shut down as part of broader reorganization initiative within the operating organization, or because its mission is considered fulfilled | 37 |
Threats–management–economic failures | Repository was closed because funding was cut | 27 |
Vulnerabilities–infrastructure–hardware obsolescence | Repository was closed because of technological difficulties | 5 |
Vulnerabilities–process–software obsolescence | Repository was closed because of technological difficulties | 5 |
Threats–attacks–external attacks | Repository was closed because of acute hacking or security incidents | 2 |
Vulnerabilities–data–media obsolescence | Repository was closed because the data is considered obsolete | 1 |
For the remaining 71 repositories, 77 risks could be identified. Among them, threats (66) were more likely than vulnerabilities (11) to result in shutdown, meaning that for these repositories, shutdown can be attributed to specific events rather than weaknesses in the environment. The most common risks that led to shutdown were managerial in nature (64), due to either organizational failure (37) or economic failure (27). Examples of organizational failure include repository shutdown as part of broader reorganization initiatives within the operating organization, or because the mission of the repository was considered fulfilled. Economic failures cover all types of funding cuts, including the cessation of project-related funding. Vulnerabilities of repository technology led to five repositories closing; because repository websites and related materials provided no additional information, both hardware and software obsolescence were identified as risks in these cases. Threats introduced by external attacks, for example hacking incidents, resulted in repository shutdown in two cases. In one case, vulnerabilities due to media obsolescence led to repository shutdown: the research data were considered obsolete and were no longer maintained. Reasons rooted in both vulnerabilities of technology and threats of economic failure were cited by one repository.
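The aggregate figures in this paragraph can be cross-checked against the counts in Table 3. The sketch below encodes each identified risk as a (class, category, risk) path, following the typology in Table 1, and sums the counts at each level; the helper name `totals_by_level` is introduced here for illustration only.

```python
from collections import defaultdict

# Counts of identified risks as reported in Table 3 (the N/A row is omitted).
TABLE_3 = {
    ("Threats", "Management", "Organizational failures"): 37,
    ("Threats", "Management", "Economic failures"): 27,
    ("Vulnerabilities", "Infrastructure", "Hardware obsolescence"): 5,
    ("Vulnerabilities", "Process", "Software obsolescence"): 5,
    ("Threats", "Attacks", "External attacks"): 2,
    ("Vulnerabilities", "Data", "Media obsolescence"): 1,
}


def totals_by_level(counts, level):
    """Sum counts over the first `level` elements of each typology path."""
    totals = defaultdict(int)
    for path, number in counts.items():
        totals[path[:level]] += number
    return dict(totals)


by_class = totals_by_level(TABLE_3, 1)
by_category = totals_by_level(TABLE_3, 2)
total_risks = sum(TABLE_3.values())

print(total_risks)                            # 77
print(by_class[("Threats",)])                 # 66
print(by_class[("Vulnerabilities",)])         # 11
print(by_category[("Threats", "Management")]) # 64
```

Because a single repository can be assigned more than one risk, the total of 77 risks exceeds the 71 repositories for which a reason was found.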
4.4. Risk of Data Loss
The analysis revealed that for 88% (168) of the repositories in the sample, data were no longer available on the repository website. The remaining 23 (12%) repositories still maintained access to data in a limited capacity, for example via a simple FTP interface. Maintenance of data and access services beyond that, including search interfaces, had ceased. A total of 44% (84) of the repositories listed a repository that had taken over custody of the data. Data loss might have occurred at the closure of 90 (47.1%) repositories; these repositories did not maintain (limited) access to data or name a repository that had taken over custody of the data.
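The shares above follow directly from the reported counts and the sample size of 191. The sketch below recomputes them and expresses the data loss criterion as a rule; the function names and the boolean flags are hypothetical and chosen for illustration.

```python
SAMPLE_SIZE = 191


def share(count, total=SAMPLE_SIZE):
    """Return a percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


def possible_data_loss(limited_access_maintained, successor_named):
    """A closure risks data loss if neither safeguard was found."""
    return not limited_access_maintained and not successor_named


print(share(168))        # 88.0, data no longer available on the website
print(share(23))         # 12.0, limited access maintained
print(share(84))         # 44.0, successor repository named
print(share(90))         # 47.1, neither safeguard found
print(share(191, 3069))  # 6.2, closed repositories among all indexed
```

The two safeguards are not mutually exclusive: a repository can both keep limited access and name a successor, which is why the counts do not simply add up to the sample size.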
One of these cases with a high risk of data loss concerns the repository BIIACS. The repository had acquired certification from Data Seal of Approval (DSA), a now discontinued predecessor of the repository certification organization CoreTrustSeal. The repository was launched in 2008, certified by DSA in 2013, and shut down in 2018. It is unknown why the repository was shut down. In the self-assessment document submitted to DSA, the repository expressed “a commitment to maintain the perpetuity of the data with the highest standards, as part of [its] mission” (Leeuw, 2019). However, as of today, the data are no longer accessible from the repository website and no repository taking over custody of the data was named. The handles that the repository issued to ensure persistent identification of content no longer resolve.
4.5. Data Migration
For 84 (44%) closed repositories, evidence of data migration was found. In the sample, 75 repositories are listed that have taken over custody of data. Most (60) of these substituting repositories were listed only once, but some were mentioned two (10), three (3), and four (2) times. Three of the repositories mentioned most frequently (PubChem (4), Gene Expression Omnibus (3), and ArrayExpress (3)) are large, disciplinary repositories that were established around 20 years ago and are well known by researchers within the discipline.
Eight instances of data migrations reflected in the sample are part of a large-scale reorganization of toxicology data providers within the U.S. National Library of Medicine (NLM) in 2019, where the content of toxicology resources was integrated into other NLM sources (Bolton, Zhang et al., 2020). Information on chemicals was migrated to PubChem, which affected four repositories in the sample. Data migrations were announced beforehand and are well documented (Kim, Chen et al., 2021). A database transition page is still maintained today.
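A frequency distribution of this kind can be derived by counting how often each substituting repository is named and then counting how many repositories share each frequency. The following is a minimal sketch with hypothetical repository names; only the final check uses the figures reported above.

```python
from collections import Counter

# Hypothetical successor names, one entry per recorded migration.
successors = ["Repo P", "Repo Q", "Repo P", "Repo R", "Repo Q", "Repo P"]

mentions = Counter(successors)             # mentions per successor repository
distribution = Counter(mentions.values())  # number of successors per mention count
print(mentions)
print(distribution)

# Distribution reported in the text: mention count -> number of repositories.
reported = {1: 60, 2: 10, 3: 3, 4: 2}
distinct_successors = sum(reported.values())
print(distinct_successors)  # 75
```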
The analysis revealed three cases in the sample where custody of data was transferred to a repository that was later shut down (see Figure 4). In case 1, the chain of custody potentially ended with the closure of the new repository, because no evidence was found that the data were migrated again. In the other two cases, data were migrated again to repositories that are still operational today, resulting in a chain of transfers of data custody that is still intact. Figure 4 also shows that in some cases, the repository that had taken over the data was itself closed in the same year in which the original repository was shut down.
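Chains of data custody of this kind can be modeled as successor links between repository records. The sketch below is illustrative only: the `Repository` class, the `trace_custody` function, and the repository names are hypothetical and do not reflect the re3data schema or the actual cases in Figure 4.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Repository:
    name: str
    closed: bool = False
    successor: Optional[str] = None  # repository named as taking over custody


def trace_custody(start: str, registry: Dict[str, Repository]) -> Tuple[List[str], bool]:
    """Follow successor links from `start`.

    Returns (chain, intact). `intact` is True only if the chain ends at a
    repository that is still operational.
    """
    chain: List[str] = []
    seen = set()
    current: Optional[str] = start
    while current is not None and current in registry and current not in seen:
        seen.add(current)
        chain.append(current)
        repo = registry[current]
        if not repo.closed:
            return chain, True   # data rest with an operational repository
        current = repo.successor  # closed: look for the next custodian
    return chain, False          # no successor, unknown successor, or a cycle


# Hypothetical records mirroring the two patterns described in the text.
registry = {
    "A": Repository("A", closed=True, successor="B"),
    "B": Repository("B", closed=True),                  # chain ends here
    "C": Repository("C", closed=True, successor="D"),
    "D": Repository("D", closed=True, successor="E"),
    "E": Repository("E"),                               # still operational
}

print(trace_custody("A", registry))
print(trace_custody("C", registry))
```

In this toy registry, the chain starting at "A" is broken because "B" closed without naming a successor, whereas the chain starting at "C" remains intact because it ends at an operational repository.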
5. DISCUSSION
5.1. Life-Span of Research Data Repositories
Overall, the emerging landscape of research data repositories is dynamic, with new repositories opening and others being shut down. The analysis showed that research data repositories are shut down fairly frequently: 6.2% of the repositories indexed in re3data have been closed. Since re3data took up operations in 2012, repository closures have been recorded every year. Consequently, a repository shutdown is not a rare event. As suggested in the literature, shutdown should be considered an expected part of the repository life cycle (Dean, 2016). It is normal for conditions of data movements to change (Bates et al., 2016), and repository shutdown does not have to be a negative outcome. It can be an appropriate measure that might even result in new insights for the future (Steinhardt, 2016).
In the sample, the median age of a repository when closing is 12 years. This limited life span could potentially put research data at risk and should be considered when defining retention periods for the long-term preservation of research data. Because repository shutdown is a real threat, infrastructure maintenance should also be reflected in models of digital preservation, and shutdown scenarios planned in advance.
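How such a median age can be derived from start and end dates is sketched below. The records are invented for illustration and are not the study data; the field layout and the decision to skip records without a start year are likewise assumptions, not the procedure used in the paper.

```python
from statistics import median

# Hypothetical (start_year, end_year) pairs; None marks an unknown start year.
records = [
    (2001, 2015),
    (2007, 2017),
    (1995, 2019),
    (2010, 2016),
    (2004, 2016),
    (None, 2018),  # excluded below: age cannot be computed
]

# Age at shutdown in years, for records where both dates are known.
ages = [end - start for start, end in records if start is not None and end is not None]

median_age = median(ages)
print(f"ages at shutdown: {sorted(ages)}")
print(f"median age at shutdown: {median_age} years")
```

A median is less sensitive than a mean to a few very long-lived repositories, which makes it a reasonable summary for a skewed age distribution.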
Some closed repositories had particularly long or short life spans. A comparison of the top and bottom percentiles of the age distribution revealed differences in type and subject specialization of these long- and short-lived repositories. The sample of closed repositories indicates that disciplinary repositories catering to the life sciences might be more successful at staying operational long-term. However, there are also active long-lived repositories in other disciplines, for example, data archives in the social sciences that were established in the 1960s and are still operational today (Downey, Eschenfelder, & Shankar, 2019). Therefore, more research is needed to fully understand this relationship, particularly studies that also consider active repositories with long life spans.
The risks that result in the shutdown of repositories are varied. Some might be anticipated and planned for in advance, such as organizational changes within broader reorganization initiatives, whereas others might not, such as threats from acute security incidents.
As described above, infrastructures operate at different time scales at once by serving current needs and practices while also providing reliable long-term services. It is not clear from the analysis whether long- or short-term requirements put repositories at risk of shutdown more frequently. For example, managerial decisions to reorganize resources within an institution might be motivated by efforts to provide more comprehensive services for current users, or by an attempt to conserve resources in the long term. Overall, more research is needed to determine factors that put research data repositories at risk of being shut down, and how they can be addressed. However, for most repositories in the sample, the risks resulting in shutdown are unknown. The issue of missing information on repository shutdowns is discussed in more detail below.
5.2. Identifying Closed Repositories
Reviewing repository websites during data collection demonstrated that it can be difficult to determine whether a repository is permanently shut down. In part, this is because a repository has multiple temporal properties, such as its life span and the temporal coverage of its collection. These properties are distinct but can be difficult to separate in some cases. For example, a repository might have stopped ingesting new data at a certain point in time, but still maintain access to its collection. Depending on the context, the data might then be considered deprecated, or they might have retained their analytical potential, for example as historical data for time series analysis. Future research could further investigate the interrelation between temporal coverage and life span of research data repositories.
Another factor complicating the identification of closed repositories is the lack of information on planned downtime. A repository might be temporarily unavailable, for example due to maintenance activities. In the case of prolonged downtime periods where no information on the maintenance schedule is given on the repository website, repository users might assume the repository is permanently closed and report this to registries. In the sample, there was at least one repository that had an end date in re3data, but came back online after an unannounced downtime period (TreeBASE; see Notes). Repositories can avoid this by announcing planned downtime on their websites.
5.3. Strategies for Preventing Data Loss
The analysis of information collected from repository websites focused on two strategies for preventing data loss when a repository is shut down: maintaining limited access to data and migrating data to another repository. The results show that these strategies are used by repositories to varying degrees.
Most repositories do not uphold access to data when they are shut down, often leaving no trace of the repositories or their contents on their websites. Only a few repositories opt to maintain limited access to data, for example via a simple FTP interface. This strategy, however, is not a permanent solution, as all curation and long-term preservation activities have been discontinued. If no measures are taken to preserve individual data sets, it becomes more likely as time passes that the data will no longer be usable. In addition, browser support for FTP is declining, which might limit human access to data in the future.
In contrast, the strategy of migrating data to another repository is more common, being used by almost half of the repositories in the sample. Overall, this strategy is more likely to retain the usefulness of data, because the repositories that have taken over custody of the data can apply appropriate curation and preservation measures. However, this strategy should not be considered definitive either, as the burden of infrastructure maintenance is not eliminated but transferred to the succeeding repository. This is demonstrated by three cases of chaining of data migration, meaning that custody of data has been transferred several times. In two of these cases, the chain of data custody remained intact, but in one case it ended because the repository that had ingested the data was shut down without indicating a successor.
Some repositories seem to have established themselves as reliable options for taking over custody of data, given that they have ingested data from multiple repositories that were shut down. These tend to be large repositories with a disciplinary focus and a comparatively long life span. They seem to have become central infrastructures for the research domain they focus on while also serving as a safe haven for data collections from other repositories. Future research should critically investigate the long-term consequences of data migration. For example, data migration conducted within larger reorganization efforts can affect a number of repositories and result in a consolidation of the repository landscape, by combining collections into large, central repositories. Central repositories might be able to use limited resources more efficiently, but could also create single points of failure if they themselves become at risk of being shut down.
Other measures also have the potential to reduce the risk of data loss, for example repository certification. One repository in the sample acquired formal certification from DSA, but was shut down despite the stated commitment to ensuring long-term availability of data, demonstrating that obtaining certification is no guarantee that a repository will stay operational long-term. However, the certification process can require repositories to provide evidence of sustainable operations and succession planning, encouraging them to implement appropriate measures. The same case also highlights the limitations of another measure to ensure sustained access to data: the use of persistent identifiers. Persistent identifiers are very useful for reliably referring to research data, but they have to be maintained to function as intended.
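Whether a persistent identifier still functions can be monitored with a simple resolution check. The sketch below is a rough illustration and not a method used in the study: the function names are hypothetical, a HEAD request is assumed to be sufficient (some servers reject HEAD requests), and any final status outside the 2xx range is treated as a failure to resolve.

```python
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the final HTTP status for url after redirects, or 0 if unreachable."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code          # e.g., 404 when the identifier is unknown
    except (urllib.error.URLError, OSError):
        return 0                 # DNS failure, timeout, refused connection


def resolves(url: str, fetch: Callable[[str], int] = http_status) -> bool:
    """Treat an identifier as resolving if the final status is in the 2xx range."""
    return 200 <= fetch(url) < 300
```

The check could be applied to the handle listed in the Notes, for example `resolves("https://hdl.handle.net/10089/17040")`. The `fetch` parameter allows the network call to be replaced in tests or in batch runs against archived status data.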
5.4. Documenting Repository Shutdowns and Changes in Data Custody
The analysis demonstrates that most repositories completely disappear after being shut down: They maintain neither limited access to data nor a web page with information on the shutdown process or potential data migration. This lack of information can have serious consequences for data citation and the permanence of the scholarly record. If research data are reused and cited, but later become unavailable due to the repository shutting down, references can break, which erodes the scholarly record. This is especially the case if no information about the repository shutdown or data migration is shared.
An example of how information on repository shutdown can be provided is the comprehensive reorganization of toxicology information within the NLM. Even years after the initiative was concluded, the NLM maintains a database transition page, a website that provides information on the reorganization process. The initiative was announced in advance and data migrations are documented in detail, enabling researchers to trace data sets to the sites where they are now stored.
Registries can contribute to making information on repository shutdown and changes in data custody accessible. As services that collect information on research data repositories and their trajectories, they are uniquely positioned to do so. For example, re3data currently indicates whether a repository was closed, and when. The most recent version of the re3data Metadata Schema, version 4.0, introduces the option to reflect transfers of data custody (Strecker et al., 2023a), which will make data migration visible in the registry.
To realize this vision of more comprehensive information on repository shutdown, repository operators should consider being more transparent about the process. By providing information on the shutdown process beforehand and naming the succeeding repository if the data were migrated, they can increase transparency, help researchers trace chains of data custody, and inform other repository operators.
6. CONCLUSION
The analysis showed that research data repository shutdown is not a rare phenomenon but should be considered an integral part of the life cycle of repositories. Repository shutdown poses a real threat to the perpetual availability of research data. Therefore, when planning preservation measures, repository operators should also take into account the infrastructure perspective on the long-term availability of data. Planning ahead increases the chance of saving data if the repository is forced to shut down. Strategies such as data migration can prevent immediate data loss but should not be considered permanent solutions. To more fully reflect the landscape of research data repositories and support the integrity of the scholarly record, registries can document repositories shutting down as well as changes in data custody. Overall, more research is needed to evaluate strategies for preventing data loss and to understand how specific factors affect the risk of repository shutdown: for example, how deeply a repository is embedded in its community, its revenue streams, and its ability to respond to both short- and long-term shifts in user needs.
6.1. Limitations
As re3data took up operations in 2012, re3data records likely do not document all incidents of repositories being shut down before that time. The analysis is based solely on information that is publicly available online. Most closed repositories have provided only limited or no information about the shutdown process; therefore, data migration events or reasons for shutting down are likely underreported.
AUTHOR CONTRIBUTIONS
Dorothea Strecker: Conceptualization, Investigation, Writing—original draft. Heinz Pampel: Writing—original draft. Rouven Schabinger: Writing—original draft. Nina Leonie Weisweiler: Writing—original draft.
FUNDING INFORMATION
This work has been supported by the German Research Foundation (DFG) under the project re3data COREF (Grant ID 422587133).
The article processing charge was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation)—491192747 and the Open Access Publication Fund of Humboldt-Universität zu Berlin.
Heinz Pampel was partly funded by the Einstein Center Digital Future (ECDF).
DATA AVAILABILITY
The data this document is based on are openly available from Zenodo (Strecker et al., 2023b).
COMPETING INTERESTS
The authors have no conflicts of interest.
Notes
OpenDOAR inclusion criteria: https://v2.sherpa.ac.uk/opendoar/about.html.
re3data: https://doi.org/10.17616/R3D.
Creative Commons CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/.
Subject areas defined by the DFG: https://www.dfg.de/en/dfg_profile/statutory_bodies/review_boards/subject_areas/.
BIIACS: https://doi.org/10.17616/R3ZG6K.
Example of a handle that no longer resolves: https://hdl.handle.net/10089/17040.
PubChem: https://doi.org/10.17616/R3KG65.
Gene Expression Omnibus: https://doi.org/10.17616/R33P44.
ArrayExpress: https://doi.org/10.17616/R3302G.
NLM toxicology database transition page: https://www.nlm.nih.gov/toxnet/index.html.
TreeBASE: https://doi.org/10.17616/R3DK58.
REFERENCES
Author notes
Handling Editor: Vincent Larivière