Research e-infrastructures for open science: The national example of CSTCloud in China

ABSTRACT This paper focuses on research e-infrastructures in the open science era. We analyze some of the challenges and opportunities of cloud-based science and introduce an example of a national solution in the China Science and Technology Cloud (CSTCloud). We selected three CSTCloud use cases in deploying open science modules, including scalable engineering in astronomical data management, integrated Earth-science resources for SDG-13 decision making, and the coupling of citizen science and artificial intelligence (AI) techniques in biodiversity. We conclude with a forecast on the future development of research e-infrastructures and introduce the idea of the Global Open Science Cloud (GOSC). We hope this analysis can provide some insights into the future development of research e-infrastructures in support of open science.


LITERATURE REVIEW
As one of the key pillars of open science [7], open science infrastructure refers to open and shared research facilities [8,9]. According to the essential components assembled [10] Endeavors ranging from international organizations and national governments to regional and disciplinary research institutions have contributed to the construction of various types of e-infrastructures [11], with selected examples listed in Table 1. Different open science infrastructures share similar features: they are federated, accessible, interconnected, and interoperable [7]. Being federated means that distributed research resources, services, and supporting infrastructures are shared among collaborating facilities that are linked by effective cloud solutions [12,13]. Accessibility, also enshrined in the FAIR principles (findable, accessible, interoperable, and reusable) [14], refers to a better flow of research resources to the science community. International interconnectivity requires research resources to be shared across borders for scientific discovery. Interoperability refers to the capability of resources and services to communicate seamlessly at different levels, including data, technologies, services, and policies [15,16]. In the spirit of open science, these e-infrastructures emphasize openly shared technologies, open services (i.e., computing, network, and storage), open resources (i.e., data and publications), open governance, and an open research community.
This shift to an open-science research paradigm also brings challenges for e-infrastructures. For example, silos between e-infrastructures are still common, while the mutual trust mechanisms and technical interoperability [17] needed to bridge platforms remain in progress. Imbalances between data scale and data value complicate scientific research [18], as in research on the Sustainable Development Goals (SDGs) [19]. Sharing different research resources [20] is still challenging because of fuzzy data, complex algorithms, and inconsistent software, as well as concerns such as security, privacy, and intellectual property rights. The interaction between users and infrastructure deployment cannot be neglected either [21,22]. Similar challenges exist in CSTCloud. This paper therefore analyzes the barriers in specific scenarios and illustrates possible solutions through case studies.

Overview of CSTCloud
As one of the national research e-infrastructures in China, CSTCloud aims to build robust, interoperable, and sustained research services featuring open-science solutions. CSTCloud explores cutting-edge information technologies, such as big data management, cloud computing, and artificial intelligence, to support massive data transfer, curation, computation, visualization, publishing, and long-term preservation. It enables the use of open data and other research resources for scientific discovery. Federated strategies are implemented for the scheduling, management, and monitoring of research resources and service delivery within the Chinese Academy of Sciences (CAS) and across the country. CSTCloud collaborates with national, regional, and institutional data centers, disciplinary science clouds, and thematic demonstrations to provide services at the IaaS (infrastructure as a service), PaaS (platform as a service), and SaaS (software as a service) layers (see Figure 1).

Selected Challenges
CSTCloud provides resources and services for over one million users in CAS and across the country, and several typical challenges have been identified in service delivery. First, the gap between planned and expected service capacity for research e-infrastructures is growing. Research demands are increasing sharply, especially for Internet connectivity [23], data storage capacity [24], and computing. In many scientific and technological infrastructures, near-future expectations may be ten times higher than current service capabilities [25]. However, limited funding models make it necessary to re-design public platforms with improved service capabilities to meet the needs of these large research facilities. Second, research demand is uneven. In astronomy, for example, the timing of supernova explosions, the number of photons received, and the intensity of cosmic microwave radiation are all random. Continuous and timely observation and effective data analysis of all possible astronomical phenomena are therefore necessary to ensure scientific discoveries, both large and small [26]. Robust technologies, such as flexible and scalable cloud services, are thus needed to advance a healthier and more capable research ecosystem that adapts to diverse research scenarios [27,28]. Third, data management is becoming more complex. In data-intensive areas such as high-energy physics, life-cycle data management is necessary to support the operation of major research e-infrastructures for scientific discovery. Data management should therefore be tailored to data characteristics, and further exploration is needed of core data management frameworks, configurable workflow systems, and proper strategies for cross-domain resource scheduling and cloud service orchestration. Fourth, distributed research resources require integration into one-stop, tailored services. Currently, the CSTCloud service catalogue provides links to disciplinary platforms. Many of them are operated by third-party service providers in a distributed manner, so CSTCloud should maintain a dynamic management mechanism to support single sign-on services for end users. Fifth, social alignment should follow open science principles. Fragmentation of open science persists across regions [29]. It is therefore urgent to reach a broader consensus on multi-faceted open science visions, guiding principles, and normative frameworks based on shared values. E-infrastructures like CSTCloud should make full use of their potential to break silos and enhance social engagement in open science.

CSTCloud Supported Design and Development of Open Science Exemplars
To tackle the challenges mentioned above, several actions have been taken in CSTCloud, including enhancing the authentication and authorization infrastructure in CSTCloud AAI and developing a cloud federation service platform, the CSTCloud Federation, to serve uneven e-infrastructure demand. Open science exemplars built on these services include scalable algorithm engineering to tackle complex astronomical data management, the integration of resources for tailored SDG-13 research, and the coupling of AI and citizen science for enhanced model training in biodiversity.

CSTCloud AAI and CSTCloud Federation for Enhanced Service Capacity
To embrace open science, CSTCloud AAI (https://aai.cstcloud.net) has been re-designed as an evolving platform based on the CSTCloud Passport to deliver sustained services for identity authentication and authorization, enabling access to open and convergent global resources. In addition, the CSTCloud Federation platform (https://fed.cstcloud.cn) has been developed to manage distributed computing and storage resources between institutions for tailored science deployment. This cloud federation [30,31] supports interoperable resource aggregation to help reduce costs and enhance service capability. The system currently serves several projects in which distributed compute and storage resources are managed by the cloud system with autonomous resource configuration. End users can log in to the CSTCloud Federation platform and create on-demand virtual machines for tailored research, with resource scheduling and orchestration jobs handled automatically by the platform. Unified operation monitoring and usage metrics also help guarantee the quality of services and reveal their social impact. Moreover, open-science practices are also shaping disciplinary showcases.
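The on-demand placement described above can be illustrated with a minimal sketch. It is not the CSTCloud Federation's actual scheduler, whose policy the text does not specify. The names `Site`, `VMRequest`, `place`, and `release` are hypothetical, and the policy shown (choose the member site with the most free CPUs that can fit the request) is an assumption made only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Site:
    """A member facility contributing capacity to the federated pool (hypothetical)."""
    name: str
    free_cpus: int
    free_storage_gb: int


@dataclass
class VMRequest:
    """An on-demand virtual machine request from an end user (hypothetical)."""
    user: str
    cpus: int
    storage_gb: int


def place(request: VMRequest, sites: List[Site]) -> Optional[Site]:
    """Reserve capacity on the fitting site with the most free CPUs.

    Returns the chosen site, or None if no site can host the request.
    The selection policy is an illustrative assumption.
    """
    candidates = [
        s for s in sites
        if s.free_cpus >= request.cpus and s.free_storage_gb >= request.storage_gb
    ]
    if not candidates:
        return None
    chosen = max(candidates, key=lambda s: s.free_cpus)
    chosen.free_cpus -= request.cpus
    chosen.free_storage_gb -= request.storage_gb
    return chosen


def release(request: VMRequest, site: Site) -> None:
    """Return the reserved capacity to the pool when the job finishes."""
    site.free_cpus += request.cpus
    site.free_storage_gb += request.storage_gb
```

A caller would keep the returned site together with the request and pass both to `release` once the virtual machine is no longer needed, so that the capacity becomes available to other users of the pool.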

Case 1: Scalable Cloud Services in FRB research
Open astronomical data, such as those from FAST (Five-hundred-meter Aperture Spherical radio Telescope) [32], enable significant research outputs worldwide, such as the search for possible ExoMoons [33] and pulsars [34] and the detection of ultra-high-energy particles [35]. The collaboration between CSTCloud and the FAST team aims to find effective ways of handling complex data for FRB (fast radio burst) research based on large-scale distributed e-infrastructures. To this end, a cloud service platform named ScaleBox has been developed jointly. Tailored algorithms are developed and validated with data samples on scientists' laptops. Then, boosted by algorithm engineering in ScaleBox, large-scale streaming data processing and federated data learning are carried out on computing facilities at remote sites (see Figure 2). To facilitate the astronomical data workflow in an open-science manner, ScaleBox connects CSTCloud computer clusters and uses the CENI (China Environment for Network Innovations) 100G network to ensure timely large-scale data transfer and computing. The new research workflow, covering optimized management of existing FAST radio bursts, pulsar search experiments, and data from the telescope observations, runs steadily to support FAST data distribution, on-line processing, computing, and archiving. It has reduced the potential for redundant data distribution and improved the efficiency and efficacy of data processing.

Case 2: Convergence of Research Resources to Support SDG Research
The SDGs, adopted in 2015 by all United Nations member states [36], have been taken as one of the priorities in the CASEarth project [37]. As part of this activity, the CSTCloud team has been developing an SDG-13 system to effectively integrate and manage multi-source data, algorithms, and tools for on-demand SDG-13 service delivery. A first demonstration system focusing on Southeast Asia is currently under construction (see Figure 3). With contributions from different institutions, the joint cloud system includes a remote-sensing monitoring system for disaster mitigation and selected projection models and algorithms for both long-term and real-time climate prediction. Centralized computing facilities in CSTCloud and distributed data entities and algorithms from institutions are connected by APIs for on-line deployment.

Case 3: Artificial Intelligence and Citizen Science in Wildlife Monitoring
Dynamic changes to wildlife and their habitats are important to research and operations in nature reserves. To handle large-scale, real-time monitoring data precisely for decision making, CSTCloud collaborates with the Guangdong Chebaling Nature Reserve of China, deploying artificial intelligence technology and citizen science [38]. As shown in Figure 4, in the first stage, images and videos are captured automatically by infrared cameras. The data are then transferred frequently to the cloud service platform through a 700 MHz FDD-LTE network that includes four base stations and covers 91 km² of the Chebaling Nature Reserve. The cloud platform is maintained remotely by the CSTCloud team. Image recognition and video content analysis help track ecological activities. To reflect the role of citizens in this project, crowdsourcing is also used for data capture, in addition to the 700,000 valid images captured by infrared cameras. Citizen science enlarges the image and video pool and helps offset, to some extent, the limitations of existing sampling methods, thus contributing to the training of the feature model.
Table 2 summarizes the CSTCloud exemplars practicing open science. Based on the CSTCloud AAI and CSTCloud Federation, the FAST example in astronomy illustrates how we manage large-scale distributed research e-infrastructures with complex data flows. CSTCloud re-designed the workflow with scaled-up solutions to handle astronomical data moving between labs, computing clusters, and data-capturing facilities. Supported by validation on regional datasets, the SDG-13 case integrates multi-source data and algorithms for tailored cloud services. Inviting citizen science for data capture to enhance AI model performance in Case 3 provides another example of open science engagement. However, these cases cover only part of open science practice. For open governance, an advisory board and a user committee are included as compulsory parts of the governance model to highlight openness.

Although there are no "one-size-fits-all" solutions for open science infrastructures, the cases offer several lessons. First, the expanding demand for research facilities can never be precisely measured or fully satisfied, so cloud federation may be implemented to bridge the gaps. Demands may be exaggerated beyond actual requirements, constrained by unexpected construction conditions, or under-evaluated with respect to their potential longer-term contributions. To guarantee the robustness of future-oriented facilities, public e-infrastructures should be enhanced to provide backup options, such as cloud federation that connects different facilities contributing to a resource pool. Under unified resource orchestration, short-term jobs are sent to facilities that are currently available, and the resources are released for other jobs when necessary. Direct investment in facilities may be partly saved by sharing resources and services between e-infrastructures, especially to meet unsteady demand. Second, systematic approaches and comprehensive solutions should be taken to plan and develop every infrastructure module so as to facilitate the overall research workflow. For complex data management, one efficient approach is to design effective workflows that reduce redundant management work, thus facilitating transparent access to and smart analysis of different resources. Third, innovative ways should be found to promote the sharing of diverse research resources. The coupling of AI and citizen science provides one way, and federated learning is another. In a federated model, data holders can retain their data because of particular concerns, such as security and privacy, while still contributing to the research by training sub-models. All sub-models are then integrated through the cloud into an enhanced overall model. Federated computing technology strengthens the scalability of the model training framework while ensuring appropriate data control [39], and it is being tested in CSTCloud SDGs research. Furthermore, interoperable persistent identifiers should be deployed to cover all digital research resources, so that the interconnectivity between these resources is depicted and managed appropriately. Fourth, incentives should be in place to promote sustained open science models. For instance, the implementation of cloud federation should apply loose coupling strategies [40,41] to retain independence for each e-infrastructure. Contributions to the federated resource pool must be precisely monitored and measured, with appropriate reward mechanisms such as offset funding, payback, long-term investment, priority in future deployment supported by the federated resource pool, research reputation, and other social impacts. Diverse interests may be embedded in implementing open science models, and robust incentive mechanisms should include reward systems for all stakeholders. In addition, compound interoperability solutions should include both FAIR policy design and the development of trustworthy technologies to increase the inclusiveness and usability of the collaborative research environment supported by digital research infrastructures.
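The integration of sub-models mentioned in the discussion of federated learning can be illustrated with a short sketch. The text does not say which aggregation rule is used in CSTCloud, so the example assumes plain federated averaging, in which each data holder's parameters are weighted by its local sample count. The function name `federated_average` and the flat list-of-floats representation of a sub-model are hypothetical simplifications.

```python
from typing import List, Sequence, Tuple


def federated_average(updates: Sequence[Tuple[Sequence[float], int]]) -> List[float]:
    """Merge locally trained sub-models into one set of parameters.

    Each update is (weights, n_samples): the parameters a data holder
    trained on its own data, and how many samples it used. Only these
    parameters leave the data holder; the raw data stay local.
    Assumes sample-weighted averaging (an illustrative choice).
    """
    if not updates:
        raise ValueError("at least one sub-model is required")
    total = sum(n for _, n in updates)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    dim = len(updates[0][0])
    if any(len(weights) != dim for weights, _ in updates):
        raise ValueError("all sub-models must have the same number of parameters")

    merged = [0.0] * dim
    for weights, n in updates:
        for i, w in enumerate(weights):
            # Each holder contributes in proportion to its share of the samples.
            merged[i] += w * n / total
    return merged
```

In a full training loop the merged parameters would be sent back to the data holders for another round of local training; that loop, and any secure aggregation step, is omitted here.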
To facilitate the long-term development of open science e-infrastructures, and particularly to help bridge different open science platforms based on mutual trust, the idea of co-designing and co-developing a "Global Open Science Cloud" (GOSC) was proposed in 2019 and is being implemented by CODATA and the CAS Computer Network Information Center [42,43]. The objective of the GOSC Initiative is to call for international collaboration and alignment among open science cloud activities, forming a robust network of trusted research e-infrastructures that connects digital research resources and all stakeholders. It can thus enable innovative science discovery in the evolving open science environment within global,