Abstract
Machine learning (ML) applications in weather and climate are gaining momentum as big data and the immense increase in high-performance computing (HPC) power pave the way. Ensuring FAIR data and reproducible ML practices are significant challenges for Earth system researchers. Even though the FAIR principles are well known to many scientists, research communities are slow to adopt them. The Canonical Workflow Framework for Research (CWFR) provides a platform to ensure the FAIRness and reproducibility of these practices without overwhelming researchers. This conceptual paper envisions a holistic CWFR approach towards ML applications in weather and climate, focusing on HPC and big data. Specifically, we discuss the FAIR Digital Object (FDO) and the Research Object (RO) in the DeepRain project to achieve granular reproducibility. DeepRain is a project that aims to improve precipitation forecasts in Germany by using ML. Our concept envisages the raster datacube to provide data harmonization and fast and scalable data access. We suggest the Jupyter notebook as a single reproducible experiment. In addition, we envision JupyterHub as a scalable and distributed central platform that connects all these elements and the HPC resources to the researchers via an easy-to-use graphical interface.
1. INTRODUCTION
The intention of the FAIR (Findable, Accessible, Interoperable, Reusable) principles by Wilkinson et al. [1] was not limited to data, but also targeted other Digital Objects (DO) [2], e.g., the algorithms, tools, and workflows that lead to data. The FAIR Digital Object (FDO), subsequently introduced by de Smedt et al. [3], provides a framework for transparent, reusable, and reproducible data [4]. The apparent benefit of reproducible science is that results can be restored in critical situations and that transparency, trust, interest, and the number of citations increase. Reproducibility can reach a level where reusing previous work becomes routine practice, leading to increased productivity, better work habits, and continuity [5, 6, 7, 8]. However, the reality of science deviates from these principles. The concept of the Canonical Workflow Framework for Research (CWFR) was proposed by Hardisty and Wittenburg [9] as a solution to expand the adoption of the FAIR principles in the broader research community. CWFR relies on identifying recurring patterns across disciplines and breaking down workflows into smaller modular components that can be reused and reassembled for other use cases. In this paper, we suggest using Jupyter notebooks on JupyterHub with a connection to an HPC system [10] as a platform to develop a concept for bringing the components of the CWFR together for the application of Machine Learning (ML) in Earth System Sciences (ESS). To develop a functioning CWFR, identifying the challenges and practices of that particular community is essential. ESS in general, and climate and weather in particular, have seen significant growth in recent years, thanks to petabyte-scale data and an exponential increase in computational capability [11, 12]. With a growing amount of data, and in light of climate change and its impacts, FAIRness in ESS ensures comprehensible and reliable knowledge of the environment. However, in the particular domain of Earth sciences, more than 60% of surveyed researchers stated that they had failed to reproduce someone else's experiment, while more than 40% admitted that they were unable to reproduce their own [13]. The issues above also exist in ML, as documented in ML conference publications. At the prestigious Conference on Neural Information Processing Systems (NIPS) in 2017, less than 40% of the publications provided links to the code. As a consequence, several studies highlight the importance of reproducible ML that allows others to apply the contributions and increase the impact of ML research [14]. The data-driven nature of ML poses unique challenges regarding reproducibility. As ever more data is used for training and testing, ensuring that presented results are sound and reliable is a significant challenge [8]. In addition, the training process involves randomness. For instance, stochastic gradient descent (which is widely used for ML model updates) is a randomised procedure that can result in different weights at each run even though identical code is used [15]. Furthermore, ML frameworks are commonly used to speed up development. Mainstream frameworks such as TensorFlow [16] and PyTorch [17] use mixed precision to accelerate training on GPUs, which can yield different results depending on the underlying software or hardware. Moreover, most ML algorithms rely on a vast range of libraries, frameworks, configurations, and virtual environments that change rapidly, so that different versions can lead to different outputs.
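To give a concrete flavour of the seeding issue, the snippet below is a minimal sketch, using PyTorch as one of the frameworks named above, of fixing the common random seeds; even then, non-deterministic GPU kernels or mixed-precision arithmetic can still cause run-to-run differences. The seed value is illustrative.

```python
import random

import numpy as np
import torch

def seed_everything(seed: int) -> None:
    """Fix the seeds of the common random number generators.

    Note: even with fixed seeds, results can still differ across
    hardware or library versions, e.g. due to non-deterministic
    GPU kernels or mixed-precision arithmetic.
    """
    random.seed(seed)                 # Python built-in RNG
    np.random.seed(seed)              # NumPy RNG
    torch.manual_seed(seed)           # PyTorch CPU and CUDA RNGs
    torch.backends.cudnn.deterministic = True  # prefer deterministic kernels
    torch.backends.cudnn.benchmark = False     # disable kernel autotuning

seed_everything(42)  # illustrative seed value
```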
The challenges in ESS and ML compound when ML methods are applied to ESS data. ESS and ML algorithms rely on large data volumes which require rapid processing and thus high-performance computing (HPC) resources. Well-designed workflows play an important role in handling elaborate data preparation and in efficiently utilising tomorrow's exascale computers. The widespread adoption of workflow applications faces two significant challenges: a vast gap between the workflow applications used in enterprise-scale IT firms and those in science labs, and the shared belief that each researcher's application is unique [9]. More than 90% of the researchers surveyed by Stoddart [13] agreed that “more robust experimental design” is necessary for enhanced reproducibility of scientific results. A well-designed workflow that ensures reproducibility and traceability in every step could greatly enhance the robustness of ESS experiment design. Having reusable software and workflow components does not imply that individual scientific ideas can no longer be pursued. On the contrary, they form a solid reproducible basis on which new ideas can build with less potential for errors.
Inspired by the expected benefits of integrating reproducibility into ML applied to ESS, we discuss the particular challenges of our interdisciplinary project, DeepRain, in Section 2. This project is an excellent example of the need for interoperable and reproducible research, as it aims to improve precipitation forecasts using ML. In this context, the benefits of FAIRification for ML applications in ESS are discussed, and we describe how existing concepts and methods can be used in Section 3. Section 4 then introduces our proposed framework based on CWFR, which aims to provide flexible and reproducible ML focusing on big-data analysis on HPC systems. A conclusion and an outlook based on our concept are given in the final Section 5.
2. PROBLEM FORMULATION AND PREREQUISITES
The DeepRain project serves as a particular research example for which a canonical workflow framework is crucial. In the DeepRain project [18], more than 1.3 PBytes of meteorological data are exploited to develop complex ML algorithms to predict precipitation in Germany. In particular, historical forecasts of the COSMO-DE (Consortium for Small-Scale Modelling) Ensemble Prediction System (EPS) provided by the German Weather Service (DWD) [19, 20, 21] serve as the input for quantitative precipitation forecasts at station sites and on gridded domains based on tailor-made deep neural networks. For the latter, the high-resolution radar-based climatology product RADKLIM acts as a high-quality observational reference dataset [22, 23]. Processing such a large amount of data requires massive computational resources that are exclusively available on HPC systems. Because of bandwidth limitations, the big data is usually hosted in the same facility that provides the HPC system, to reduce data streaming delays and connection interruptions. Hence, any suggested solution should provide easy access to HPC systems, stored data, and in-situ processing to avoid unnecessary uploads and downloads of data. In the DeepRain project, where a relatively broad group of researchers from ML, weather and climate, and computer sciences (CS) are collaborating, we need a framework that allows for efficient collaboration and does not require too much CS expertise. The project requires using technical tools and communicating and developing ML models and experiments while reducing the time and energy necessary to get acquainted with the methods and terminology used in different disciplines. A FAIR practice helps to make the research interoperable across disciplines, even within the same project and group. One main obstacle to adopting FAIR practices is that many of them require drastic changes in the researcher's procedures. Thus, any suggested solution should adjust to the researcher's needs rather than being constrained by IT considerations [24]. The changes to the researcher's work should be minimized to serve both the fulfilment of FAIR practices and the acceptance by the researcher. In addition, any pre-designed workflow must allow the use of domain-specific language and procedures, for example with respect to the evaluation of results. We therefore narrowed down our focus to ML applications for the ESS community to ensure easy adoption without extensive development. If the proposed approach is picked up by the ESS community, it can be further developed and generalised to meet the requirements of other science communities. A high degree of flexibility is necessary for any proposed CWFR to adapt to researcher practices. However, designing a flexible solution requires a certain degree of computer knowledge from the researcher. Thus, it is necessary for researchers to be familiar with CS fundamentals such as Unix, bash, and version control (e.g., with git). However, much of the higher-level expertise needed to develop and provide services that maintain a CWFR is out of reach for most research institutions. Thus, a collaboration between service providers, such as data and HPC centres, and research communities is necessary to maintain a sustainable ecosystem. Such a partnership provides expertise, training, and infrastructure for research communities, and it requires a convergence between the research communities and the infrastructure providers.
Even though computational experiments should be easier to reproduce than physical experiments, the complexity and fast pace of change of today's software and hardware make this surprisingly difficult [25]. Research needs to be reproducible for a human, referred to as scientific reproducibility, and for a machine, referred to as technical reproducibility. As mentioned above, randomised processes, mixed precision, and hardware dependencies impede technical reproducibility. Thus, we focus on scientific reproducibility, where we can reproduce statistical features and underlying distributions. The latter means that the final results may deviate slightly from past experiments even with an identical set-up, but the obtained statistical properties (e.g., model performance in terms of evaluation scores) and the corresponding conclusions must remain the same. In this sense, the deviations of the obtained results must be indistinguishable from random noise.
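Scientific reproducibility in this sense can be checked statistically. As a purely illustrative sketch (the scores, the choice of test, and the threshold are our assumptions, not a prescription), one might compare evaluation scores from repeated runs of an experiment against a reference set and test whether the deviations are consistent with random noise:

```python
import numpy as np
from scipy import stats

# Hypothetical evaluation scores (e.g. RMSE) from repeated trainings
reference_scores = np.array([0.412, 0.405, 0.398, 0.421, 0.409])
reproduced_scores = np.array([0.403, 0.417, 0.411, 0.399, 0.408])

# Two-sample t-test: are the two sets of scores statistically
# indistinguishable, i.e. do the deviations look like random noise?
t_stat, p_value = stats.ttest_ind(reference_scores, reproduced_scores)
if p_value > 0.05:  # illustrative significance threshold
    print(f"Scores indistinguishable (p={p_value:.2f}): reproduced")
else:
    print(f"Significant deviation (p={p_value:.2f}): not reproduced")
```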
3. FAIRNESS BUILDING BLOCKS
In the following, we introduce the components that can help to build a FAIR ecosystem addressing the obstacles mentioned in Section 2, while reducing the cost of FAIRification.
The concept of the DO, introduced by Kahn and Wilensky [2], provides a framework to identify and trace any digital object, such as data, algorithms, or workflows. DOs, alongside Persistent Identifiers (PIDs) and metadata, are the elements that constitute an FDO [24]. The FDO, as a self-contained, typed, machine-actionable data package, can provide the basic components of a standardised, FAIR infrastructure [3]. The FDO lays out a foundation which can be used to bring FAIRness to all components of science. As the adoption of FDOs is highly domain-oriented, we refer to Lannom et al. [26] for more details on their application in ESS.
The Research Object (RO) is a related concept, introduced by Bechhofer et al. [27], whose main focus is on born-digital objects and the aggregation of data and collections. Thus, the RO is well suited for application in data-driven sciences. In addition, an RO can be associated with a DOI, thereby making it findable and accessible over the internet. The RO concept relies on the idea that each RO provides a unit of knowledge.
Therefore, an RO acts as a container of resources (including a series of FDOs) and is shareable within and across different research groups. Our approach uses the FDO and RO as its building blocks to create a FAIR framework.
3.1 Notebook and Git
The emerging pattern of data-oriented research is to use in-situ analysis and visualisation on the HPC system. This enables researchers to act quickly based on outputs, apply modifications, and restart their workflows [10]. The Jupyter notebook is a tool that researchers increasingly rely on [28]. Jupyter is also recognised by Hardisty and Wittenburg [9] as one possible CWFR solution. Beg et al. [29] also discuss the reproducibility of Jupyter notebooks as a scientific workflow. The Jupyter notebook provides a one-study, one-document concept that is easily shareable. In addition, it provides a user-friendly platform to document the software while ensuring reproducibility by combining the data, code, and software environment. While the Jupyter notebook offers the core functionality, JupyterHub extends the framework and brings flexibility to user groups [30]. JupyterHub provides access control and authentication, scalability with support for container and HPC technology, and it is portable from the cloud to a local machine. Despite these benefits, there are some limitations to deploying notebooks as a CWFR solution. Any modification to a notebook is immediate, and multiple executions of a notebook with different inputs lead to the loss of all previous information. In addition, any part of the notebook can be executed or skipped separately. This is known as the problem of the undefined state of a notebook. Thus, the constant prototyping and rapid development ecosystem of the Jupyter notebook threatens its adoption as a reproducible workflow [29]. Changes applied to notebooks usually fall into two categories: 1) code developments to introduce new features and methods, or 2) experimenting in the hyperparameter space. The primary tool to keep track of changes in algorithms will remain version control, particularly git [31]. As there are many comprehensive studies on the application of git for reproducibility, we refrain from repeating them and refer the reader to Ram [32]. Each newly committed snippet of code is identified by a unique commit ID. Thus, we build our approach around git. ML applications often need to experiment with the parameter space, as they require intensive hyperparameter optimisation and searches of the model space. It is essential to preserve the notebook state and each experiment's parameters. We propose an application of a notebook that we call the experiment dashboard, which is used to initiate any experiment notebook. The dashboard configuration, including a summary of all carried-out experiments, the paths to the data, and the commit ID associated with the current instance of the dashboard, is stored as an FDO. For executing individual instances of the notebook separately, we suggest a Python library called papermill that can run a new instance with parameter values passed to it. Any new experiment with new parameter values is passed to a new instance of the notebook, and all instances are preserved. As shown in Figure 1, the experiment dashboard passes three desired parameter values to three independent instances of the notebook. Each instance is executed individually, and the instances can also be run in parallel.
Figure 1: Integration of git and the experiment dashboard, utilising papermill to execute each experiment in an individual instance of the notebook with passed parameters, and the creation of the corresponding FDO with FDOm.
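To illustrate, the following is a minimal sketch of how an experiment dashboard could drive papermill; the notebook names, output directory, and hyperparameters are hypothetical.

```python
import os

import papermill as pm

# Hypothetical hyperparameter settings, one per experiment instance
experiments = [
    {"learning_rate": 1e-3, "batch_size": 32},
    {"learning_rate": 1e-4, "batch_size": 32},
    {"learning_rate": 1e-4, "batch_size": 64},
]

os.makedirs("runs", exist_ok=True)
for i, params in enumerate(experiments):
    # Each run writes a separate, fully executed copy of the notebook,
    # preserving its state together with the injected parameters.
    pm.execute_notebook(
        "train_experiment.ipynb",            # template notebook
        f"runs/train_experiment_{i}.ipynb",  # preserved instance
        parameters=params,
    )
```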
3.2 FDO and RO Modules
As there are no standard or widely used FDO libraries so far, a dedicated FDO module (FDOm) is envisioned to ensure the proper creation of FDOs. In the following, we define the guiding rules for FDOs created by FDOm. Every unique experiment generates one FDO, which points to one commit ID generated by git. FDOs can either point to data or to another FDO; this referencing is called interlocking FDOs. Any interlocking FDO appends all the locked FDOs to the one new FDO. Since FDOs are only backwards-looking, they can only be linked to data, commits, or other FDOs that exist at the time of creation. Besides, they include metadata and are searchable. In addition, we introduce technical rules. We expect the FDO to provide information about the host system that is used. This comprises system configuration data provided by the system administrator as well as the environment and libraries detected by FDOm. For FDOm to be useful for ML applications, it is essential to document the specific ML architecture and initial parameters. As TensorFlow [16] and PyTorch [17] are the main ML frameworks used in our project, FDOm will be tailored to receive the network summary directly. When an experiment ends, a unique PID is generated to identify the created FDO. FDOm stores this information as JSON-LD, which is human- and machine-readable. As shown in Figure 1, each instance of the notebook experiment is associated with a unique FDO. Furthermore, we envision an RO module, subsequently called ROm, to create the necessary RO, which may encapsulate several FDOs. Similar to FDOm, we foresee guiding rules for ROm. Each RO has a state attribute which can be open, archived, or published. An open RO is mutable, and new FDOs can be added to it. An archived RO has been packaged and is therefore immutable. A published RO is similar to an archived one, except that a DOI is assigned to it. Any new RO can be created from scratch or based on an archived or published RO. The ROm allows the researcher to inspect the involved FDOs and to select individual ones for later use. For example, any individual ML experiment can be selected and encapsulated as a new RO. We suggest preserving all experiments and their output, regardless of whether they have been successful and whether they are intended for publication. In contrast to the FDO, the RO concept is already quite well developed, and many implementations exist. Among these, we design our proposed ROm to be compatible with RO-Crate [33]. We believe that the combination of FDOm and ROm, as well as their integration with git and RO-Crate, can provide the necessary ecosystem to achieve a high level of granular reproducibility.
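Since FDOm is only envisioned at this stage, the following is a hypothetical sketch of how such a module might assemble an FDO record for one experiment and store it as JSON-LD; the field names, context URL, PID stand-in, and paths are illustrative assumptions, not an existing API.

```python
import json
import platform
import subprocess
import uuid
from datetime import datetime, timezone

def create_fdo_record(data_path: str, model_summary: str) -> dict:
    """Assemble a hypothetical FDO record for one ML experiment."""
    # Link the FDO to the exact code snapshot via the git commit ID
    commit_id = subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True
    ).strip()
    return {
        "@context": "https://example.org/fdo-context",  # placeholder context
        "@id": f"fdo:{uuid.uuid4()}",        # stand-in for a real PID
        "created": datetime.now(timezone.utc).isoformat(),
        "commitId": commit_id,               # code snapshot reference
        "dataPath": data_path,               # input data reference
        "modelSummary": model_summary,       # e.g. a network summary string
        "hostSystem": platform.platform(),   # coarse host-system information
    }

record = create_fdo_record("/path/to/training/data", "Conv2D->Dense")
with open("experiment_fdo.jsonld", "w") as f:
    json.dump(record, f, indent=2)  # human- and machine-readable JSON-LD
```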
3.3 Datacube Management
To address the challenge of efficient data management in the DeepRain project, we deployed an array-centric database. After evaluating several candidates, we opted for the Array Database Management System rasdaman [34], which offers geo-semantic query functionalities for multidimensional arrays, also referred to as datacubes [35]. Alongside traditional file-based data storage, rasdaman provides efficient management of large-scale objects and standardized data modeling [36], which contributes to data harmonization and allows for flexible access, extraction, analysis, and fusion of massive spatio-temporal datacubes based on a standardized query language. Figure 2 shows an exemplary query used to retrieve data from the DeepRain datacube. Data can be requested via a query, while the access pattern is registered in the related FDO.
Figure 2: Example query from the DeepRain datacube.
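rasdaman datacubes can be queried, for instance, through the OGC Web Coverage Processing Service (WCPS) interface. The following is a hedged sketch of how a notebook might retrieve a spatio-temporal subset; the server URL, coverage name, and axis labels are hypothetical and depend on the actual deployment.

```python
import requests

# Hypothetical rasdaman/WCPS endpoint and coverage name
endpoint = "https://datacube.example.org/rasdaman/ows"
wcps_query = """
for $c in (COSMO_DE_EPS_PRECIP)
return encode(
  $c[ansi("2016-06-01T00:00:00Z":"2016-06-02T00:00:00Z"),
     Lat(50.0:52.0), Long(6.0:8.0)],
  "application/netcdf")
"""

# Submit the query via a WCS ProcessCoverages request
response = requests.get(
    endpoint,
    params={
        "service": "WCS",
        "version": "2.0.1",
        "request": "ProcessCoverages",
        "query": wcps_query,
    },
)
response.raise_for_status()
with open("precip_subset.nc", "wb") as f:
    f.write(response.content)  # NetCDF subset, ready for ML training
```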
4. THE PROPOSED FRAMEWORK
In this section, we propose a framework built on the tools mentioned previously, or modified versions thereof, to fit the research needs of ML applications in ESS. Our proposed concept relies on a granular approach that uses the FDO and RO as its building blocks. Every single ML experiment is registered as an FDO. To achieve reproducibility, the method used to produce the training data should be available, complemented by explicit descriptions of the pipeline and architecture used. A series of experiments with corresponding FDOs is then encapsulated in an RO. This method ensures that the entirety of the research remains reproducible and that FAIRness is not limited to publication. Besides, to achieve granular FAIRness, we suggest a FAIR-test, similar to the unit test, a standard practice in CS where small segments of code, such as functions, methods, or classes, are tested. The same principle can be integrated into research practices to validate the FAIRness of each segment of the carried-out research. Our suggested CWFR is built around JupyterHub as a platform that glues notebooks, FDOs, and ROs together. The Jülich Supercomputing Centre (JSC) has implemented an instance of JupyterHub called Jupyter-JSC that provides a suitable platform for our proposed framework [10]. Jupyter-JSC has access to a wide range of data storage, to CPUs and GPUs on JUWELS [37] and other HPC systems, and provides an easily accessible integration with git. The proposed framework is presented in the following prototype scenario. As researchers develop code, git provides a perfect avenue for preserving and reusing it in a Jupyter notebook instance running on the HPC infrastructure. When ML experiments are to be run, the experiment dashboard is used to initiate them. The required data is accessed via a path to the data on the local file system of the HPC system, via an online repository, or via a query to the datacube interface. The experiment dashboard then creates several Jupyter notebook instances according to the number of experiments, while also passing the ML experiment parameters. Each instance calls the FDOm, which sets up the associated FDO for each unique experiment, including the data (path), a summary of the ML architecture, the generated random seed, etc. The relevant local copies of downloaded results are also stored and tracked. Any subset of the FDOs associated with ML experiments can be encapsulated as an RO with the help of the ROm. The created RO can be used by researchers to share a holistic view of their work with collaborators, to archive it, to reuse it, or to continue working based on previous achievements. Work has begun on implementing the proposed concept, and we plan to derive a more concrete scheme in a follow-up technical paper.
Figure 3: JupyterHub is a platform that provides access, authentication, and an interface between the user and the data, environments, frameworks, and computing resources of the HPC system.
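To illustrate the FAIR-test idea introduced above, the following hypothetical, unit-test-style check validates that an experiment's FDO record carries the fields needed for reproducibility; the field names follow the illustrative FDOm sketch in Section 3.2 and are assumptions, not a fixed schema.

```python
import json

# Illustrative set of fields an FDO record must carry (an assumption
# following the FDOm sketch in Section 3.2, not a fixed schema)
REQUIRED_FIELDS = {"@id", "commitId", "dataPath", "modelSummary", "hostSystem"}

def test_fdo_record_is_fair():
    """FAIR-test: every experiment FDO must carry the metadata
    required to find, access, and reproduce the experiment."""
    with open("experiment_fdo.jsonld") as f:
        record = json.load(f)
    missing = REQUIRED_FIELDS - record.keys()
    assert not missing, f"FDO record missing fields: {missing}"
    # The commit ID must point to a concrete code snapshot
    assert len(record["commitId"]) == 40, "expected a full git SHA-1"
```

Such tests could be run routinely (e.g., with pytest) after each experiment, so that FAIRness is validated at the same granularity as the experiments themselves.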
5. CONCLUSION AND FUTURE WORK
We discussed the necessary elements of a canonical workflow for ML applications in ESS to ensure FAIR and reproducible research while exploiting HPC capabilities. Our solution builds upon the FDO and RO, and we envision a concept of a FAIR unit-test with which a researcher can validate the FAIR practices for small segments of code and experiments. We have introduced some basic rules that ensure that FDOs and ROs are human- and machine-actionable and that they can achieve scientific reproducibility. For this, we proposed the two modules FDOm and ROm to enforce the suggested basic rules without introducing unnecessary changes to the researcher's workflow. The modules ensure that each experiment is identified by a unique FDO and that series of experiments are encapsulated as ROs. In addition to file-based data storage, datacubes provide quick access to data with an integrated FDO pointer function. We have proposed the Jupyter notebook as the core of the CWFR while acknowledging its limitations, in particular the undefined state of a notebook. We suggest an experiment dashboard from which the researcher can initiate each new experiment as an independent notebook. Papermill is a Python-based library that allows us to preserve and document the changes in each notebook independently. The approach presented in this study aims to minimize the technological barriers for ESS researchers to shift towards integrated FAIR practices. Nevertheless, elevating the fundamental knowledge and skills in CS should remain a goal for ESS communities, because CS has developed many concepts and tools to ensure versioning, tracking, reproducibility, and portability. These tasks constitute the backbone of FAIR and reproducible research and are the pillars on which canonical workflows can be built.
AUTHOR CONTRIBUTIONS
A. Mozaffari ([email protected]) has initiated the concept study and developed the core ideas in discussions with M.G. Schultz ([email protected]), M. Langguth ([email protected]) and B. Gong ([email protected]). All authors have made meaningful and valuable contributions to writing and revising the manuscript. A. Mozaffari has led the editorial process.
ACKNOWLEDGEMENTS
The authors would like to thank the German Bundesministerium für Bildung und Forschung (BMBF) for funding the DeepRain project under grant agreement 01 IS18047A-E. The authors gratefully acknowledge the Earth System Modelling Project (ESM) for funding this work by providing computing time on the ESM partition of the supercomputer JUWELS [37] at the Jülich Supercomputing Centre (JSC). The authors would like to thank the two anonymous reviewers and the guest editor for suggestions and comments that significantly improved the manuscript.