Canonical Workflows in Simulation-based Climate Sciences

Abstract In this paper we present the derivation of Canonical Workflow Modules from current workflows in simulation-based climate science in support of the elaboration of a corresponding framework for simulation-based research. We first identified the different users and user groups in simulation-based climate science based on their reasons for using the resources provided at the German Climate Computing Center (DKRZ). What is special here is that the DKRZ provides the climate science community with resources like high-performance computing (HPC), data storage and specialised services, and hosts the World Data Center for Climate (WDCC). Therefore, users can perform their entire research workflows, up to the publication of the data, on the same infrastructure. Our analysis shows that the resources are used by two primary user types: those who require the HPC system to perform resource-intensive simulations and subsequently analyse them, and those who reuse, build on and analyse existing data. We then further subdivided these top-level user categories based on their specific goals and analysed the typical, idealised workflows applied to achieve the respective project goals. We find that, due to the subdivision and finer granularity of the user groups, the workflows show apparent differences. Nevertheless, similar “Canonical Workflow Modules” can clearly be identified. These modules are “Data and Software (Re)use”, “Compute”, “Data and Software Storing”, “Data and Software Publication” and “Generating Knowledge”, and in their entirety they form the basis for a Canonical Workflow Framework for Research (CWFR). It is desirable that parts of the workflows in a CWFR act as FAIR Digital Objects (FDOs), but we view this aspect critically. We also reflect on the question of whether the derivation of Canonical Workflow Modules from the analysis of current user behaviour still holds for future systems and work processes.


INTRODUCTION
In addition to long-term meteorological measurements, results obtained from simulations performed with climate models form the main basis for research and statements on past and possible future global, regional and local climate. These models are based on mathematical equations that express fundamental physical relationships and laws, such as the laws of conservation of mass, momentum, and energy [e.g., 1,2,3]. Running these models, e.g., at high spatial resolution, is computationally expensive and requires high-performance computing (HPC) infrastructure. Depending on the experiment, the data volume for simulations over long periods of time and/or at high spatial resolution can reach sizes of terabytes (TB) to petabytes (PB) [4,5]. Therefore, the geoscience community is working to develop and establish effective and reproducible scientific workflows [6]. The aim is to enable researchers to spend more time on the actual scientific work [7]. Furthermore, several people or teams often collaborate in projects of coordinated climate simulations in which tasks are shared. This once again highlights the need for standardised workflows.
With the help of climate models, it is possible to investigate and determine interactions of the individual components of the climate system, but model output also serves as input data for impact models, i.e., calculations of how climate change can affect ecosystems, urban development or various other systems [e.g., 8,9,10]. Climate models and climate data are the major components of different disciplines of climate science, and they form the basis for assessing the risks and opportunities of future climate change and for developing adaptation measures [e.g., 11,12,13]. Various statistical methods are used for the scientific evaluation, but also for the processing and use of climate model data [e.g., 14,15,16]. Further, the ever-increasing data volume and complexity of analysis workflows have spawned the development of novel data access and analysis infrastructures [e.g., 17,18,19]. Upcoming major funding initiatives facilitating a step-change in climate science are already on the horizon [e.g., 20] and will allow for the major investments in infrastructure needed to transform the way simulation-based climate science is performed [21].
High Performance Computing (HPC) centres, like the German Climate Computing Center (DKRZ), are major partners for climate research, as they provide essential infrastructure such as HPC resources, data storage and tailored services to support simulation-based climate science. It should be pointed out here that climate scientists can and do perform the entire suite of their data-intensive workflows using the DKRZ infrastructure and services, ranging from planning and setting up model simulations, analysing the model output and reusing existing large-volume datasets to data publication and long-term archival. This allows us to analyse DKRZ user behaviour for the purpose of devising generalised analysis workflows in simulation-based climate science. These are then amenable to further abstraction to support the development, adaptation and dissemination of the Canonical Workflows For Research (CWFR) [22] concept.
In a first step, we present the results of our user analysis. Based on this, we present the CWFR modules we derived from it. We then take a critical look at the use of FAIR Digital Objects (FDOs) [23,24] in the CWFR process with its possibilities and limitations. Finally, we summarise and outline first solutions.

User Types Analysis
As a basis for deriving CWFR modules in simulation-based climate science, we first analysed the current workflows of DKRZ users. We distinguish two main types of DKRZ users: modellers and data (re)users. However, it is also possible that a modeller switches roles to become a data (re)user. Modellers are further divided into modellers with little to no data use in the modelling process itself and modellers with unconditional data use, i.e., users who run models that require existing input data. Both subgroups of modellers are further subdivided into modellers pursuing a scientific goal, model developers and modellers performing production runs. The latter two groups usually do not pursue a scientific objective, although there are, of course, users who nevertheless pursue a research goal afterwards, i.e., change their role.
The group of data (re)users is divided into three subgroups: researchers pursuing a scientific goal, (climate) service providers and impact modellers. It should be noted that the above presents an idealised view of the DKRZ user groups. Each group has been analysed in detail. In each group there are sub-groups that may not carry out parts of the workflows or may carry them out repeatedly. This can lead to very complex, interlocking and interdependent workflows.

Detailed Workflow Examples
In the following, we illustrate the idealised workflows of just two specific user groups for the sake of brevity: climate modellers with unconditional use of existing data in the modelling process itself, and data (re)users. In both cases, the user pursues a scientific goal.

Example 1: Climate Modeller pursuing a scientific question
The climate modeller in this example (Figure 2) (re)uses data to drive the climate model. This data either already exists in the local infrastructure or is copied from external resources (Step 1 in Figure 2).

Figure 2.
Idealised linear workflow of a climate modeller pursuing a scientific question. Steps from left to right: reusing existing data, carrying out model simulations at the HPC system, storing model output data, analysing model results and comparing with existing data (reuse), storing output data from the analysis, publishing model output data and analysis data in the repository, and generating knowledge; light grey boxes indicate the converting processes.
The data are then converted (e.g., by an input model) to comply with the input format required by the climate model. Before executing the model simulation, the model has to be adapted and compiled according to the HPC system specifications. In the example shown, the scientist runs several model simulations with the same input data but different model configurations. Output data are obtained for each model run. The model needs a converter to write the output data in certain formats. In subsequent analyses and calculations on a compute server, the scientist compares the data of the different simulations with one another, but also with other existing datasets, e.g., observation data. This comparative data usually also requires reformatting (converting) prior to analysis. Eventually, new data are created, stored and, like the output data of the simulations, published. A converter adapting the data to the standards required by the repository is usually needed as well. The workflow is finalised by answering the scientific question and generating new knowledge.
The scheme shown in Figure 2 would look identical if not one but several different models were used, as for example in a Model Intercomparison Project such as CMIP6 [4].
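The linear structure of this workflow can be sketched in code. The sketch below is purely illustrative: the function names, the dict-based data objects and the dataset name "ERA-reanalysis" are hypothetical stand-ins for the real input models, format converters and climate model runs described above.

```python
# Illustrative sketch only: step and converter names are hypothetical stand-ins
# for the real tools a modeller would use (input models, format converters,
# the climate model itself).

def convert(data, target_format):
    """Converter: adapt a dataset to the format the next workflow step expects."""
    return {**data, "format": target_format}

def run_simulation(forcing, config):
    """Run one model simulation (stub) and return its output dataset."""
    assert forcing["format"] == "model-input", "forcing must be converted first"
    return {"format": "model-output", "config": config, "source": forcing["name"]}

def workflow(raw_forcing, configs):
    """Idealised linear workflow of Example 1: reuse -> convert -> simulate (per config) -> publish."""
    forcing = convert(raw_forcing, "model-input")                   # input converter
    runs = [run_simulation(forcing, c) for c in configs]            # several runs, same input
    published = [convert(r, "repository-standard") for r in runs]   # converter before publication
    return published

results = workflow({"name": "ERA-reanalysis", "format": "archive"},
                   configs=["ctrl", "sensitivity-a", "sensitivity-b"])
```

Note how the same `convert` step brackets the compute stage on both sides, which is exactly where the converters (light grey boxes in Figure 2) appear in the idealised workflow.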

Example 2: Data (re)user for reasons of analysis
The workflow of a scientist reusing existing data and pursuing a scientific question is shown in Figure 3. The data either exist in the local infrastructure or are copied from external sources (light orange object and dashed line in Figure 3). The data, usually available in different (meta)data formats/standards, are read in for analysis by specific routines, which act as converters. Reuse of such routines or converters is uncommon, i.e., almost every scientist conceives an individual solution. Next, the prepared data are analysed by the scientist. For this, the scientist employs the local software environment and computing infrastructure, e.g., analysis servers or even HPC if massively parallel operations are needed. As a result of the analyses, new data are generated, stored and published. Another converter is used to adapt the data to the standards required by the repository. The scientist answers the scientific question and thus generates new knowledge, e.g., by summarising the research results in a scientific publication.

Figure 3.
Idealised linear workflow of data reuse following a research question. Parts from left to right: reusing existing data, analysing and comparing existing data (reuse), storing output data from the analysis, publishing analysis data in the repository, and generating knowledge; light grey boxes indicate the converting processes.

DERIVING CANONICAL WORKFLOWS FROM USER BEHAVIOUR
The two illustrated workflows show apparent differences, but also similarities. Specifically, we see that steps 4 to 7 in example 1 (Figure 2) are identical to steps 2 to 5 in example 2 (Figure 3). Therefore, "Canonical Workflow Modules" (as we call them) can clearly be identified and are amenable to abstraction.
The modules we define are "Data and Software (Re)use", "Compute", "Data and Software Storing", "Data and Software Publication" and "Generating Knowledge" (compare Figure 4). In a specific workflow, these modules can be repeated individually or in complex interaction. However, individual modules can also be omitted or not run through. Between the Canonical Workflow Modules, "Converters" must be used, which can be Canonical Workflows themselves or FAIR Digital Objects [23,24].
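The alternation of modules and converters, including the possibility of omitting individual modules, can be sketched as follows. The module names match the Canonical Workflow Modules defined above; everything else (the dict-based data objects, the format strings) is a hypothetical stand-in for illustration.

```python
# Sketch of the module/converter alternation: a converter sits between any two
# Canonical Workflow Modules, and individual modules may be skipped.

def make_module(name):
    """Return a stub module that records its execution in the object's history."""
    def module(obj):
        return {**obj, "history": obj["history"] + [name]}
    return module

def converter(obj, fmt):
    """Converters adapt formats/metadata between modules."""
    return {**obj, "format": fmt}

CANONICAL = [
    ("Data and Software (Re)use", "module-input"),
    ("Compute", "analysis-ready"),
    ("Data and Software Storing", "archive"),
    ("Data and Software Publication", "repository-standard"),
    ("Generating Knowledge", "publication"),
]

def run_workflow(obj, skip=()):
    """Run the canonical chain; modules listed in `skip` are omitted."""
    for name, fmt in CANONICAL:
        if name in skip:
            continue
        obj = converter(obj, fmt)
        obj = make_module(name)(obj)
    return obj

# e.g., a workflow that never publishes its data:
result = run_workflow({"history": []}, skip=("Data and Software Publication",))
```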

Data and Software (Re)use
Data and software (re)use is an important Canonical Workflow module. Data are accessed either directly at the local computing infrastructure or obtained from a repository/archive. In the latter case, additional steps are needed to either copy the data to the computing infrastructure or access and process them remotely. In the modelling process, input data must conform to a certain form or standard (see also Section 3.6).
Software sharing is still not very established and can be challenging [25]. Specifically, sharing of model code is not common practice across institutions in climate science [26], due to intricate licensing frameworks and the non-trivial code adaptations necessary to run it on different computing infrastructures.

Compute
The CWFR module "Compute" incorporates all kinds of computing infrastructure use. We distinguish between two types of infrastructure: HPC and analysis servers. HPC environments are used for performing model simulations, whereas analysis servers are used for pre- and post-processing of data as well as for analysis. Of course, very computationally intensive analyses, like deep learning approaches, must also be performed on an HPC system. The "Compute" module also includes a distinction between the use of local or external infrastructure, which affects the converters used (compare Section 3.6).

Data and Software Storing
As described in the examples of scientific user workflows, output data but also new software (new parts of the model code or analysis software) are created in the process. Writing and storing data and software is thus another potential CWFR module. Again, we distinguish between whether data are stored close to the user's computing infrastructure or transferred to external resources.

Data and Software Publication
The process of data or software publication is complex, as many aspects have to be taken into account. Institutional definitions of standards and licences have to be reconciled with those of the funders and the repositories. Software is subject to a different legal framework than data. There are currently discussions within the RDA [27], but also at the national level, on how to deal with research software publication in order to increase awareness of its relevance [28] and provide guidance. In the context of publication, converters play a decisive role because they are used for the mapping of metadata.
From a less technical perspective, it should be noted that data publication is most often only carried out when requested by journals, funders or institutions. This is because data publication (and sharing) is currently still not considered from the beginning of the research project, and the data preparation and publication process is considered too time-consuming given the lack of evident scholarly benefits [29,30,31,32]. The integration of the publication process into the CWFR framework is therefore crucial to incentivise increased data publication.

Generating Knowledge
Generating new knowledge is not a workflow step in the sense that it leads to a technical implementation. Nevertheless, we include it in the workflow chain because it is an important link between research phases in the scientific cycle and plays a role in the further development of theories and solutions. In addition to the pure gain of knowledge, its dissemination, i.e., the sharing of knowledge with other scientists, can also be considered here, for example through a scientific publication or the presentation of the results at a scientific conference.

Converter
Converters are very important elements in the entire workflow and therefore in a CWFR. They occupy a special position because they are not bound to an individual research workflow but can be used and reused in a variety of generalised ways, as described in Section 2. In Figures 2 and 3 they are shown in light grey. Currently, they are individually developed and applied by the scientists themselves, which takes valuable time that would be better spent on the scientific question. Therefore, converters offer the greatest potential for technical standardisation, making them more transparent and allowing them to act as FDOs (see also Section 4).
Converters themselves can be individual programmes, but also more complex processes consisting of several technical elements. For example, the publication process consists of transforming metadata standards, adapting data formats, assigning PIDs, creating entries in repository databases, etc.
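A large part of such a publication converter is plain metadata mapping. The sketch below maps a hypothetical internal metadata record to DataCite-style keys (`creators`, `titles`, `publicationYear`); the internal field names and the `additional` bucket for unmapped keys are assumptions made for illustration, not an actual repository interface.

```python
# Hypothetical metadata converter: map internal record keys to DataCite-style
# keys; anything without a mapping is preserved under "additional".

KEY_MAP = {
    "author": "creators",
    "title": "titles",
    "year": "publicationYear",
}

def to_repository_metadata(record):
    """Translate an internal metadata record into the repository's standard."""
    out, extra = {}, {}
    for key, value in record.items():
        if key in KEY_MAP:
            out[KEY_MAP[key]] = value
        else:
            extra[key] = value  # keep unmapped keys instead of silently dropping them
    if extra:
        out["additional"] = extra
    return out

meta = to_repository_metadata(
    {"author": "Doe, J.", "title": "Run ctrl-01", "year": 2021, "model": "ICON"}
)
```

In a real publication converter, this mapping step would be followed by format adaptation, PID assignment and the creation of repository database entries, as listed above.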

FDOS IN CWFRS OF DATA INTENSIVE CLIMATE SCIENCE
CWFRs are the overarching constructs containing the identified workflow modules (Section 3). In these, FDOs are the data products resulting from CWFR modules, as well as the converters and the routines used in the CWFR modules (tools). The workflow itself is also an FDO, thereby facilitating reuse and reproducibility.
In simulation-based climate science, the use of FDOs in every step of the entire process chain would be groundbreaking, and the advantages are obvious:
• Provenance tracking and full reproducibility: For each FDO, complete provenance information is stored (in the PID profile and in the metadata). Source data, converters, processing software or models and model specifications, which are themselves specified as FDOs, are indexed. Ultimately, the last FDO in the workflow contains the information of the entire workflow. If software is viewed as an FDO, this object contains not only the usual metadata such as authors and licences, but also all information on the infrastructure, compiler and compiler specifications.
• Creation of globally federated databases: FDOs have a unique, globally accessible PID, allowing for global access and data (re)use. Globally federated databases of FDOs building on existing infrastructure concepts [e.g., 17] would thus be achievable.
• Supporting Big Data approaches: One part of Big Data approaches is the combined use of heterogeneous data. To achieve this, data must be standardised and described in terms of their heterogeneity to facilitate machine-actionability. FDOs have a decisive role here.
• Automated selection of analysis tools and libraries: With regard to non-expert users, interfaces which automatically select analysis tools and libraries in the background are necessary for a desired workflow to run through. These interfaces can also be designed as FDOs.
However, the full implementation of the CWFR concept in climate science would be challenging (Section 5.2).
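The first advantage above, provenance tracking through the chain of FDOs, can be made concrete with a minimal sketch. The record structure, the in-memory `REGISTRY` standing in for a PID resolution service, and the placeholder Handle prefix are all hypothetical; real FDOs carry this information in PID profiles and metadata.

```python
# Minimal sketch of FDO-style provenance: each record gets a PID and pointers
# to the PIDs of its input objects, so the last FDO reaches the whole workflow.
import uuid

REGISTRY = {}  # stands in for a global PID resolution service

def mint_fdo(payload, inputs=()):
    """Create an FDO-like record with a PID and provenance pointers."""
    pid = f"hdl:21.T0000/{uuid.uuid4().hex[:8]}"  # placeholder PID namespace
    REGISTRY[pid] = {"pid": pid, "payload": payload, "inputs": list(inputs)}
    return pid

def provenance_chain(pid):
    """Walk provenance pointers: returns every FDO upstream of `pid`."""
    seen, stack = [], [pid]
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.append(p)
            stack.extend(REGISTRY[p]["inputs"])
    return seen

forcing = mint_fdo("input data")
model = mint_fdo("model code + compiler specifications")
output = mint_fdo("simulation output", inputs=[forcing, model])
```

Resolving the last FDO (`output`) recovers the full provenance of the workflow, which is exactly the property that makes the final object self-describing.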

Status Quo Workflows vs. Future Workflows
We have identified Canonical Workflow Modules based on current user behaviour, but do they hold up for the future? The modules are defined generally enough that they should continue to exist as such. However, it is expected that users will not actively engage with the implementation or processing of certain modules in the near to distant future. This is especially true for the Data and Software Storing and Publication parts, including the standardisation of metadata. We further expect that with the establishment of globally federated databases, the distinction between "local" and "external" will technically no longer make a difference. The aspect will remain conceptually, but the question is how the information of the overall workflow will be preserved. A system is needed to log individual, location-independent workflow steps and record them in FDOs.

Critical View on the Use of FDOs in Simulation Based Climate Science
Numerical models are the essential component of simulation-based climate science. The model code is often not freely available and is subject to strict licensing and usage agreements. Only a few models are freely accessible in their source code [26] and can thus really be declared as FDOs.
Further, simulation results depend on the infrastructure used for the simulation [33]. HPC systems are usually replaced after 4-6 years. Once the infrastructure used is decommissioned, the data are no longer reproducible. This means that indexing the model and model setup as an FDO would be possible, but the setup would no longer be repeatedly executable. To overcome this issue, efforts to establish the use of workflow tools and container approaches in HPC environments with concrete applications in simulation-based climate science are being considered [e.g., 34,35,15, and references therein]. Storing the workflows themselves instead of the data would also be a very good approach in terms of saving resources (storage space).
In order to really introduce FDOs in the area of software (model and analysis tools), decisive prerequisites must be created, e.g., the ability to reference the exact state of the source code, such as a specific git commit. Furthermore, there are currently no standards for PID profiles. Without standardisation in this area, the introduction and use of FDOs is very difficult, as PID profiles form an essential basis for the FAIRness of data objects and contain important information about the data or software object.

SUMMARY AND OUTLOOK
We have used the example of DKRZ user behaviour to devise building blocks of CWFRs. DKRZ provides users with computing power and storage capacity but is also a repository for climate data (model and observation data), which enables data publication but also direct data reuse. Thus, scientists can carry out all the steps of their research work in one overall system. This makes it possible to develop and implement the CWFR framework in an interlocked manner.
We identified different users and user groups based on their reasons for using the supplied resources and then examined the workflows they apply in their work. Our analysis shows that the resources are used by two primary user types: those who require the HPC system to perform resource-intensive simulations and subsequently analyse them, and those who reuse, build on and analyse existing data, stored either on local file systems or obtained from external resources. We found that there are superordinate CWFR modules that several user groups use in the same way or that are repeated again and again. These offer the potential for automation or the use of FDOs. However, we have also taken a critical look at the use of FDOs in this subject area as well as at possible future changes to workflows. Because climate science is rapidly proceeding towards exascale computing [20], current scientific workflows will soon no longer be fit for purpose [21]. Furthermore, the reproducibility of simulation results is very limited due to constant changes in HPC infrastructures [33,25].
We are already addressing these challenges: the Free Evaluation System Framework for Earth System Modelling [Freva, 18] is being further developed at DKRZ. Freva can be used to orchestrate all steps of the scientific workflows necessary to perform research in data-intensive climate science. It builds on a standardised database, provides a programming interface, tracks provenance information and can be run from the command line or a web interface. Freva is currently undergoing major updates and improvements, like direct data access to repositories and the handling of numerical model simulations, to support future research projects requiring a step-change in the way climate science is performed.
Overall, the complexity of the individual CWFR modules we identified here poses a great challenge, as it is not sufficient to provide individual solutions, but to always keep an eye on the interaction of all components.
Looking at the defined Canonical Workflow Modules and corresponding sub-modules, one finds that they are transferable to all disciplines that follow simulation-based approaches (data-, simulation- and/or analysis-intensive disciplines). These are not only fields from physics or chemistry, but also political science, economic research, etc.