Abstract
The UK Catalysis Hub (UKCH) is designing a virtual research environment to support data processing and analysis, the Catalysis Research Workbench (CRW). The development of this platform requires identifying the processing and analysis needs of the UKCH members and mapping them to potential solutions. This paper presents a proposal for a demonstrator to analyse the use of scientific workflows for large scale data processing. The demonstrator provides a concrete target to promote further discussion of the processing and analysis needs of the UKCH community. We discuss the main requirements elicited for data processing, the proposed adaptations that will be incorporated in the design of the CRW, and how to integrate the proposed solutions with existing practices of the UKCH. The demonstrator has been used in discussion with researchers and in presentations to the UKCH community, generating increased interest and motivating further development.
1. INTRODUCTION
Experimental and computational simulation techniques developed to understand the nature of materials and their practical applications in catalysis research rely on the use of data for building and validating complex models (such as the example in Figure 1). The UK Catalysis Hub (UKCH) enables cutting-edge research in catalytic science by facilitating access to state-of-the-art resources and expertise. UKCH provides access to well-equipped laboratories and to central facilities provided by the Science and Technology Facilities Council (STFC), and offers expert advice for processing and analysis of the data produced from experiments and theoretical models.
Figure 1. Diagram of an in-situ XAFS analysis experiment [11]. The targets of this proposal are the processes and outputs after XAFS microspectroscopy①, on the lower rightmost branch of the experimental process, using ATHENA and ARTEMIS for processing and analysis of XAFS data.
UKCH researchers use advanced processing and analysis software such as Mantid [3], DAWN [4], Larch [29], and Demeter [33] to handle the data produced by their research projects. These tools allow scientists to process and analyze data interactively. Additionally, each scientist has a choice of analysis software such as MATLAB, R, and Excel, to further analyze data and to format results for publishing. STFC facilities (CLF [7, 8], Diamond [13, 20], and ISIS [15, 21]) operate 24 hours a day and have the capacity to perform thousands of readings, which produce large datasets that require further processing and analysis. Naturally, the time employed in processing and analyzing data increases with the size of the datasets. Moreover, new experiment proposals aiming to collect even larger quantities of data push the boundaries of the processing capacity of analysis tools [37].
With the current and future requirements for processing and analysis of increasing data volumes in mind, the UKCH started designing a virtual research environment, the Catalysis Research Workbench (CRW). The development of this platform requires identifying the processing and analysis needs of the UKCH members and mapping them to potential solutions. In this requirements collection phase, the UKCH implemented a workflow demonstrator to foster further discussion and analysis of the requirements for the CRW. The goal of the demonstrator is to introduce the concept of managed scientific workflows and discuss their integration in the day-to-day practices of UKCH researchers. The demonstrator has been used in discussion with researchers and in presentations to the UKCH community, generating increased interest and motivating further development.
2. RELATED WORK
The use of software prototypes is an established software engineering practice [6, 22, 28, 34, 35, 36]. A demonstrator is a type of functional prototype which is used in proof-of-concept studies to support the illustration of complex design proposals to a wide range of system stakeholders. The demonstrator can be presented by the designer who describes the details of the implementation while performing a specific set of tasks, often scripted, and then requests feedback from the user community.
There are various cases in which prototypes (and demonstrators) have been used successfully to present implementation proposals and to refine and prioritize user requirements. Prototyping has been used for multiple purposes such as the description of architectural decisions, discussion of interface design, and presentation of new functionalities. Davis et al. use a Web service-based e-science demonstrator to explain the architectural design for a text mining platform [10]. Klampanos et al. describe the implementation of an information registry prototype to demonstrate how it can enable collaboration and ensure consistency across the distributed infrastructure for Dispel and dispel4py [23]. Leong et al. present the implementation of three use cases to demonstrate the feasibility and benefits of applying a cloud-driven approach to supercomputing ecosystems [25] for large scale experimental facilities.
In the workflow domain, Goble et al. used a demonstrator to present the design principles and functionality of the myGrid middleware suite, to facilitate the work of bioinformaticians [17]. Nieva et al. describe the use of different prototypes in the review of alternative designs for a web interface for the Taverna workflow management system [28]. Watkins et al. present Workspace, a scientific workflow system that includes rapid prototyping features enabling the testing of different components and configurations during the design of complex workflows [38].
3. PROBLEM FORMULATION
Large scale research facilities such as the Central Laser Facility (CLF [7, 8]), Diamond Light Source (Diamond [13, 20]), and ISIS Muon and Neutron Source (ISIS [15, 21]) have an operational framework supported by their Data Management Policies. This framework governs their Laboratory Information Management (LIM) systems and the Data Management System (DMS). The main commonality of these facilities is that they use ICAT, an advanced catalogue system that combines LIM and DMS functionalities [16]. ICAT is developed by the Scientific Computing Department of the Science and Technology Facilities Council (SCD-STFC) and other institutions. The ICAT system contains complementary data for each experiment, such as proposal, PI, Experimenter, Grant(s), device(s), experiment metadata and experiment results. As a result, the extended workflows of CLF, Diamond and ISIS can be generalized as shown in Figure 2.
Figure 2. The generic processing workflow of CLF, Diamond, and ISIS (adapted from [27]). The data reduction and analysis tasks highlighted in red are traditionally performed by facilities users, decoupled from facilities.
As Figure 2 indicates, Data Reduction and Data Analyses tasks are entrusted to the facilities users, i.e., scientists who have been awarded experimental time at the facilities. These tasks are the ones which require further support, as researchers report that processing and analysing data after the experiment requires a substantial amount of time and processing resources. The research facilities provide software for collecting and formatting the data generated (for instance Mantid [3] and DAWN [4]); however, the researchers still need to handle the data and combine it with other data according to their objectives. Researchers rely on a combination of data and software resources (own and shared) in their daily work. In this context, there are several issues that the researcher needs to handle, such as mastering the use of several types of analysis tools including lab equipment, processing software and databases; converting data so that it can be used at different stages; and ensuring the reproducibility of the results by tracking equipment and software used, entry parameters, intermediate results, and versions of completed runs.
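The tracking needed for reproducibility (software used, entry parameters, intermediate results) can be illustrated with a small provenance record. The following is a minimal sketch and not part of the demonstrator: the `RunRecord` class, the `digest` helper, the parameter name, and the sample data are all hypothetical, and only the software name and version are taken from the experimental setup described later in the paper.

```python
import hashlib
import json
from dataclasses import dataclass, field, asdict


def digest(data: bytes) -> str:
    """Return a SHA-256 hex digest identifying an input or intermediate result."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class RunRecord:
    """Hypothetical record of one execution of a workflow task."""
    task: str            # e.g. "Normalise Data"
    software: str        # tool used for the task
    version: str         # version of that tool
    parameters: dict     # entry parameters used for the run
    input_digests: list = field(default_factory=list)
    output_digests: list = field(default_factory=list)

    def to_json(self) -> str:
        # sort_keys gives a stable serialisation, so records of
        # identical runs can be compared directly
        return json.dumps(asdict(self), sort_keys=True)


# Illustrative byte strings standing in for real data files.
raw = b"energy mu\n7100.0 0.12\n7101.0 0.13\n"
normalised = b"energy norm\n7100.0 0.00\n7101.0 0.01\n"

record = RunRecord(
    task="Normalise Data",
    software="Larch",
    version="0.9.47",
    parameters={"example_parameter": 1.0},  # placeholder, not a real setting
    input_digests=[digest(raw)],
    output_digests=[digest(normalised)],
)
print(record.to_json())
```

Storing one such record per task execution would let a researcher check later which inputs, parameters, and software versions produced a given result.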
4. THE PROPOSED APPROACH
The need to support users in the processing and analysis of research data has gained higher priority because the size and complexity of datasets are constantly increasing. This is the case for XAS analysis, with the development of higher-throughput analysis devices [37] and longer running times. Up to now, researchers have managed using interactive software for formatting, processing, reducing, and summarizing experimental data. However, researchers are spending longer hours processing and formatting data, which distracts them from their experimental work. These time-consuming activities are the target of the proposed workflows. We aim to build on the experience gathered in the adoption of workflow technologies and propose the creation of concrete examples which demonstrate the advantages of using scientific workflow management tools when compared to current processing practices.
4.1 Example for the Workflow Demonstrator
The explicit definition of processing workflows provides a complete view of the activities performed, the software used, and the data consumed and produced. After defining the workflow, its individual tasks can then be implemented modularly, allowing the combination and swapping of components. The processing of X-ray Absorption Spectroscopy (XAS) data is relevant because of the number of experiments performed and the quantities of data produced. Normally a scientist uses Artemis and Athena [33] in a well-defined, structured fashion. Moreover, the XAS processing workflow is well documented and there are several examples and tutorials on the use of Artemis and Athena for performing the workflow tasks [31, 33]. Athena and Artemis tasks can be scripted in Perl using Demeter. Additionally, there are alternative tools which have been proposed and can also be automated through scripting (e.g., Larch [29]). All these considerations made the XAS processing workflow the selected target to implement as an example for the demonstrator.
4.2 The XAS Processing Workflow
The XAS processing workflow consists of three tasks: Process Raw Data, Normalise Data, and Analyse Data. This division of the tasks is derived from Ravel's online courses [31, 32], the DAWN tutorials [14], and discussion with coauthors about processing practices. Figure 3A presents an overview of the three tasks of the XAS processing workflow. At this level, we can name the software, inputs, and outputs for each task of the workflow. The analysis of the workflow can be further refined to identify the sub-tasks within each task, providing a modular view of the workflow components. Figure 3B shows a finer grained description of the sub-tasks of the workflow. At this level, tasks are better defined as modules which can be implemented independently. This representation of the sub-tasks, including their relationships and precedence as well as their inputs, resources, and outputs, is the starting point for the implementation of the workflow.
Figure 3. Overview (A) and detailed view (B) of the XAS processing workflow.
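The precedence between the three top-level tasks can be written down explicitly, which is what makes components swappable. The sketch below is illustrative only: it models just the three tasks named in the text (not the sub-tasks of the detailed view), and it assumes Python 3.9 or later for the standard-library `graphlib` module. The `precedence` and `software` dictionaries are hypothetical names introduced here.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it can start.
precedence = {
    "Process Raw Data": set(),
    "Normalise Data": {"Process Raw Data"},
    "Analyse Data": {"Normalise Data"},
}

# Software assigned to each task in the manual configuration.
software = {
    "Process Raw Data": "DAWN",
    "Normalise Data": "Athena",
    "Analyse Data": "Artemis",
}

# A valid execution order is derived from the declared dependencies.
order = list(TopologicalSorter(precedence).static_order())
for step, task in enumerate(order, start=1):
    print(f"{step}. {task} ({software[task]})")
```

With this kind of description, swapping a component amounts to changing one entry of `software` (for example, assigning Larch to a task) while the declared order of the tasks stays untouched.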
4.3 Implementing the Workflow
After identifying the core tasks to implement, their sequence, and the expected inputs and outputs of each sub-task, it was possible to decide on alternative ways to implement the workflow and to define which metrics to use to analyse the performance of the different instances. Three types of workflow configurations have been implemented: Manual (Interactive/User driven), Scripted (automated with scripts), and Managed (Semi-automated/User supervised). The manual workflow is a reproduction of the textbook example, using the sample data from the literature [31, 32], and repeated using an example from our experimental colleagues. The manual version is also used to calculate the baseline time for execution of one full cycle of the workflow, from raw data to fitted data results. Two versions of the scripted workflow were implemented, one using Demeter and one using Larch. The Demeter version is scripted in Perl and allows running the same process as the manual workflow. The main difference is that the interface is text based and the operations are presented in a text menu. The Larch version of the scripted workflow is implemented using Larch and Jupyter Notebooks (Python). Finally, two managed versions of the workflow were designed to be executed using Nextflow [12] in combination with Larch and Demeter.
The first three versions of the workflow were fully implemented and used in demonstrations while the Nextflow managed version is in the process of being implemented for execution on a high-performance computing environment.
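The difference between the manual and scripted configurations can be illustrated with a batch loop that applies the same tasks to every reading and records the time of each full cycle. This is a minimal sketch under stated assumptions: `normalise` and `analyse` are hypothetical placeholders with toy logic, not calls to Demeter or Larch, and the readings are made-up numbers.

```python
import time
from statistics import mean


def normalise(reading):
    """Placeholder for the Normalise Data task (toy logic: scale by the maximum)."""
    peak = max(reading)
    return [value / peak for value in reading]


def analyse(normalised):
    """Placeholder for the Analyse Data task: returns a dummy 'fit' summary."""
    return {"points": len(normalised), "mean": mean(normalised)}


def run_batch(readings):
    """Run both tasks on every reading and time each full cycle."""
    results, durations = [], []
    for reading in readings:
        start = time.perf_counter()
        results.append(analyse(normalise(reading)))
        durations.append(time.perf_counter() - start)
    return results, mean(durations)


# Three made-up readings stand in for the groups of a real dataset.
readings = [[1.0, 2.0, 4.0], [2.0, 2.0, 2.0], [0.5, 1.0, 1.5]]
results, seconds_per_fit = run_batch(readings)
print(f"{len(results)} fits, {seconds_per_fit:.6f} s per fit on average")
```

The average time per fit returned by such a loop is the kind of metric that allows the different workflow instances to be compared on the same dataset.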
5. EXPERIMENTS AND ANALYSIS
Each of the three alternative implementations of the workflow was designed and tested using sample data from the textbook examples. The comparison of processing times was then made using two datasets containing data from actual UKCH experiments. The running times were then averaged and used for comparing the proposals, and the implementations were demonstrated to scientists to get feedback.
5.1 Datasets and Experimental Setup
As described previously, three datasets were used. The first dataset is used for development and testing and is the dataset of Ravel's textbook example [31, 32]. The textbook dataset consists of one file containing the crystallographic data for iron sulfide (pyrite, FeS2) and a transmission scan of FeS2 taken at room temperature at beamline 13BM at the Advanced Photon Source [32]. The other two example datasets consist of a nexus file containing Rh4CO spectral data gathered from the I20 Energy Dispersive EXAFS (EDE) beamline at Diamond Light Source and a crystallographic data file of tetrarhodium dodecacarbonyl obtained from the Crystallography Open Database [9, 18].
The software used for implementing the workflows included DAWN V.2.16.1, Demeter V. 0.9.26 (which includes Artemis and Athena), Larch V. 0.9.47, Perl V. 5.12.3, Python V. 3.6.10, and Jupyter V. 6.0.3. The system used for running the experiments was a laptop computer with Windows 10 64-bit operating system, Intel Core i5-8250 1.60 GHz Processor, and 8 GB Memory.
5.2 Overall Comparison of Results
The three implemented examples of the workflow were individually timed to compare their potential for speeding up the processing and analysis of XAS data. The manual version of the workflow based on the textbook example takes about 24 minutes to produce one complete run from raw data to fitted data results. This average time was obtained by performing the workflow activities manually with ten samples of the Rh4CO spectra and then averaging the processing time from start to finish. Using these data, we calculate that processing a dataset of 3,790 readings would take about 63 days. The experts in the group consider that they can perform one complete run in 10 minutes, which would require approximately 26 days to process the 3,790 readings dataset.
The first scripted version of the workflow uses Demeter and Perl and processes a dataset of 3,790 groups in about 22 hours (~1 day). This is a considerable improvement over the manual workflow. The second scripted version of the workflow uses Larch, Jupyter and Python. It is slower than the Demeter version but can still reduce the processing time to 103 hours (4.3 days), taking less than 20% of the time required to process a full dataset manually.
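The estimates above follow from multiplying the time needed for one fit by the number of readings. The short sketch below reproduces that arithmetic from the per-fit times reported in the text and in Table 1; the function and variable names are hypothetical, and sequential, one-at-a-time processing is assumed.

```python
READINGS = 3790  # readings (groups) in the dataset discussed in the text

# Minutes needed to produce one fit.
minutes_per_fit = {
    "Manual (novice)": 24,
    "Manual (expert)": 10,
    "Demeter-Perl scripted": 21 / 60,       # about 21 seconds per fit
    "Larch-Jupyter scripted": 814 / 500,    # 814 min reported for 500 fits
}


def total_hours(per_fit_minutes, readings=READINGS):
    """Total processing time in hours for sequential, one-at-a-time fits."""
    return per_fit_minutes * readings / 60


for name, minutes in minutes_per_fit.items():
    hours = total_hours(minutes)
    print(f"{name}: {hours:.0f} h ({hours / 24:.1f} days)")
```

Under these assumptions, the totals come to about 63 days for a novice, about 26 days for an expert, about 22 hours for the Demeter-Perl version, and about 103 hours for the Larch-Jupyter version.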
The results in Table 1 were obtained using a laptop computer with limited memory and processing power. In comparison, initial results for a Nextflow-Larch version of the workflow show the processing time reduced to 7 hours and 21 minutes for the largest dataset (4000 groups) when executed on the ARCCA-HPC cluster. At this stage, the positive reviews and suggestions for improvement received when presenting the demonstrator to stakeholders indicate that the approach could be applied to real-life scenarios.
Table 1. Processing times of the manual and scripted workflow versions.
Workflow | Task | Software | Time | Input | Output |
---|---|---|---|---|---|
Manual | Process raw data | DAWN | 8 min | 1 nexus [.nxs] file | 3580 – 4000 files |
Manual | Normalise data | Athena | 3 min | 1 data [.dat] file | 1 Athena file |
Manual | Analyse data | Artemis | 21 min | 1 Athena [.prj] file; 1 Crystal [.inp/.cif] file | 1 Artemis file |
Manual | Novice user processing 1 dataset | | ~63 days (~24 min to produce 1 fit) | | |
Manual | Expert processing 1 dataset | | ~26 days (~10 min to produce 1 fit) | | |
Demeter-Perl scripted | Process raw data | DAWN | 8 min | 1 nexus [.nxs] file | 3580 – 4000 files |
Demeter-Perl scripted | Normalise data | Demeter | 64 min | 500 data [.dat] files | 500 Athena files |
Demeter-Perl scripted | Analyse data | Demeter | 21 min | 500 Athena [.prj] files; 1 Crystal [.inp/.cif] file | 500 Demeter [.dpj] files; 500 Fit [.fit] files; 500 Log files |
Demeter-Perl scripted | Processing a dataset with 3,790 groups | | ~22 hours (~21 sec to produce 1 fit) | | |
Larch-Jupyter scripted | Process raw data | DAWN | 8 min | 1 nexus [.nxs] file | 3580 – 4000 files |
Larch-Jupyter scripted | Normalise data | Larch | 8 min | 4000 data [.dat] files | 4000 Athena files |
Larch-Jupyter scripted | Analyse data | Larch | 814 min | 500 Athena [.prj] files; 1 Crystal [.inp/.cif] file | 500 Demeter [.dpj] files; 500 Fit [.fit] files; 500 Log files |
Larch-Jupyter scripted | Processing a dataset with 3,790 groups | | ~103 hours (~1.5 min to produce 1 fit) | | |
5.3 Results Analysis
The three versions of the workflow have been showcased and discussed with researchers on two separate occasions, providing valuable feedback, suggestions for improvement, and ideas for future developments. The workflows are not intended to be fully operational processing and analysis tools; instead, the functionalities and details of the examples are intended to illustrate the benefits of adopting a workflow-oriented design approach. The workflows were first demonstrated at a workshop with our coauthors, which served to demonstrate the feasibility of automating repetitive tasks and provided some recommendations for improving the workflows. The second presentation of the demonstrator, during one of the monthly UKCH seminars, exposed the workflows to a larger community and prompted suggestions and queries about implementing other analyses using workflows.
At this stage we can highlight the advantages and disadvantages of each of the workflow implementations, including the ones which are still under development. The scripted and managed versions of the workflows are faster for the processing and analysis of data. Moreover, expert users recommended improvements such as monitoring output values to determine if the executions should be terminated early.
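The recommendation to monitor output values and terminate executions early can be sketched as a simple check over the stream of fit results. This is an illustrative example only, not a feature of the demonstrator: the function `run_with_monitor`, the quality values, and the `limit` and `patience` settings are all hypothetical.

```python
def run_with_monitor(fits, limit, patience):
    """Consume fit results until `patience` consecutive values exceed `limit`.

    `fits` is any iterable yielding one quality value per processed reading
    (lower is better). Returns the values seen before stopping and a flag
    saying whether the run was terminated early.
    """
    seen, bad_streak = [], 0
    for value in fits:
        seen.append(value)
        # Count consecutive poor results; any good result resets the count.
        bad_streak = bad_streak + 1 if value > limit else 0
        if bad_streak >= patience:
            return seen, True
    return seen, False


# Made-up quality values: the run degrades from the fourth reading onwards.
values = [0.01, 0.02, 0.015, 0.30, 0.35, 0.40, 0.02]
processed, stopped = run_with_monitor(values, limit=0.05, patience=3)
print(len(processed), stopped)
```

Requiring several consecutive poor values before stopping is one possible design choice; it avoids abandoning a long run because of a single outlying reading.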
Workflow | Software | Type | Issues | Advantages |
---|---|---|---|---|
Manual | Artemis, Athena | Manual | Slow processing; limit on number of datasets loaded; individually processing datasets | Interactive visual interface; fine tuning control |
Demeter | Demeter, Perl | Scripted | Limited by processing resources; text interface | Processing large quantities of data |
Larch | Larch, Python, Jupyter Notebook | Scripted | Limited by processing resources; requires Demeter for one key task | Interactive visual interface; fine tuning control; processing large quantities of data |
Nextflow 01 | Demeter, Perl, Nextflow | Managed | Limited by processing resources; text interface | Processing large quantities of data; unsupervised execution |
Nextflow 02 | Larch, Python, Nextflow | Managed | Limited by processing resources; text interface; requires Demeter for one key task | Processing large quantities of data; unsupervised execution |
5.4 Relevance of the Canonical Workflow Framework for Research
The Canonical Workflow Framework for Research (CWFR) is aimed at facilitating the interoperation of data management workflows across institutional boundaries [19]. In order to achieve this, the CWFR aims to explicitly document the repetitive tasks which are common across diverse institutional data management workflows (Figure 4). For the management and exploitation of catalysis research data, the CWFR model allows stepping back and looking at possibilities for integrating the facilities' workflows with the workflows of other institutions accessing the facilities. Figure 5 shows how the STFC workflow is aligned with the workflows of other institutions, and how the tasks of these workflows can be mapped to the CWFR tasks.
Figure 4. Common tasks identified in the first version of the CWFR (adapted from [19]). The tasks are not all carried out by one institution; they are complementary and can be fulfilled by different institutions collaborating in the research effort. This is shown in the example provided in Figure 5.
Figure 5. Parallels between the experimental workflows of STFC facilities and other institutions mapped to CWFR tasks. The figure shows two workflows and the activities performed in parallel during research collaborations. The upper workflow is the same as that presented in Figure 2, while the lower one stands for the workflow performed by institutions accessing and collaborating with STFC facilities. The numbers in orange represent the tasks identified in the CWFR (see Figure 4). The five tasks highlighted in red correspond to activities supported by the type of workflows described in this paper.
The extended workflow for STFC facilities provides the basic scaffolding for integrating with other workflows. The main tool underpinning this workflow is the ICAT system [16]. ICAT combines the functionalities of a Laboratory Information Management (LIM) system and a Data Management System (DMS). ICAT registers complementary data for each experiment, such as proposal, PI, Experimenter(s), Grant(s), device(s), experiment metadata and experiment results. ICAT is common to many facilities (UK and overseas). ICAT supports the required management functionalities by implementing the Core Scientific MetaData model (CSMD) [26]. This model captures metadata about the experiments and datasets produced at the facilities managed by STFC. By design, the operational workflows of the facilities rely on the CSMD for managing experiments from the proposal stage to the collection and distribution of experimental data.
6. CONCLUSION AND FUTURE WORK
The implementation of the demonstrator with three versions of the XAS workflow is a first attempt to promote greater usage of the scientific workflow approach at UKCH. This first version of the demonstrator has stimulated interest in further research on workflow management platforms. We plan to continue the development by completing the Nextflow version [12] and possibly adding examples for Galaxy [1] and Taverna [39], which provide different benefits. These will then be evaluated in further demonstrations to gather more requirements for implementation.
Looking forward, the UKCH will try to standardize the procedures for describing and implementing other processing workflows to support data processing and analysis. For this, we are considering new examples, such as Quasi-Elastic Neutron Scattering (QENS) and X-Ray Powder Diffraction (XRD) processing workflows. In the longer term, the evaluation of workflow implementation alternatives will help the UKCH to better define the requirements and design constraints to be followed in the development of the Catalysis Research Workbench (CRW).
ACKNOWLEDGEMENTS
UK Catalysis Hub is kindly thanked for resources and support provided via our membership of the UK Catalysis Hub Consortium and funded by EPSRC grants EP/R026939/1, EP/R026815/1, EP/R026645/1, EP/R027129/1 or EP/M013219/1 (biocatalysis). We acknowledge the support provided by Advanced Research Computing at Cardiff (ARCCA) for the implementation and testing of the Nextflow version of the workflow. ARCCA is part of the Supercomputing Wales project, which is part-funded by the European Regional Development Fund (ERDF) via Welsh Government.
AUTHOR CONTRIBUTIONS
A. Nieva de la Hidalga contributed to the implementation and development of the workflow versions (published at https://github.com/UK-Catalysis-Hub/XAS-Workflow-Demo). D. Decarolis contributed to the definition of the problem and provided the experimental data used to evaluate the models. S. Xu, S. Matam, W.Y. Hernández Enciso, and J. Goodall collaborated in discussions about the design and functionalities provided. B. Matthews and C.R.A. Catlow provided feedback on the design of the experiments. All authors reviewed the paper and provided further ideas for improving its contents.
① Performed at Diamond Beamline B22: Multimode InfraRed Imaging And Microspectroscopy (MIRIAM).