Abstract
We introduce the concept of Canonical Workflow Building Blocks (CWBB), a methodology of describing and wrapping computational tools, in order for them to be utilised in a reproducible manner from multiple workflow languages and execution platforms. The concept is implemented and demonstrated with the BioExcel Building Blocks library (BioBB), a collection of tool wrappers in the field of computational biomolecular simulation. Interoperability across different workflow languages is showcased through a protein Molecular Dynamics setup transversal workflow, built using this library and run with 5 different Workflow Manager Systems (WfMS). We argue such practice is a necessary requirement for FAIR Computational Workflows and an element of Canonical Workflow Frameworks for Research (CWFR) in order to improve widespread adoption and reuse of computational methods across workflow language barriers.
1. INTRODUCTION
The need for reproducibility of research software usage is well established [1, 2, 3], and adaptation of workflow management systems (WfMS) together with software packaging and containers [4] have been proposed as key ingredients for making research software usage FAIR and reproducible [5, 6, 7]. Recently it is also argued that computational workflows should also be treated as FAIR Digital Objects [8] in their own right, with identifier, metadata [2] and interoperability requirements [9].
BioExcel, a European Centre of Excellence for Computational Biomolecular Research, has a particular focus on the research domains molecular dynamics simulations and bioinformatics with use of High Performance Computing (HPC) to approach Exascale performance, while also improving usability. The BioExcel Building Blocks (BioBB) [10] have been created as portable wrappers of open-source computational tools identified as useful for BioExcel workflows, forming several families of documented and interoperable operations that can be called from multiple workflow systems. This interoperability is shown with the BioBB demonstrator workflows, along with multiple tutorials and notebooks.
We propose that these building blocks and their families themselves can be considered composite Digital Objects: collections of software packages and their source code, guides and tutorials, as well as workflow management system integrations and workflow examples. In addition, the building blocks, as wrappers of upstream open source tools, benefit from and refer to the tools’ existing documentation, support forums, academic publications and wider development context.
Given BioBB as a starting point, we define a generalized methodology of Canonical Workflow Building Blocks(CWBB), through the definition of a set of requirements and recommendations for how to formalize and develop a family of compatible computational tools as Digital Objects. These building blocks let researchers instantiate a Canonical Workflow in multiple workflow management systems, while also benefiting from the FAIR aspects of the CWBB Digital Objects.
2. METHODS
The BioExcel Building Blocks library [10], created and implemented within the BioExcel CoE, is a collection of portable wrappers of common biomolecular simulation tools. The BioBB library is designed to i) increase the interoperability between the tools wrapped; ii) ease the implementation of biomolecular simulation workflows; and iii) increase the reusability and reproducibility of the generated workflows. To achieve these main goals, the library was designed following the FAIR principles for research software development best practices [7].
The result is a collection of building block modules, divided in sets of tool wrappers focused on similar functionalities (e.g. Molecular Dynamics, Virtual Screening). Each of the modules is built from a combination of (i) software packaging (Pip, BioConda, BioContainers), (ii) documentation (ReadTheDocs), (iii) interactive tutorials (Jupyter Notebooks, Binder), (iv) registry & findability (bio.tools, BioSchemas, WorkflowHub), (v) WfMS integration stubs (CWL, Galaxy, PyCOMPSs), (vi) source Code (GitHub) and (vii) REST APIs (OpenAPI, Swagger). Notably all building blocks follow the same pattern of installation, configuration and interaction.
Since the publication of the library, several new building block modules (Chemistry, Machine Learning, AMBER MD, Virtual Screening, etc.) have been added, and the set of operations for the existing BioBB families have been expanded. While we previously provided curated adapters (Figure 1) for running BioBB in workflow systems using Common Workflow Language (CWL) and PyCOMPSs, along with Galaxy Toolshed bindings, we have now started auto-generating these bindings, along with command line wrappers and REST web service APIs, using annotations within BioBB's Python docstrings as source. These annotations include sufficient information for a WfMS to launch a particular building block: input and output parameters (including mandatory/optional flags), compatible formats (including from EDAM ontology [11]), example files (essential for testing purposes), default values and dependencies. This ensures human-readable documentation, FAIR metadata and programmatic accessibility can be generated consistently and comparably.
Code snippets for the BioBB WfMS bindings: CWL, PyCOMPSs, Galaxy and KNIME.
The library is showcased through a collection of demonstration workflows [12]. Here, each workflow introduces individual building blocks as needed to explain a particular scientific computational method. We primarily expose the workflows as Jupyter Notebooks [13], which has been highlighted as a valuable tool for reproducible scientific workflows [14]. This offers a graphical interactive interface, including documentation (integrated markdown) related to the workflow and the building blocks used, but also to the biomolecular simulation methods used in the pipeline. Moreover, as we have demonstrated with our own Binder [15] hosting, these workflows are reproducible across platforms, assisted by BioConda [16] packaging of the building blocks and their software dependencies.
This assembly of available demonstration workflows has been successfully used in the BioExcel CoE for dissemination with a range of training events (e.g., BioExcel Summer & Winter School, webinars and virtual training). In training we particularly utilised the Binder infrastructure of the BioExcel Cloud portal [17] to give users a web-based first experience of the building blocks before they try them in other workflow systems.
We can observe that workflow building blocks such as BioBB are necessarily composed of a comprehensive list of digital objects, encompassing source code, packaging, containerization, documentation, attributions, citations, registry entries, WfMS integrations and REST APIs.
We propose to consider building blocks as composite digital objects in their own right: gathering the above software components along with their metadata, identifiers and operations then forms a Canonical Workflow Building Block (CWBB). We suggest this concept as a fundamental element of FAIR Digital Objects for Computational Workflows: researchers use the building blocks computationally as functional operations across WfMSs, while the FAIR aspect of CWBB propagates information and resources that are essential for reproducibility, reuse and understanding by anyone discovering the workflow.
2.1 Interoperability across Different Workflow Languages
The concept of Canonical Workflow Building Blocks is here showcased with the BioBB library, by using a transversal workflow present in many different computational biomolecular projects: a Molecular Dynamics (MD) protein setup. This workflow prepares a protein structure to be used as input for an MD simulation, going through a series of steps where the protein is completed (adding hydrogen and missing atoms), optionally introducing a residue mutation, then submerging the protein in a virtual box of water molecules with a particular ionic concentration, and finally energetically equilibrating the system (so that solvent and ions are well accommodated around the protein at the desired temperature).
This simulation process involves a non-negligible number of steps, using a variety of biomolecular tools. The BioBB library was used to assemble this workflow, interconnecting building blocks using Python functions (Jupyter Notebook, Command Line Interface), auto-generated bindings (Galaxy [18], CWL [19], PyCOMPSs [20]) or manually generated bindings (KNIME [21]). Corresponding workflows for the different WfMS can be found in WorkflowHub [22, 23, 24, 25, 26] and graphical extracts can be seen in Figure 2.
Protein MD Setup transversal workflow, assembled in with 5 different workflow managers using BioBB canonical building blocks. From top-left: Galaxy [22], KNIME [23], CWL [24], Jupyter Notebook [25], and PyCOMPSs [26].
This example demonstrates how the same canonical building blocks can be used in different WfMS. Wrappers and tools executed behind the workflows are exactly the same, but the workflows are built using different WfMS, some of them in a graphical way (Galaxy, KNIME), some in a command line way (Jupyter Notebook, PyCOMPSs, CWL); workflows can be focused on short/interactive executions (Jupyter Notebook), or on High Throughput/High Performance Computing (HT-HPC) executions (PyCOMPSs); some of them prepared for a particular WfMS installation (Galaxy), others completely system-agnostic (CWL).
The current WfMS bindings include Jupyter Notebook, PyCOMPSs, CWL, Galaxy and KNIME WfMS, in addition to a command line mechanism. Thanks to the extensive documentation added in the source code as Python docstrings, new bindings for available WfMS can be generated. We are also experimenting with generating a REST API exposing the building services as Web services. However, it should be noted that such automatic generation of bindings is not always practically feasible. As an example, KNIME nodes require a complete Java skeleton code, as well as a definition of new data types for all inputs/outputs required, which makes their automatic generation a heavy and potentially error-prone task. Bindings for workflow languages with a domain-specific language (DSL) for tool definitions (e.g. Galaxy, CWL) can, on the other hand, be generated in a more straightforward fashion.
The transversal protein MD setup workflow was chosen as a real example that is readily understandable by domain experts. More complex pipelines involving a broader set of wrapped biomolecular tools have been developed using the BioBB library, primarily as Jupyter Notebooks. A selection of these have similarly been assembled for different WfMS using the auto-generated bindings and uploaded to the WorkflowHub repository.
3. DISCUSSION
Early work on libraries of workflows fragments include Web Service-based approaches where tools are wrapped and exposed using common, interoperable data types in BioMoby [27] for bioinformatics and similarly caBIG [28] for cancer genomics. While these efforts were interoperable across WfMSs they required a large up-front investment in agreeing to and adapting native data to common RDF or XML representations.
The notion of abstract workflows [29], structural workflow descriptions separated from their concrete execution realisations and augmented with Linked Data annotations, have been emphasised as essential for reuse and consistency across workflow systems. Identifying common motifs for workflow operations [30] (e.g. Data preparation, Format transformation, Filter, Combine) are important to simplify and understand otherwise fine-grained workflow provenance traces.
Most other efforts to standardise a set of disparate analytical tools have been done within the scope of a single WfMS, allowing customised user interaction, data visualisation, configuration and findability, for instance Taverna components had prototypical building blocks [31] which were instantiated at runtime by reference from a registry. KNIME components and metanodes, shared on the KNIME Hub are frequently designed to be interoperable, but with a perhaps weaker notion of component families. The Galaxy toolshed [32] is likewise populated with different sets of tool wrappers that are largely made to be interoperable within a category.
The Common Workflow Language (CWL) [19] has a strong emphasis on interoperable command line tool descriptions, with support for containers and Conda packaging, as well as support for FAIR metadata like contributors, license and EDAM ontology type annotations. With multiple leading workflow engines now supporting CWL, and experimental Galaxy support, this seems perhaps the most promising candidate for both making and describing canonical workflow building blocks, however we've identified a few stumbling blocks.
One obvious challenge is that the implementing WfMS needs to have CWL support, along with support for either containers or Conda packaging to find the described executables. While it is possible to run a CWL tool directly using a #!/usr/bin/env cwl-runner shebang on POSIX systems, this still requires pre-installation and possibly configuration of a CWL engine like cwltool or Toil [33]. However workflow engines have multiple dependencies and often cannot easily be run from a container themselves①.
Within the CWL community it was originally envisioned that a wider set of workflow systems would adopt CWL for tool description/execution, with a subset implementing full CWL workflow support. This would allow shared community effort for describing tools, say in the Common Workflow Library, rather than each WfMS needing to duplicate this tool wrapping in separate repositories and languages. However, with the exception of experimental tool support in Galaxy, in practice all CWL implementers have gone for full workflow support.
Another challenge is that making a set of building blocks frequently requires the use of shims, for instance file conversion, small search/replace operations or file renames. In a CWL approach these can either be performed with an Expression using JavaScript snippets which only has limited access to file content, or as an additional workflow step added before or after the main tool step. This combination could then be nested as a subworkflow, similar to KNIME's metanodes, and would also be flexible by allowing different containers or packages for any pre- or post-steps. Such a CWL building block however becomes harder to access from a non-CWL WfMS, because of lack of control over configuration/execution options for the now nested CWL tools. In practice②, executing a nested CWL workflow from a native WfMS language would require the engine to implement full CWL Workflow support (or delegate to a CWL engine).
For the main BioBB building blocks we implemented demonstrator workflows that highlight how the tools should be used in different workflow management systems; each having a primary exemplar using Jupyter Notebook, which can be explored interactively using the BioExcel Binder. If we consider the abstract demonstrator workflows as canonical workflows they are therefore very much active objects, but can also be seen as workflow templates, as any real use case will need to specialise the workflow to tweak parameters, data selection etc.
We therefore also provide such workflow templates for multiple WfMS, including CWL, PyCOMPSs and Galaxy. These are fairly disparate workflow languages, yet by the use of the same canonical workflow building blocks (which again invoke the same software binaries), such WfMS-specific workflows effectively are instantiations of the same canonical workflow.
One challenge found is how to publish such canonical workflows in registries like the WorkflowHub. The hub supports the registration of Digital Objects in the form of RO-Crate [34], with the option of abstract CWL for describing the canonical workflow template, along with direct references to the workflow's GitHub repository.
For instance in the RO-Crate for https://doi.org/10.48546/workflowhub.workflow.200.1 [26], which can also be rendered from GitHub, we have an entry for the main workflow according to the Workflow RO-Crate profile, detailing each canonical workflow building block used (e.g., biobb-md metadata). Here the FAIR aspect of the building blocks to help software citation is exercised, as the building block wrapper has one set of authors, documentation and licence (Apache-2.0), while the wrapped software (e.g. GROMACS metadata) has different authors, licence (GPL-2.1+) and documentation.
However, the deposit of such RO-Crates in WorkflowHub results in one registration entry per workflow language, which are not otherwise related and may not even share the same source code repository. Thus, we've identified the need for adding an overall canonical workflow entry, which can bring in workflow documentation and references shared across WfMS implementations, including a set of links to the more granular canonical workflow building blocks used by the workflow, but also to the individual WfMS implementations as separate digital objects.
A similar question of granularity applies at the workflow tool level [4], particularly for Findability and Accessibility, as we can consider at lowest granularity the scientific method in general (e.g., any algorithm for sequence alignment), followed by an application suite (bio.tools entry [35], homepage, documentation), instantiated as a particular software installation (Debian package, Docker container) with its dependencies at same level. The installation includes one or more software executables (a particular binary, a running service), providing at the highest detailed granularity level the specific types of software functionality (a particular mode of operation, choice of analysis), for instance using certain command line flags.
For canonical workflow building blocks, with a focus on pluggable composability, this is mainly defined at this high granularity level of specific software functionality: explicit operations from an installed tool, which are then combined in a workflow. This is indeed the level WfMS tool definitions are typically done, e.g., a CWL Command Line Tool specifies a particular way to run a particular software binary. However, to be an actionable CWBB, the building block needs to additionally convey the lower granularity levels; particularly to support multiple options for interoperable installation and execution, as well as metadata at the most general level, such as documentation and scholarly citations.
While workflow management systems typically only operate at the highest granularity levels for execution details, and are frequently unaware of (or not exposing metadata at) the more general levels, we argue that in order for a Canonical Workflow [36] to follow and support FAIR principles for itself and its data, the workflow management system needs to propagate structured metadata about the tools used by the workflow. We propose that in order to support the workflow's applicability to multiple WfMS, the tools themselves must also have a consistent packaging and formal description that enables computational invocation.
At the most general level, a canonical workflow built using such CWBBs is even conceptually reproducible because the FAIR documentation of the workflow, through its canonical workflow building blocks, identifies how individual tools and software applications are composed, which in worst case can be rebuilt using different installation methods in a different WfMS, or in best case inspected to detect and cross-link the same canonical workflow appearing in different WfMS instantiations. This view of software as composition of other software typically also applies at individual tool level, which themselves depend on programming language runtimes, libraries, services and reference data.
4. REQUIREMENTS FOR CANONICAL WORKFLOW BUILDING BLOCKS
Building on the experiences with BioBB, we here propose requirements and recommendations for establishing Canonical Workflow Building Blocks (CWBB) as implementations of canonical steps introduced for Canonical Workflow Frameworks for Research [36].
The core purpose of a CWBB is to wrap a command line tool or other software that can perform an operation as part of a computational workflow. As such, the general advice for making software workflow-ready applies [37] (e.g., easy to install, documented, parallelizable, reproducible output), however a CWBB is also permitted to make use of additional scripts or shims to further adapt a third-party tool for workflow use and for data interoperability across blocks.
The way tools are installed or invoked varies slightly across WfMS and operating systems, therefore a CWBB should provide multiple methods for distributing software; currently containers (Docker, Singularity) and distribution-independent packaging (e.g., Conda, Homebrew) are promising by having reproducible install recipes and a wide range of available open source dependencies (e.g. Java, Python). Additionally building blocks should allow overriding execution paths, e.g., for use with HPC module system and hardware-optimised binaries.
The CWBBs should have sufficient annotations to be able to generate bindings for different WfMSs and REST APIs, e.g. parameter names and descriptions, types and default values; enumerators for options, file formats for inputs/outputs.
Building blocks should be grouped into families that are interoperable through common data structures and file formats, as well as having joint naming conventions for configuration options. A CWBB family should be released as a single version following semantic versioning rules, which should have a corresponding persistent identifier (PID) [38].
Metadata for CWBBs should be captured following FAIR guidelines, and distributed as part of the block family and resolvable from the PID as a FAIR Digital Object. Metadata should include references to the CWBB software distributions (e.g., quay.io container URL) as well as attributions, citations and documentation for the wrapped tool.
Example workflows showing CWBB usage should be included in a WfMS-neutral language such as Jupyter Notebooks, which may have equivalent variants for each workflow binding. These workflows should be registered in a workflow registry like WorkflowHub or Dockstore, and assigned their own PIDs.
5. CONCLUSIONS
The proposed concept of Canonical Workflow Building Blocks can bridge the gap between FAIR Computational Workflows, interoperable reproducibility and for building canonical workflow descriptions to be used and described FAIRly across WfMSs.
The realisation of CWBBs can be achieved in many ways, not necessarily using the Python programming language together with RO-Crate as explored here. In particular, if the envisioned Canonical Workflow Frameworks for Research become established in multiple WfMSs with the use of FAIR Digital Objects, the different implementations will need to agree on object types, software packaging and metadata formats in order to reuse tools and provide interoperable reproducibility for canonical workflows.
Likewise, to build a meaningful collection of building blocks for a given research domain, a directed collaborative effort is needed to consistently wrap tools for a related set of WfMSs, chosen to target particular use cases (a family of canonical workflows).
For individual users, a library of Canonical Workflow Building Blocks simplifies many aspects of building pipelines, beyond the FAIR aspects and data compatibility across blocks. For instance, they can benefit from training of a CWBB family using Jupyter Notebooks, and then use this knowledge to utilise the same building blocks in a scalable HPC workflow with a CWL engine like Toil, knowing they will perform consistently thanks to the use of containers.
While we have demonstrated CWBB in the biomedical domain, this approach is generally applicable to a wide range of sciences that execute pipelines of multiple file-based command line tools, however it may be harder to achieve with more algebraic “in memory” types of computational workflows, where steps could be challenging to containerize and distinguish as separate block.
We admit that biomolecular research is quite a homogenous field with respect to computational analyses and now becoming relatively mature in terms of tool composability in workflows, building on the experiences of the “FAIR pioneers” in the field of bioinformatics. Other fields, such as social sciences or ecology, can have a wider variety of methods and computational tools, often with human interactions, and may have to adapt the software to be workflow-ready [37] before using them as Canonical Workflow Building Blocks. Domains adapting CWBB approach (or workflow systems in general) should take note of the great benefits of hosting collaborative events where developers meet each other and their potential users, demonstrated in our field with events such WorkflowsRI [39] and Biohackathons [40].
The Common Workflow Language shows promise as a general canonical workflow building blocks mechanism: gathering execution details of tools along with their metadata and references, augmented with abstract workflows to represent canonical workflows. However, this would need further work to implement our CWBB recommendations in full. Future work for the Canonical Workflow Building Blocks concept includes formalising and automating publication practises, to make individual blocks available as FAIR Digital Objects on their own or as part of an aggregate collection like RO-Crate.
ACKNOWLEDGEMENTS
This work has been done as part of the BioExcel CoE (https://www.bioexcel.eu/), a project funded by the European Union contracts H2020-INFRAEDI-02-2018 823830, and H2020-EINFRA-2015-1675728. Additional work is funded through EOSC-Life (https://www.eosc-life.eu/) contract H2020-INFRAEOSC-2018-2 824087, and ELIXIR-CONVERGE (https://elixir-europe.org/) contract H2020-INFRADEV-2019-2 871075.
AUTHOR CONTRIBUTIONS
Stian Soiland-Reyes (soiland-reyes@manchester.ac.uk): Conceptualization, Funding acquisition, Investigation, Methodology, Project administration, Supervision, Writing—original draft, Writing—review & editing
Genís Bayarri (genis.bayarri@irbbarcelona.org): Software, Software Documentation
Pau Andrio (andriopau@gmail.com): Methodology, Software, Validation, Software Documentation
Robin Long (r.long@lancaster.ac.uk): Software, Software Documentation
Douglas Lowe (douglas.lowe@manchester.ac.uk): Software, Software Documentation
Ania Niewielska (aniewielska@ebi.ac.uk): Methodology, Resources, Software
Adam Hospital (adam.hospital@irbbarcelona.org): Methodology, Project administration, Resources, Software, Validation, Visualization, Writing—original draft, Writing—review & editing
Paul Groth (p.t.groth@uva.nl): Supervision, Writing—review & editing
The authors would also like to acknowledge contributions from: Felix Amaladoss, Cibin Sadasivan Baby, Finn Bacall, Rosa M. Badia, Sarah Butcher, Gerard Capes, Michael R. Crusoe, Alberto Eusebi, Carole Goble, Josep Lluís Gelpí, Modesto Orozco, and Geoff Williams.
To execute the wrapped tool, a containerized workflow engine would need nested containers which are not generally recommended for security reasons. It is possible to work around this limitation using Singularity or Conda.
It is worth mentioning that it would also be possible to generate WfMS-specific bindings from CWL descriptions (e.g. as demonstrated with cwl2script for Bash, gxargparse for Galaxy, cwl2wdl for WDL), although this necessitates constraining the tool and workflow definitions to a limited mappable subset of CWL.