Computational workflows describe the complex multi-step methods that are used for data collection, data preparation, analytics, predictive modelling, and simulation that lead to new data products. They can inherently contribute to the FAIR data principles: by processing data according to established metadata; by creating metadata themselves during the processing of data; and by tracking and recording data provenance. These properties aid data quality assessment and contribute to secondary data usage. Moreover, workflows are digital objects in their own right. This paper argues that FAIR principles for workflows need to address their specific nature in terms of their composition of executable software steps, their provenance, and their development.
In data-intensive science, e-infrastructures and software tool-chains are heavily used to help scientists manage, analyze, and share increasing volumes of complex data. Data processing tasks like data cleansing, normalisation and knowledge extraction need to be automated stepwise in order to foster performance, standardisation and re-usability. Increasingly complex data computations and parameter-driven simulations need reliable e-infrastructures and consistent reporting to enable systematic comparisons of alternative setups [2,3]. As a response to these needs, the practice of performing computational processes using workflows has taken hold in different domains such as the life sciences [4,5,6], biodiversity, astronomy, geosciences, and social sciences. Workflows also support the adoption of novel computational approaches, notably machine learning methods, due to the ease with which single components in a processing pipeline can be exchanged or updated.
Generally speaking, a workflow is a precise description of a procedure: a multi-step process to coordinate multiple tasks and their data dependencies. In computational workflows each task represents the execution of a computational process, such as running code, invoking a service, calling a command line tool, accessing a database, submitting a job to a compute cloud, or executing a data processing script or another workflow. Figure 1 gives an example of a real workflow for variant detection in genomics, represented using the Common Workflow Language① open standard.
Computational workflows promise support for automation that scales across computational infrastructures and large datasets while shielding users from underlying execution complexities such as inter-resource incompatibilities and software dependencies. From an execution perspective, workflows are a means to handle the work of accessing an ecosystem of software and platforms, managing data, securing access, and handling heterogeneities. From a reuse and reproducibility perspective, the automated methods can be packaged and ported across computational platforms, easing how we can create and execute workflows in different environments and across the diverse expertise levels of users. From a reporting perspective, they are a means to specify and document the experiment design and report the methodology: accurately recording the data inputs, parameter configurations and history of their runs and the provenance of their output data products. The provenance of a result (i.e., why and how a given result has been obtained by an analysis) enables the comprehension, comparison and verification of multiple results, and hence facilitates the exchange, standardisation, trust and reusability of those results.
The rise in the use of workflows has been accompanied by a range of diverse systems by which they can be implemented. At one end of the spectrum reside ad-hoc scripts (shell code, Python, Java, etc.) and interactive notebooks which provide an intuitive interface to quickly interact with the analysis results (e.g., Jupyter②, RStudio③, Zeppelin④). At the other end are Workflow Management Systems (WfMS) that provide a feature-rich infrastructure for the definition, set-up, execution and monitoring of a workflow. Some WfMS are aimed at general applications (e.g. KNIME⑧) whilst others have been adopted by specific communities with specialised features and component collections (e.g. Nipype⑨ for neuro-bioimaging). Different aspects of FAIR principles will apply across this range of implementation choices.
WfMS can roughly be divided into two types: coarse-grained, with a prime focus on chaining locally hosted or distributed tools (e.g. Galaxy⑩, KNIME, Taverna⑪), and fine-grained, focusing on optimising computational resources over Distributed Computing Infrastructures (DCI) or High Performance Computing (HPC) for applications (e.g. Pegasus⑫, SnakeMake⑬, Nextflow⑭, Dispel4Py⑮) and cloud-based container orchestration (Kubernetes⑯). Many WfMS mix the two kinds. All WfMS aim to handle common cross-cutting concerns on behalf of the workflow execution. Concerns include: resource scalability (optimisation, concurrency and parallelisation), secure execution (of tools in their environment, monitoring and fault handling), tracking (process logging and data provenance tracking) and data handling (secure access, movement, reference management). WfMS vary in how their users interact with them, for example by ABIs/APIs, scripting or command lines, or a GUI using “drag, drop and linking”. WfMS may execute over HPC or geographically distributed clusters, cloud environments across systems, or even from desktops. They consequently vary in their mechanisms to prepare their components to become executable steps and must manage portability and dependencies on the infrastructure used to run them.
Computational workflows are composed of modular building blocks that have been prepared with standardised interfaces to be linked together and run by a computational engine. Thus, the key characteristic is the separation of the workflow specification from its execution. Capturing the control flow order between components explicitly exposes the dataflow and data dependencies between the inputs and outputs of the processing steps. This explicit separation is fundamental to supporting workflow comprehension, design modularity, workflow comparisons and alternative execution strategies. Scripting environments tend to interleave data and computational processes, although systems such as YesWorkflow provide the means to annotate existing scripts with special comments that reveal their hidden computational modules and dataflows. Interactive notebooks make the distinction when organized appropriately, by defining their dataflow in the form of interactive computational cells; i.e., input and output variables are explicit in each cell, data dependencies are explicit in each cell, and the steps are executed in order. Notebooks can also be used as “meta workflows” when steps run a script or a WfMS.
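The separation of specification from execution can be illustrated with a minimal sketch. It is not the design of any particular WfMS: the step names, the task registry and the toy engine are hypothetical, and the engine assumes each step has at most one upstream dependency. The specification is plain data that declares dependencies; the order of execution is derived from it rather than written into it.

```python
from graphlib import TopologicalSorter

# Workflow specification: pure data, no execution logic (hypothetical step names).
SPEC = {
    "clean": {"task": "strip", "needs": []},
    "normalise": {"task": "lower", "needs": ["clean"]},
    "count": {"task": "length", "needs": ["normalise"]},
}

# Task implementations, registered separately from the specification.
TASKS = {
    "strip": lambda value: value.strip(),
    "lower": lambda value: value.lower(),
    "length": lambda value: len(value),
}


def execution_order(spec):
    """Derive a valid step order from the declared data dependencies."""
    graph = {name: set(step["needs"]) for name, step in spec.items()}
    return list(TopologicalSorter(graph).static_order())


def run(spec, tasks, workflow_input):
    """Toy engine: feed each step its predecessor's output, or the workflow input."""
    results = {}
    for name in execution_order(spec):
        step = spec[name]
        # Assumption of this sketch: at most one upstream dependency per step.
        source = results[step["needs"][0]] if step["needs"] else workflow_input
        results[name] = tasks[step["task"]](source)
    return results
```

Because `SPEC` contains no code, it could be compared with another specification, inspected for its dataflow, or handed to a different engine without touching the task implementations.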
We propose that FAIR Principles apply to workflows, and WfMSs, in two major areas:
• FAIR data: Properly designed workflows contribute to FAIR data principles, since they provide the metadata and provenance necessary to describe their data products and they describe the involved data in a formalized, completely traceable way.
• FAIR criteria for workflows as digital objects: Workflows are research products in their own right, encapsulating methodological know-how that is to be found and published, accessed and cited, exchanged and combined with others, and reused as well as adapted.
These two aspects are explored in the rest of the article. References to FAIR principles are given in brackets.
2. FAIR DATA FOR AND FROM WORKFLOWS
Computational workflows are enablers of automated data processing. For automation to be most effective the data the workflows act on should be FAIR: unique resolution of identifiers, explicit data organisations, structures and semantics, machine-readable licenses and access permissions, high data quality and so on. FAIR data would enable a WfMS to automatically make informed choices from the phase of the workflow design (e.g., by suggesting tools fitting data features when several alternative tools could be considered for a given data analysis step) to the phase of workflow execution (e.g., by validating the data against a step's expected type). A WfMS needs to be able to access precise information on data origin, the way of accessing it, and a set of associated metadata, which is ideally described by established vocabularies and computer-interpretable semantics.
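As a hedged illustration of the design-time checks that machine-readable data descriptions make possible, the sketch below verifies that each step in a linear pipeline produces the format its successor expects. The `Step` class, the step names and the format labels are placeholders chosen for this example; a real system would more likely use identifiers from an established vocabulary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    consumes: str  # format label the step expects as input (placeholder label)
    produces: str  # format label the step emits (placeholder label)


def check_chain(steps):
    """Return human-readable mismatches between consecutive steps."""
    problems = []
    for upstream, downstream in zip(steps, steps[1:]):
        if upstream.produces != downstream.consumes:
            problems.append(
                f"{upstream.name} produces {upstream.produces!r} "
                f"but {downstream.name} expects {downstream.consumes!r}"
            )
    return problems


# Hypothetical variant-detection pipeline with consistent formats.
pipeline = [
    Step("align", consumes="fastq", produces="bam"),
    Step("call_variants", consumes="bam", produces="vcf"),
    Step("annotate", consumes="vcf", produces="vcf"),
]

# Dropping the middle step leaves a format gap that the check reports.
broken = [pipeline[0], pipeline[2]]
```

Here `check_chain(pipeline)` returns an empty list, while `check_chain(broken)` reports that `align` produces `'bam'` but `annotate` expects `'vcf'`. The same metadata could drive tool suggestions during workflow design.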
Research domains such as the life sciences have developed open ontologies, vocabularies and services for data interoperability (I1, I2, I3) and identifier resolution (F1, A1). Efforts such as the Breeding API (BrAPI)⑰ standardise the interface for exchanging data between applications, and the EDAM ontology⑱ precisely specifies the input and output of tools executed in a workflow (see FAIRsharing.org for examples). Formalised and fine-grained annotation of data is still considered costly to produce. Consequently, a significant amount of workflow processing still deals with metadata wrangling, format transformations and identifier mapping.
Workflows are in turn key contributors to FAIR data compliance. In a world of expanding and diversifying processing tools and computational operating environments, they encode standardised data practices and capture formal computer-interpretable provenance data. The workflow specification itself can be thought of as a recipe to produce a data product. It exposes the extent of the effort made to make the data FAIR, and it is formal enough to be validated against emerging FAIR indicators. Determining whether the data produced by a workflow is FAIR is not straightforward and requires concrete criteria, which should be provided by both the FAIR indicators and the workflow specification.
The combination of FAIR data and FAIR tools within a supportive FAIR e-infrastructure would significantly aid the operation and quality assurance of workflows. Machine-processable metadata describe the inputs, outputs and performance of tools and the underlying resources for running workflows and managing results. Examples include annotations on tools and libraries (e.g. bio.tools⑲, Bioconductor, CRAN, PyPI) and on software containers (e.g. Biocontainers⑳). Standardised specifications on handling data formats and executables, automated handling of tool dependencies, versioning and other explicit metadata on computational resource needs would aid the harmonisation of successive software tool executions, efficient job scheduling and data movement across different FAIR e-infrastructures.
For data generation a standardised workflow specification and automated execution contributes to transparency, reproducibility, analytic validity, quality assurance and the attribution and comparison of results. Well-designed workflow management systems can automate the production of metadata descriptions of data products (F2, I2, I3, R1.3) and the deposition of data in searchable resources (F4). Identifiers, licensing and access present interesting challenges in workflow execution:
• Identifiers (F1, F3, A1): concerns include the propagation of identifiers through the workflow, tracking data attribution and the minting of identifiers for large numbers of intermediate results. Minids are proposed as light-weight identifiers to unambiguously name, identify and reference research data products. Identifiers can then optimise data exchange by reference and reduce unnecessary or insecure data movements. Workflows need to move data references through their engines and not the data itself.
• Licensing (R1.1): As workflows often access and combine data from different sources, data licenses need to be respected, honoured and propagated, as do licenses on the software used by the workflow tasks. Combining licenses is particularly tricky and can affect the ability to license the workflow itself or its data products, a step that is often neglected or delayed because of this challenge.
• Data access (A1.1, A1.2): Single sign-on for workflow constituents requires harmonised Authentication and Authorization Infrastructure (AAI) propagation through the different tasks, which may be hosted by different service providers using different systems.
Workflows intrinsically provide precise documentation of how the data has been generated (R1.2). The details of every executed process, together with comprehensive information about the execution environment used to derive a specific data product, constitute retrospective provenance, either observed by the WfMS or disclosed to it by the computational task itself. A great deal of work has focused on provenance tracking of computational workflows, leading to standardisation efforts such as the W3C PROV model and ontology (I1, I2). Challenges remain: provenance standards have yet to be fully embraced by WfMS, and there is still a shortage of provenance processing tools. Automated provenance collection can be too fine-grained and too detailed to be of service to researchers, indicating a need for ontological abstractions. Moreover, the computational steps are often themselves unFAIR. Although open source tools allow us to inspect procedures, many codes are proprietary software or opaque boxes (especially those only available as run-time binaries or representing deep learning algorithms) that do not disclose the link between their inputs and outputs, breaking the provenance lineage of data. Steps in coarse-grained workflows are often wrapped applications with buried sub-workflows and manual (untracked) steps within. Data resources and tools do not always report basic metadata such as their version or licence in a standardised, machine-interpretable way. Bioschemas㉑ aims to get such metadata marked up in resources in a lightweight way. A greater problem is unFAIR service provision, whereby the components change their interfaces without notice, breaking the workflows that use them. Given their data focus, the FAIR principles are chiefly focused on the availability of metadata rather than the quality of service of the databases, tools and the e-infrastructures in which data can exist.
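Retrospective provenance of the kind described above can be sketched as a log of relations between activities (executed steps) and entities (data items). The relation names `used` and `wasGeneratedBy` are borrowed from the W3C PROV vocabulary, but the record layout, the helper functions and the file names are assumptions made for this illustration and do not follow any PROV serialisation.

```python
def record_step(log, activity, inputs, outputs):
    """Append the provenance of one executed step to the log."""
    for entity in inputs:
        log.append({"relation": "used", "activity": activity, "entity": entity})
    for entity in outputs:
        log.append({"relation": "wasGeneratedBy", "entity": entity, "activity": activity})


def lineage(log, entity):
    """Return every entity that the given entity was transitively derived from."""
    generated_by = {
        r["entity"]: r["activity"] for r in log if r["relation"] == "wasGeneratedBy"
    }
    used = {}
    for r in log:
        if r["relation"] == "used":
            used.setdefault(r["activity"], []).append(r["entity"])

    ancestors, pending = set(), [entity]
    while pending:
        current = pending.pop()
        activity = generated_by.get(current)  # None for raw inputs
        for source in used.get(activity, []):
            if source not in ancestors:
                ancestors.add(source)
                pending.append(source)
    return ancestors


# Hypothetical two-step run.
log = []
record_step(log, "align", inputs=["reads.fastq", "reference.fa"], outputs=["aligned.bam"])
record_step(log, "call_variants", inputs=["aligned.bam", "reference.fa"], outputs=["variants.vcf"])
```

With this log, `lineage(log, "variants.vcf")` traces back to `aligned.bam`, `reference.fa` and `reads.fastq`. If the `align` step were an opaque box that did not disclose its inputs, the trace would stop at `aligned.bam`, which is the break in lineage described above.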
3. FAIR CRITERIA FOR WORKFLOWS AS DIGITAL OBJECTS
The initial FAIR criteria were envisioned for data. As workflows are digital objects in their own right, it is natural to draw an analogy with data and to try to apply the FAIR Principles to them. The majority of workflows are not yet registered in specialised repositories, or are stored in software repositories where they are indistinguishable from other software. Conventions for naming workflows still have to be devised (F1). As happens with software, workflows vary in the quality of their documentation and are often described using proprietary or native programming languages.
Researchers have been actively exploring ways for workflows to be FAIR. Workflow registries and repositories typically cater for specific WfMSs, such as KNIMEHub㉒ for KNIME and nf-core㉓ for Nextflow Life Science pipelines, to support findability and accessibility (F4), with description and metadata associated with deposited workflows (F2) and in some cases persistent, unique identifiers (F1). Access is typically baked into the workflow systems (A1); for example, Galaxy workflows are only available in dedicated Galaxy installations such as Workflow4Metabolomics㉔. Others such as WINGS㉕ provide means to export workflows augmented with semantic representations as Linked Data. Workflow findability in repositories has been studied alongside workflow similarity, where workflows are compared based on their metadata and structure. For workflows to be accessible in the same way as data, they need to be archived and cited just as data is archived and cited using citation metadata. In the schema.org mark-up used by citation infrastructures such as DataCite, terms indicate data that is derived from other data. Ideally there should also be terms indicating the software or service used to perform a transformation, harmonised with workflow provenance.
myExperiment㉖ is an attempt at a WfMS-agnostic repository, pioneering approaches for workflow finding, sharing and publishing with licenses. It credits authors when workflow designs are reused or repurposed, and packages workflows into collections and with other digital objects such as associated data files and publications. The work laid the foundations for workflow-based Research Objects㉗, which allow all the artefacts associated with an investigation or piece of research to be bundled into one whole that can also be cited. The description files of the Figure 1 workflow, together with links to executable containers and data files, can be downloaded in a Research Object zip-based bundle along with citation metadata, and assigned a DOI on deposit. The European Open Science Cloud for Life Sciences㉘ has started work to build a workflow registry (F4) using the CWL standards with Research Objects federated (I3) with registries for tools (bio.tools) and containers (Biocontainers).
Several attempts have been made to standardise workflow descriptions in order to aid discoverability (F1) and enable interoperability (I1). The Interoperable Workflow Intermediate Representation was proposed as a common bridge for translating fine-grain workflows between different languages, independent of the underlying distributed computing infrastructure. The Workflow Description Language㉙ and the Common Workflow Language are recent community efforts to describe workflows. The CWL open standards are used to describe workflows and command-line tool interfaces in a way that makes them portable and scalable across a variety of software and hardware environments and runnable by other CWL-compliant engines. This last point is critical. As descriptions of processes, workflows inherit properties of FAIR data, but as executable processes they inherit properties of software. Workflows as processes challenge the FAIR principles by their structure, forms, versioning, executability, and reuse.
Structure. Workflows are often inherently composite. Their components can be workflows in a nested, fractal way, i.e., (interdependent or sets of) sub-workflows that can be executed as part of complex workflows (see Fig. 1). The distinction between a workflow and its component steps is blurred. FAIR principles can thus be applied simultaneously on multiple levels. Rendering composite workflows and sub-workflows findable also relies on the findability of the involved tools and data types, as researchers often use these as search attributes. FAIR properties on the components (metadata, licensing, author credit, access authorization and so on) propagate to the workflow level and may be incompatible. Fundamentally, how we identify, cite and credit composite, multi-authored objects is an open question.
Forms. When we speak of a FAIR workflow, what do we mean? A workflow can be a CWL specification with test or exemplar data; an implementation of that design in a WfMS; an instantiation of that implementation ready to be run with input data and parameters set and computational services spun up; or a run result with intermediate and final data products and provenance logs. Workflow-centric Research Objects attempt to create a metadata framework to capture and aggregate each form, but each may have different FAIR criteria.
Versioning. Like software, workflows are living artefacts to be maintained, updated, and eventually deprecated. The code components, the WfMS itself and the underlying computational infrastructures they run on evolve and change. The evolution of a workflow is a form of provenance (R1.2) that tracks any alteration of an existing workflow, resulting in another version that may produce the same or different results. Moreover, to make methodological variants, workflows can be recycled and repurposed: cloned, forked, merged and dramatically changed. Workflow repositories such as nf-core㉓ embrace this software nature, building on top of collaborative development environments such as GitHub that natively support versioning as well as testing and validation. Thus, FAIR principles have to address versioning and “fixivity”; that is, the need to snapshot a workflow and its dependencies to fix its reproducible state and associate a persistent identifier.
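One way to picture such a snapshot is a content fingerprint computed over the workflow definition together with its pinned dependencies, which a repository could then associate with a persistent identifier. The sketch below is only an assumption about how this might be done: the function name, the placeholder workflow text and the tool versions are hypothetical, and a SHA-256 digest is a fingerprint, not a persistent identifier in itself.

```python
import hashlib
import json


def snapshot_digest(workflow_text, dependencies):
    """Return a SHA-256 digest over the workflow text and its pinned dependencies."""
    payload = json.dumps(
        {"workflow": workflow_text, "dependencies": dependencies},
        sort_keys=True,  # make the digest independent of dictionary ordering
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Placeholder workflow definition and hypothetical pinned tool versions.
WORKFLOW_TEXT = "steps: [align, call_variants]"
PINNED = {"aligner": "1.2.0", "caller": "0.9.1"}
```

Two snapshots with the same definition and the same pinned versions yield the same digest, while changing any dependency version yields a different one, so the digest can signal that a new version of the workflow needs to be registered.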
Executability. Workflows are executable objects. To be interoperable and reusable they need to be portable, encapsulating all their runtime dependencies. Lightweight container-based virtualisation solutions to distribute software (e.g. Docker㉚, Singularity㉛) and platform independent software packaging and distribution (e.g. Conda㉜) revolutionise workflow reusability. Nevertheless, workflows and the software tools they use are time limited objects whose active lifespan is dependent on that of their components and WfMS as much as on their scientific relevance. Consequently CWL addresses both explicit support for containerised execution and the lifting of workflow descriptions from the WfMS or application it is embedded in so that it may be runnable in other CWL-compliant engines, even when no longer executable in its native form. In workflow e-infrastructures (e.g. in local workstations or cloud environments), resource limitations need to be defined by the workflow. The stability of the execution environment needs to be covered by the infrastructure and its components. Security and access control issues are crucial as workflow systems often execute codes in shared distributed resources.
Reuse. Reusing workflows turns out to mean different things depending on the purpose of the reuse. Workflow redo/re-run re-executes the exact same workflow, data, parameter settings and tools with the aim of re-creating identical results to test the robustness of the process. Workflow replication allows minor changes, usually in the workflow environment and/or parameter settings, but the results are expected to be the same. Workflow reproduction aims to retain the same analysis but with variations to the means (steps) or data to test the robustness of the results. Workflow reuse may use all or part of the original workflow with new data, possibly with a different aim, as in workflow repurposing/recycling [33,44]. Regardless of intent, the workflow user must be confident that the expected results are generated. R1 is fostered by robust software practices [45,46,47] that entail:
• Proper testing of the computational workflow and its modules as well as the software tools that are invoked during workflow runtime;
• Validation of interoperability claims that tests workflow replication on different platforms;
• Validation of parameters to preclude workflow failure and faulty or unsafe results. The formulation of parameters must therefore be FAIR and must include documentation and explanation of their purpose and range definitions (testing of parameter ranges). The BioCompute Object specification emphasises detailed representation and validation of parameters for regulatory approval of reusable computational pipelines for precision medicine.
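The parameter validation described in the last item can be sketched as follows, assuming each parameter is declared with a documented purpose and an allowed numeric range. The `ParameterSpec` class, the parameter names and the ranges are hypothetical and are not taken from the BioCompute Object specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParameterSpec:
    name: str
    purpose: str  # human-readable explanation of what the parameter controls
    minimum: float
    maximum: float


def validate(specs, values):
    """Check supplied values against the documented parameters and their ranges."""
    errors = []
    for spec in specs:
        if spec.name not in values:
            errors.append(f"{spec.name}: missing ({spec.purpose})")
            continue
        value = values[spec.name]
        if not spec.minimum <= value <= spec.maximum:
            errors.append(
                f"{spec.name}: {value} outside documented range "
                f"[{spec.minimum}, {spec.maximum}]"
            )
    known = {spec.name for spec in specs}
    for name in values:
        if name not in known:
            errors.append(f"{name}: undocumented parameter")
    return errors


# Hypothetical parameter declarations for a workflow.
SPECS = [
    ParameterSpec("min_quality", "minimum base quality to keep a read", 0, 60),
    ParameterSpec("threads", "number of worker threads", 1, 64),
]
```

Running `validate(SPECS, {"min_quality": 120, "threads": 4})` before execution reports the out-of-range value instead of letting the workflow fail or silently produce faulty results at runtime.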
These thoughts lead to two conclusions. First, treating FAIR workflows as data artefacts only goes so far. Their characteristics as software artefacts mean that they should also be subject to appropriate FAIR principles for software, incorporating best practices for software maintainability, maturity and computational reproducibility. Second, the individual parts, forms, versions and execution environments of a workflow need to be FAIR by themselves, leading to complex interdependencies which need to be covered by additional FAIR metrics aligned to their nature. Generally, a compromise has to be found between the amount of work users have to invest in annotating their workflow and the amount of metadata needed for verification by FAIR indicators. Automatic annotation strategies and annotation tool support will be crucial to ease the burden on workflow developers.
Computational workflows capture complex multi-step methods that require FAIR practices in order to be properly published, found, accessed, cited, reused, and combined with others. FAIR principles for data, and for software, are generally applicable, but need to be extended in order to address the processual nature of workflows. Consequently, new FAIR indicators will also need to be developed. A framework for FAIR workflows will enhance the reproducibility, quality and transparency not only of the data generated, but also of the processing path that leads to the data results. When used to document the provenance of new data products, workflows become a powerful component for FAIR data practices and provide new capabilities such as better findability, ideally based on the intrinsic methods used to generate data. The FAIRification of workflows along all these lines will pave the way for trustable data with the added value of being ready for secondary data reuse and exploitation by third parties.
C. Goble (email@example.com, corresponding author) and D. Schober (Daniel.Schober@ipb-halle.de) initiated the effort and conceived the paper. C. Goble co-ordinated and led the writing, and edited the manuscript. C. Goble, S. Cohen-Boulakia (firstname.lastname@example.org), S. Soiland-Reyes (email@example.com), D. Garijo (firstname.lastname@example.org), Y. Gil (email@example.com), M.R. Crusoe (firstname.lastname@example.org), K. Peters (Kristian.Peters@ipb-halle.de), D. Schober all contributed to the concepts, arguments, and written text. All reviewed the text. D. Schober and M.R. Crusoe contributed examples.
Carole Goble acknowledges funding by BioExcel2 (H2020 823830), IBISBA1.0 (H2020 730976) and EOSCLife (H2020 824087). Daniel Schober's work was financed by Phenomenal (H2020 654241) at the initiation phase of this effort; his current work is an in-kind contribution. Kristian Peters is funded by the German Network for Bioinformatics Infrastructure (de.NBI) and acknowledges BMBF funding under grant number 031L0107. Stian Soiland-Reyes is funded by BioExcel2 (H2020 823830). Daniel Garijo and Yolanda Gil gratefully acknowledge support from DARPA award W911NF-18-1-0027, NIH award 1R01AG059874-01, and NSF award ICER-1740683.