Realising Data-Centric Scientific Workflows with Provenance-Capturing on Data Lakes

Abstract
Since their introduction by James Dixon in 2010, data lakes have received increasing attention, driven by the promise of high reusability of the stored data due to the schema-on-read semantics. Building on this idea, several additional requirements have been discussed in the literature to improve the general usability of the concept, such as a central metadata catalog including all provenance information, an overarching data governance, or the integration with (high-performance) processing capabilities. Although the necessity for a logical and a physical organisation of data lakes in order to meet those requirements is widely recognized, no concrete guidelines have been provided yet. The most common architecture implementing this conceptual organisation is the zone architecture, where data is assigned to a certain zone depending on the degree of processing. This paper discusses how FAIR Digital Objects can be used in a novel approach to organize a data lake based on data types instead of zones, how they can be used to abstract the physical implementation, and how they empower generic and portable processing capabilities based on a provenance-based approach.


INTRODUCTION
Data pipelines are widely used (i) to collect data from heterogeneous sources, (ii) to perform a series of transformations on them, and (iii) to ingest the transformed data into a destination system for data analytics. Such a data pipeline can also be referred to as an ETL (extract, transform, load) process, which is often used to ingest clean and transformed data into a data warehouse [1]. Data pipelines are still [...] Although no commonly accepted concept for a data lake exists in the literature [5], an agreed-upon outline, as described by Dixon, requires a scalable storage system for heterogeneous data where scientists can explore and analyze these data sets. These requirements go hand in hand with the need for low-cost technologies and at first led to a strong association of data lake implementations with Apache Hadoop [6]. This was then superseded by proprietary cloud solutions based on Azure or AWS [4], which introduced the advantage of separating storage and compute resources into the concept of data lakes.
Although a data lake implements schema-on-read semantics, some modeling is mandatory to ensure proper data integration, comprehensibility, and quality [7]. Such data models, created by extracting descriptive metadata from the ingested raw data, are then stored in a central data catalog. This data catalog does not enforce a fixed schema and therefore allows for frequent changes [8]. Although many studies primarily focus on this data catalog and the gradual improvement of the provisioned metadata [9,10], such as semantic information about a data set [11], provenance data is not yet extensively covered by current approaches. However, the raw data in a data lake is very likely subjected to many consecutive transformations, resulting in several artefacts which are ingested back into the data lake. Maintaining concise provenance information is therefore very challenging but crucial to retain the manageability of a data lake [12]. Processed data that is merely stored back into the data lake without such information will be hard to find afterwards and probably impossible to comprehend and to reproduce, potentially rendering it useless.
In this paper we discuss (i) how the strong association between a single data entity and its associated metadata can be expressed within a FAIR Digital Object [13], in particular within the context of the data lake, (ii) how the usage of typed digital objects can conceptually organize a data lake architecture, (iii) how packaging completely generic analysis tasks within FAIR Digital Objects can help to monitor fine-grained provenance information, and (iv) how exploitation of these concepts in the context of Canonical Workflow Frameworks for Research (CWFR, https://osf.io/9e3vc/) can help to increase cross-discipline collaboration.

RELATED WORK
Various solutions tailored for specific purposes have been proposed for provenance capturing in data lakes. Goods [14] analyses log files in a post-hoc manner to determine which jobs created data sets based on which input, an approach which requires that the application writes suitable log files. Similarly, Komadu was integrated into a data lake [12] to support the messaging of provenance information via RabbitMQ, also relying on the explicit support of the application. DataHub [15] was equipped with ProvDB [16] for provenance auditing. ProvDB is based on the analysis of shell scripts, user annotations, post-hoc log analysis from frameworks like Caffe and table-based input files for SQL-based provenance capture. This approach, however, is error-prone and the monitoring of system calls introduces additional run-time overhead.
A common data lake architecture is the so-called zone architecture [5,17]. The general idea is to divide the data lake into different zones and store data depending on the amount of processing it was subjected to. In this way, raw data is stored in a raw zone whereas (pre-)processed data is stored in its own dedicated zones. Although there has recently been great progress in the definition of a standardized zone reference model [18], this architecture lacks a homogeneous and standardized interface to integrate processes and users, a central data catalog to maximize re-usability, and reliable provenance auditing by design.
It is by now commonly accepted that scientific publications should be complemented with an organized aggregation of all related digitized entities, in order to promote comprehensibility of the presented work and further reusability of the data and methods. Research Objects [19] have been proposed as a structured aggregation, which semantically links its entities. It also defines a standard regarding repetition, traceability, and the essential metadata. Different tools, like Sciunits [20], have been developed to automate the process of building a research object from a workflow enactment. Here, application virtualization, based on the monitoring of system calls, is used to build one ready-to-use container including all software, its dependencies and the used data.
All of the above-mentioned developments tackle different problems in the area of collaborative data analytics independently of each other. However, the full potential can only be realized if these individual steps are set into relation to each other and are provided to the users as an integrated solution.

IMPLEMENTATION
In this section, we present a data lake implementation that is particularly tailored to be used in data- and compute-intensive research areas. We put a special focus on the traceability and reproducibility of the different processing steps, like data ingestion or analysis operations. Since the presented data lake implementation is a generic framework aiming at supporting various use cases, we separate the implementation of the aforementioned processing steps from the data lake implementation and thus make it configurable and extendable for individual use cases. In addition, the data lake includes a central role-based user access management, which supports scalability and large teams.

Architecture Overview
The core of our data lake implementation is the central web application, which, among other things, provides the REST API of the data lake. It also implements the orchestration functions, so that a consistent state of the data lake and the essential minimum of (meta)data quality are enforced. Here, the actual mapping between the logical and the physical organization of the data lake happens.
A schematic view is provided in Figure 1. Here, the web application is shown as the single point of contact for data sources to ingest data and for users to interact with the data lake. This means that the services needed to realize the complete data lake are not accessed directly by the users but are always situated behind the web application. Therefore, these back-end services can be easily swapped or added without requiring users to adapt existing workflows and pipelines. There is, however, one exception: The direct interaction of users with version control tools, like GitLab, is tightly integrated into our concept of a data lake in order to support widely used development workflows. Further advantages of this data lake implementation, in addition to scalability, consistency, and maintainability, are the fine-grained control on the one hand and the high-level abstractions of the underlying processes on the other, which empower non-experts to use the data lake.

Ingestion
It is widely accepted that a data lake, as the central data repository of an institution, should accept any data, i.e. any type and format, from any source [4]. Therefore, the ingestion layer of a data lake has to be highly flexible. This has implications on two parts in particular. First, the data transmission should use common protocols like HTTP. Here, it can even be favorable if a data lake is not only a passive data sink, but can actively pull data from other systems in order to stay up-to-date. This can also reduce maintenance overhead, since a lot of different and possibly heterogeneous data pipelines can be managed from a single system and interface. Second, the storage and metadata systems need to be able to cope with, or at least easily adapt to, any kind of input data in order to be able to automatically extract descriptive metadata to update the central data catalog. In order to be able to upload a new file containing raw data, the user first has to configure a new metadata type, containing the key-value pairs with their respective mappings. Then, another API of the data lake is used to upload containers, which contain a metadata extraction tool, e.g. a simple Python script. The kind of metadata that should be extracted at this point from the ingested raw data can be categorized into pre-visualization, summary, and semantic metadata [10], which primarily contain informative and descriptive information.
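The two configuration steps described above, a typed metadata type and an extraction tool, can be illustrated with a minimal Python sketch. The attribute names, the type name mri_scan, and the assumption that a raw file starts with a one-line JSON header are purely hypothetical and are not taken from the actual implementation.

```python
import json

# Hypothetical metadata type: attribute names mapped to their expected types.
# Neither the name nor the attributes stem from the data lake itself.
MRI_SCAN_TYPE = {
    "name": "mri_scan",
    "parent": "raw_data",
    "attributes": {
        "subject_id": str,
        "field_strength_tesla": float,
        "n_slices": int,
    },
}


def extract_metadata(raw_bytes: bytes) -> dict:
    """Toy extraction tool: assumes the first line of the file is a JSON header."""
    header_line = raw_bytes.split(b"\n", 1)[0]
    return json.loads(header_line)


def validate(metadata: dict, data_type: dict) -> dict:
    """Keep only the attributes known to the data type and check their types."""
    validated = {}
    for key, expected in data_type["attributes"].items():
        if key not in metadata:
            raise KeyError(f"missing attribute: {key}")
        value = metadata[key]
        if not isinstance(value, expected):
            raise TypeError(f"{key} should be of type {expected.__name__}")
        validated[key] = value
    return validated
```

In this sketch, attributes that are not part of the registered type are silently dropped, so that only known and typed attributes would reach the data catalog.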

Data Analytics
After the raw data has been ingested into the data lake, the next step is to transform and analyse these data sets. To achieve this, a job manifest has to be written, which unambiguously describes the job that has to be performed. This job manifest is then interpreted by a dedicated adapter and the defined computation is performed. Currently, two different adapters have been implemented to execute a job either by using local resources or by remotely connecting to an HPC system using HPCSerA [21]. New adapters can easily be defined and thus new compute environments can be added, like cloud-based function-as-a-service (FaaS). Due to the compute intensity of most analytics tasks, outsourcing to an HPC system is preferred over local computation. In order to execute a job on an HPC system, some pre-processing is first needed to set up the environment exactly as specified in the job manifest. Then, the run script is submitted to the HPC batch system, where the actual computation is performed. The entire environment with all its dependencies is completely virtualised within a container. In order to guarantee reproducibility, container images are uploaded into the data lake beforehand, and are retained there as long as there is processed data linking to that particular image. These images are represented as FAIR Digital Objects, like the job manifests, all kinds of data, and all other entities. In addition, only input data that has been specified in the job manifest is available to the analytics job. Since all parts of the computation are therefore completely encapsulated, this yields high portability on the one hand and removes the need for run-time provenance auditing on the other, since all possible degrees of freedom are either specified in the job manifest or are determined during the pre-processing (like the executed git commits).
After the computation, some post-processing is performed, where data artifacts that result from the analysis can be ingested back into the data lake. There are two different options from which the user can choose: (i) data is ingested simply as a processed data type, thus only having descriptive metadata in the form of its retrospective provenance information, or (ii) the manual ingestion process as described in Section 3.2 is executed again. In the latter case, descriptive metadata can be extracted in addition to the automatically captured and attached retrospective provenance information, which allows for queries based on the content and not only on the data lineage.
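How a job manifest and its adapters relate to each other can be sketched as follows. This is a minimal illustration under stated assumptions: the field names of JobManifest, the adapter names, and the planned steps are hypothetical, nothing is actually executed, and the sketch does not reflect the HPCSerA API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class JobManifest:
    """Hypothetical job manifest; all field names are illustrative only."""
    container_image: str          # identifier of the container image object
    command: Tuple[str, ...]      # command executed inside the container
    inputs: Tuple[str, ...]       # identifiers of the only visible input data
    adapter: str = "local"        # name of the compute environment
    git_commit: Optional[str] = None


def plan_local(manifest: JobManifest) -> List[str]:
    """Describe a run on local resources."""
    return [f"run {manifest.container_image}: {' '.join(manifest.command)}"]


def plan_hpc(manifest: JobManifest) -> List[str]:
    """Describe pre-processing, submission, and post-processing on an HPC system."""
    return [
        f"stage image {manifest.container_image}",
        *[f"stage input {identifier}" for identifier in manifest.inputs],
        "submit run script to batch system",
        "collect artifacts",
    ]


# New compute environments are added by registering another planning function.
ADAPTERS: Dict[str, Callable[[JobManifest], List[str]]] = {
    "local": plan_local,
    "hpc": plan_hpc,
}


def plan(manifest: JobManifest) -> List[str]:
    """Select the adapter named in the manifest and let it interpret the job."""
    try:
        adapter = ADAPTERS[manifest.adapter]
    except KeyError:
        raise ValueError(f"no adapter configured for {manifest.adapter!r}") from None
    return adapter(manifest)
```

Because the manifest is immutable and lists every input explicitly, the manifest alone would be sufficient to describe what a job was allowed to see.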

Constantly Evolving Metadata
In order to achieve a consistent metadata quality despite a constantly evolving metadata schema, a rather simple modeling approach is taken. First, four top-level data types are defined, namely raw data, processed data, container, and manifest, from which other data types can be derived. Users can register derived data types by defining a new schema which consists of the associated attributes and an assigned name for the new data type. Through this, only known data types with known and typed attributes are indexed and ingested into the data lake. Based on this environment of well-controlled attributes, one can start to maintain a data-lake-wide ontology, defining categories, the associated properties, and relations between those data types and their attributes.
By versioning the data types, continuous change is enabled and, at the same time, the necessary provenance information is maintained to keep full control of the state of the data lake. Performing data analytics on a raw data type and re-ingesting a processed data type automatically causes the data lake to build up a data provenance-centered graph model. Here, the derived data entities, i.e. the aggregated data and metadata, are connected to their individual input data by the job manifest.
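The registration and versioning of derived data types can be sketched as a small in-memory registry. The class TypeRegistry and its methods are assumptions made for illustration; only the four top-level data types are taken from the text.

```python
TOP_LEVEL_TYPES = ("raw_data", "processed_data", "container", "manifest")


class TypeRegistry:
    """Hypothetical registry that keeps every version of a derived data type."""

    def __init__(self):
        self._versions = {}  # type name -> list of schemas, oldest first

    def register(self, name, parent, attributes):
        """Add a new schema version and return its 1-based version number."""
        if parent not in TOP_LEVEL_TYPES and parent not in self._versions:
            raise ValueError(f"unknown parent type: {parent}")
        versions = self._versions.setdefault(name, [])
        versions.append({"parent": parent, "attributes": dict(attributes)})
        return len(versions)

    def schema(self, name, version=None):
        """Return a specific version of a type, or the latest one by default."""
        versions = self._versions[name]
        return versions[-1] if version is None else versions[version - 1]
```

Since older versions are never overwritten, data ingested under an earlier schema would remain interpretable after the type has evolved.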

EXAMPLE WORKFLOW
To apply our generic data lake implementation to a particular, project-specific use case, users first need to perform several configuration steps (see also Section 3.2). First of all, some data modelling is required, resulting in a JSON document specifying a metadata template of a FAIR Digital Object. This template contains the typed attributes which a raw or processed data type can have, the type of the FAIR Digital Object itself, and some information to derive a Digital Object Identifier. Afterwards, a metadata extraction container needs to be uploaded to the data lake. A subsequent configuration is necessary to configure the execution command and related options. At this point everything is prepared to upload data as the previously defined FAIR Digital Object. As a result of the upload, the metadata extraction container automatically extracts the metadata, which itself is added to the data lake in JSON format and checked for consistency. Finally, the metadata is indexed in a NoSQL database, like ElasticSearch, and the actual data, i.e. the actual bit sequence, can be saved in a scalable storage like an S3-based object store. Now, analyses can be performed on this data set, in this case on an HPC system. Analogous to the metadata extraction container, an analysis container image is uploaded to the data lake. Assuming that the necessary adapter, like HPCSerA, is configured for the particular user, a job manifest can be sent to the data lake to manually trigger the execution of an analysis job (see also Section 3.3). The respective container image is then copied to the HPC system along with the input data. If the container does not ship with the necessary software, a git repository can be dynamically cloned and built. All necessary information, like the used commit or build commands, is captured. After processing, specified artefacts are automatically ingested into the data lake. Here, based on a user-provided configuration, the data lake will either perform the full ingestion process to create a Digital Object of a specific type from these artefacts, or they will be ingested as a generic Processed Data-Type. In the latter case, no descriptive metadata except the provenance information will be available.

Currently, all provenance information is written as JSON and indexed in ElasticSearch, but a refined model compliant with the W3C PROV specification (https://www.w3.org/TR/prov-overview/), using a graph database, is under development.
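A retrospective provenance record written as JSON could look like the output of the following sketch. The field names loosely echo W3C PROV terms but are assumptions made for illustration; they describe neither the current JSON layout nor the refined model.

```python
import json


def provenance_record(output_id, manifest_id, image_id, input_ids, git_commit=None):
    """Serialize a hypothetical provenance record for one ingested artefact."""
    record = {
        "entity": output_id,              # the artefact that was ingested
        "was_generated_by": manifest_id,  # the job manifest that produced it
        "used": sorted(input_ids),        # input data listed in the manifest
        "container_image": image_id,      # environment the job ran in
    }
    if git_commit is not None:
        record["git_commit"] = git_commit  # captured during pre-processing
    return json.dumps(record, sort_keys=True)
```

A document of this shape can be indexed as-is in a document store, since it is plain JSON with a flat set of keys.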

TOWARDS A GENERIC DATA ANALYTICS PIPELINE
The current implementation of the data lake exclusively covers the data processing part and therefore fulfils the requirements of the primary use case. More complex research workflows, which may e.g. include experiments, can be mapped onto the data lake by adding elaborate descriptions of experimental protocols to the metadata of the measured data or by uploading the experimental protocol itself as raw data into the data lake and linking it to the corresponding data sets. In order to support collaborations there is the need to share developed analysis tools. A convenient way to do so is by providing containers. Here, we go one step further and allow for a precise configuration which enables machine actionability on the available data. Since scientific workflows usually consist of a sequence of steps, there is a need to link single tasks to a workflow.

Publishing Analysis Tasks
The general analysis process as described in Section 3.3 is highly flexible and supports the rapid development of the involved algorithms and software. Once such software has reached maturity, the developers usually publish it. This process can be completely achieved within the data lake. Here, similar to a job manifest, a slightly changed analysis service needs to be defined. This service can then be used by other data lake users by uploading suitable data into the data lake and starting this analysis service on it. For this process the access policies of the uploaded data can be completely private. The analysis service is accessed by its own unique resource identifier which can be resolved by the data lake. Although using such published analysis services will be fairly easy for other users, since most of the configuration has been done by the actual developers, such an analysis service should be able to provide, upon request, a minimum set of documentation of the available options. Here, particularly the specification of suitable input data is of utmost importance.

Linking Tasks to Workflows
In order to obtain an automated workflow, these single tasks need to be chained together. Since the essential element of a data lake is the data itself, it seems reasonable to focus on data-centric workflows, which are often represented by directed acyclic graphs. Having all the necessary task definitions in place, i.e. the generalized job manifests as described in Section 5.1, the independent tasks simply need to be linked by their respective input and output data. This can be done in a workflow manifest, where tasks should be either generalized job manifests or published analysis tasks. Here, the data lake functions again as an additional abstraction layer between the workflow definition and the actual workflow implementation.

This requires adapters to map the workflow manifest onto suitable workflow engines, like CWL [22] or Apache Airflow. These mappings also support portability outside of the context of the data lake to re-run and verify a workflow locally. However, the intermediary workflow manifest concept makes the data lake highly adaptable, as it enables tailor-made solutions for individual projects while still enforcing well-defined and homogeneous provenance information and artefact handling.
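Linking tasks by their input and output data amounts to deriving a directed acyclic graph and ordering it topologically. The sketch below uses graphlib from the Python standard library (Python 3.9 or later); the structure of the workflow manifest, a mapping from task names to input and output data types, is a hypothetical simplification.

```python
from graphlib import TopologicalSorter


def execution_order(tasks):
    """Order tasks so that every task runs after the producers of its inputs.

    tasks maps a task name to {"inputs": [...], "outputs": [...]}, where the
    entries are data type names (an assumed, simplified manifest layout).
    """
    # Which task produces which data type.
    producers = {
        output: name
        for name, task in tasks.items()
        for output in task["outputs"]
    }
    # For every task, the set of tasks it depends on. Inputs without a
    # producer are assumed to be already present in the data lake.
    graph = {
        name: {producers[i] for i in task["inputs"] if i in producers}
        for name, task in tasks.items()
    }
    # static_order() raises graphlib.CycleError if the manifest is not acyclic.
    return list(TopologicalSorter(graph).static_order())
```

An adapter for a concrete workflow engine would start from such an ordering, or from the dependency graph itself, and translate each task into the language of that engine.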

DATA ANALYTICS PIPELINE USING FAIR DIGITAL OBJECTS AND CWFR
Expressing all entities in a data lake as FAIR Digital Objects (DO) is a novel approach, which provides a practical guideline to become compliant with the general requirement that a data lake needs to have a logical and a physical organization [4]. This is realised through direct user interaction with the different FAIR Digital Objects by calling functions that are specifically defined for the corresponding type. Furthermore, all FAIR DOs have their own governance, which can be integrated into a data lake-wide homogeneous governance layer to realise high scalability across different back-end services.

Example Interaction with FAIR Digital Objects
As previously stated, the fundamental building blocks of our data lake architecture are FAIR Digital Objects. Here, users do not work directly on the services, but call functions which are part of the specific Digital Object instance. This process is discussed using the example of working with a metadata extraction container.
First of all, a user registers a Digital Object Identifier that resolves to the metadata extraction container. On this level, it is an encapsulation that can contain multiple images and configurations. This step also creates a corresponding object representation in the NoSQL database, where information about every Digital Object is kept. Then, a specific container image is uploaded into this Digital Object, using a predefined upload function of that instance. Although this image is itself a Digital Object again, we will not discuss it as it is mostly immutable. The previous function call has stored the container image in our backend storage and has added the particular image to the record of the metadata extraction container Digital Object. This record is part of the metadata of this Digital Object and is currently implemented as a SQL table. Image-specific configurations, like an execution command or an optional bind mount, are also part of this record. Since the exact configuration is part of the provenance information of the ingested data, a function to update the configuration can be exposed without losing the ability to reproduce the results. An update image function is able to add a new image with a new configuration to the Digital Object representing a metadata extraction container. If this function is used, the Digital Object will contain more than one image, with the different configurations logged in the record. A delete function, defined on both the wrapping metadata extraction Digital Object and the image Digital Object, can only be executed if this particular Digital Object Identifier is not present in any provenance information, to ensure reproducibility.
In this example, users only interact with the predefined functions of the Digital Object through the REST API exposed by the web application. There is no need for users to interact directly with the underlying database or storage systems. This ensures a consistent state across the entire data lake.
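The type-specific functions of such a Digital Object, in particular the guarded delete, can be sketched in a few lines. The class name, method names, and the in-memory record are hypothetical stand-ins for the REST functions, the database, and the storage back end.

```python
class ExtractionContainerDO:
    """Hypothetical Digital Object wrapping metadata extraction images."""

    def __init__(self, identifier):
        self.identifier = identifier
        self.record = []  # one entry per uploaded image with its configuration

    def upload_image(self, image_id, config):
        """Store an image together with its image-specific configuration."""
        self.record.append({"image": image_id, "config": dict(config)})

    def update_image(self, image_id, config):
        """Add a further image; earlier entries stay in the record untouched."""
        self.upload_image(image_id, config)

    def delete(self, referenced_identifiers):
        """Refuse deletion while any provenance information refers to this object."""
        if self.identifier in referenced_identifiers:
            raise PermissionError(f"{self.identifier} is referenced in provenance")
        self.record.clear()
```

Keeping earlier entries in the record is what allows a configuration update without affecting the reproducibility of previously ingested data.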

Providing Canonical Steps
As described in Section 5.1, data scientists and solution developers can provide their analysis and metadata extraction tools to other users of the data lake. To achieve this, a more generic service manifest is required, which primarily differs from a job manifest in how input data is selected. In a service manifest, input data is selected implicitly by specifying the data types and metadata attributes that restrict the possible input data to a suitable subset, whereas in a job manifest input data is selected explicitly by lists or queries. Depending on the configured specification of those typed metadata attributes and their exact values, fine-grained conditions can be implemented to control the execution of such a service. Since this process relies on typed attributes, it can be automated, and interoperability between the data and the analysis tasks is achieved. Furthermore, those services can be encapsulated in a FAIR DO and functions can be associated with them. Since data scientists can develop and share those pre-configured analysis and metadata extraction tasks, which are the canonical steps of a data lake workflow, we foster re-usability, sharing, and collaboration.
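The implicit input selection of a service manifest can be illustrated as a filter over typed metadata. The dictionary layout of the service manifest and of the catalog entries, as well as the use of Python predicates as conditions, are assumptions made for this sketch.

```python
def is_eligible(service, data_object):
    """Check whether a data object satisfies the type and attribute conditions."""
    if data_object["type"] not in service["input_types"]:
        return False
    metadata = data_object["metadata"]
    # Every condition must hold; a missing attribute counts as not satisfied.
    return all(
        attribute in metadata and predicate(metadata[attribute])
        for attribute, predicate in service["conditions"].items()
    )


def select_inputs(service, catalog):
    """Return the identifiers of all catalog entries the service may run on."""
    return [obj["id"] for obj in catalog if is_eligible(service, obj)]
```

For example, a service restricted to objects of a hypothetical type mri_scan with a condition such as lambda tesla: tesla >= 3.0 on field_strength_tesla would only be started on scans that carry this attribute with a sufficiently large value.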

Encapsulation and Aggregations of Data and Workflows
Upon the ingestion of raw data into the data lake, data is associated with a certain data type and as such encapsulated with the associated metadata. The actual data, i.e. the associated bit sequence, which is physically stored on any suitable storage system, and the extracted metadata, which can be split into different logical sub-units and be indexed individually in different database systems, are made resolvable by a single unique resource identifier. The information about the physical organization can be completely abstracted from the user by only allowing interactions on the data entities through specific services, which promotes the objectification of data. In the context of a data lake, it is important that the association of a data type with the raw data does not lead to a fixed schema but integrates smoothly with the required data catalog. Further aspects, like the access control lists which might be defined on a DO, also need to be integrated in the overarching governance layer of a data lake. Based on these implicitly built data objects, a data-lake-wide ontology enables an automated classification of this new data into the context of the data lake. Further analysis tasks can add more metadata and context to the data entity.
The beginning of a data science project is often marked by an ad-hoc data exploration phase. Particularly here, users often neglect development and analysis workflows that enable reproducibility in favor of development time. Treating data and its associated metadata strictly as a unified object eases the way to integrate ad-hoc data science capabilities. Here, users can work interactively on their data in the data lake by using a library which imports the data objects from the data lake into their programming environment. The challenge is to have a provenance monitoring system transparently running in order to be able to store resulting artifacts with their respective provenance data in the data lake. One way to achieve this would be to transparently tap into the data collected by tools like ProvBook [23] and send the collected data along with the artifact itself when it is being saved. This artifact, along with its metadata, is then an object itself in the data lake. However, it is desirable to access the entire workflow as a whole. For this, arbitrary aggregations of the different entities of the data lake are necessary. Analogous to this example, the requirement of aggregations also persists in automated workflows, where raw data should be bundled with a workflow manifest, the resulting artifacts, the used environment, and the job manifests in order to promote reproducibility and reusability [21]. This can be done completely in metadata space, since upon performing an analysis a provenance-centered graph model is automatically constructed. Here, a simple backward traversal from a resulting artifact towards the raw data is guaranteed to yield all involved entities. Aggregating them into a single wrapper entity represents the exact state of a workflow. In order to work more easily with those entities, the general taxonomy of the data types of the data lake can be used to offer the user an overview of and access to the constituents in the desired granularity.
Since every action is already abstracted, e.g. performing an analysis is defined in a job manifest, the concept of adapters can be reused to enable portability of those aggregations and to re-execute tasks on different systems using different software stacks. In order to promote high interoperability, the data lake should export those aggregations in well-established formats, like research objects [24]. Here, we envision standardized operations which can be performed on such aggregations, like inspecting the different entities or re-executing the workflow or single steps.
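Collecting all entities of a workflow by walking the provenance graph backwards can be sketched as a breadth-first traversal. The graph representation, a mapping from each entity to the entities it was directly derived from, and the identifiers used in the usage note are hypothetical.

```python
from collections import deque


def aggregate(derived_from, artifact):
    """Collect the artifact and everything it transitively depends on.

    derived_from maps an entity identifier to the identifiers it was directly
    derived from (job manifests, container images, input data).
    """
    seen = {artifact}
    queue = deque([artifact])
    while queue:
        node = queue.popleft()
        for parent in derived_from.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen
```

Calling aggregate on the final artifact of a workflow returns the set of identifiers that a wrapper entity would have to reference; entities that merely share an input with the workflow are not included.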

Challenges
The analysis of the current data lake approach identified two challenges which resulted from our explicit implementation, but may also require a more formal consideration in general.
Automating workflows by using FAIR Digital Objects to achieve interoperability between the data and the analysis tasks carries the inherent risk that data multiplication, and the corresponding surge in compute demand, becomes difficult to control. Consider a potentially ill-configured analysis task which takes in n different data types and outputs m different artifacts. Suppose that a new data source is made available to the data lake, resulting in a huge increase in available entities of the aforementioned n data types. If this automatically triggers innumerable compute- and storage-intensive analysis tasks on the data lake, although the author of the analysis task never intended its use on such a big data set, the resources of the data lake, or the personal quotas, can be quickly exhausted, entailing an undesirable interruption of the service.
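One conceivable guard against such a surge, given here only as an illustrative assumption and not as part of the presented implementation, is to admit automatically triggered jobs only as long as an estimated cost fits into the remaining quota. The function name and the flat cost model are hypothetical.

```python
def admit_jobs(pending, cost_per_job, remaining_quota):
    """Split pending jobs into those that fit into the quota and those deferred.

    cost_per_job and remaining_quota share one unit, e.g. core hours; a flat
    cost per job is a deliberate simplification.
    """
    if cost_per_job <= 0:
        return list(pending), []
    # Number of jobs that can be paid for, never negative.
    affordable = max(0, int(remaining_quota // cost_per_job))
    return list(pending[:affordable]), list(pending[affordable:])
```

Deferred jobs could then be presented to the author of the analysis task for explicit confirmation instead of being executed automatically.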
Currently, wrappers, called manifests, are used to abstract the definition and description of a task or workflow from the underlying software stack. In order to use the underlying local software stack, adapters are needed which map the abstract instructions into the language of specific tools. This entails the necessity to develop and maintain a number of adapters. This problem can only partly be mitigated by using a suitable inheritance hierarchy. In addition, the different software stacks differ in functionality. Thus, it is not possible to support all state-of-the-art features of one technology and still offer the same amount of portability as a manifest using a stripped-down version of the available functionalities.

CONCLUSION
In conclusion, we present a novel data lake architecture which is based on typed digital objects. Our approach shows significant advantages, as it allows the architecture to be easily adapted by defining new data types and offers a uniform and research-oriented user and process interface while keeping a certain technology independence. It is important to stress that no prior assumptions about the specific use case are made, which maximizes re-usability of the ingested data. This is also supported by maintaining a central data catalog which contains all data independent of the degree of processing. All FAIR Digital Objects are linked in a single provenance graph, from the raw data to the final product. The additional abstraction layers, i.e. the manifests and the corresponding adapters, make our data lake highly portable and provide homogeneous retrospective provenance auditing independent of the chosen workflow engine or computational environment. Furthermore, representing all entities of the data lake as objects supports the integration into application- or user-specific processes, starting with an ad-hoc data exploration phase and leading towards fully automated workflows. Here, data analytics tasks can themselves be packaged as services, facilitating exchange and collaboration across teams and scientific domains.