Scaling Notebooks as Re-configurable Cloud Workflows

Abstract Literate computing environments, such as the Jupyter (i.e., Jupyter Notebooks, JupyterLab, and JupyterHub), have been widely used in scientific studies; they allow users to interactively develop scientific code, test algorithms, and describe the scientific narratives of the experiments in an integrated document. To scale up scientific analyses, many implemented Jupyter environment architectures encapsulate the whole Jupyter notebooks as reproducible units and autoscale them on dedicated remote infrastructures (e.g., highperformance computing and cloud computing environments). The existing solutions are still limited in many ways, e.g., 1) the workflow (or pipeline) is implicit in a notebook, and some steps can be generically used by different code and executed in parallel, but because of the tight cell structure, all steps in the Jupyter notebook have to be executed sequentially and lack of the flexibility of reusing the core code fragments, and 2) there are performance bottlenecks that need to improve the parallelism and scalability when handling extensive input data and complex computation. In this work, we focus on how to manage the workflow in a notebook seamlessly. We 1) encapsulate the reusable cells as RESTful services and containerize them as portal components, 2) provide a composition tool for describing workflow logic of those reusable components, and 3) automate the execution on remote cloud infrastructure. Empirically, we validate the solution's usability via a use case from the Ecology and Earth Science domain, illustrating the processing of massive Light Detection and Ranging (LiDAR) data. The demonstration and analysis show that our method is feasible, but that it needs further improvement, especially on integrating distributed workflow scheduling, automatic deployment, and execution to develop as a mature approach.


INTRODUCTION
The study of many scientific problems, e.g., significant environmental challenges or cancer diagnosis, requires large data volumes, advanced modeling techniques, and distributed computing facilitates [1,2].The literate computing environments such as Jupyter notebook, JupyterLab, and JupyterHub have exploded in popularity and emerged as a de-facto standard across different engineering and science domains [4,5,6], e.g., ecology [10,11], biology [12], and medical research [13,14].They provide an excellent combination of explanatory text, software code, computational output, and multimedia resources in an executable interactive document, in which users input programming code or text in rectangular cells in the Jupyter notebook.
The narratives of the scientific experiment in a notebook can be described as scientific pipeline steps or workflow contained with executable code fragments.However, the workflow usually is implicit in a notebook without an explicit, structured workflow-oriented description.For instance, some steps can be represented as atomic tasks in conjunction with the dependencies regarding inputs/outputs.Still, because of the tight cell structure of the notebook, the workflow patterns cannot be well extracted, which further influences the reusability of some generic code fragments [24].Besides, from the perspective of input/output dependencies, there are still some performance bottlenecks in the workflow execution, e.g., some steps need to be parallelized or scaled out, especially when the inputs are a large volume of data.
Current approaches usually enable the computational notebook as the whole job to be scaled out on a dedicated remote infrastructure, e.g., high-performance computing (HPC) settings and cloud environments [7,15,16,17,18,19,20,21].Generally, to bridge the gap between exploratory scientific analysis and computing environments, many implemented Jupyter environment architectures are mainly coupled with a pre-configurable infrastructure (e.g., via IaaS Cloud) to dynamically deploy and manage notebook-based application instances.However, widely-used solutions ignored the workflow-oriented representation and management in the single notebook (i.e., workflow structure regarding internal dependencies and efficient parallelism for task's input sizes).As a result, managing such a workflow in the notebook becomes a major bottleneck to meet the demands of scaling scientific experiments from domain researchers.
In light of this, we are motivated by the research question: how to seamlessly manage workflow in a notebook?In this paper, we give the first answer to this question and propose the Notebook-as-a-Workflow (NaaW) method.Our contributions mainly include: 1. we propose and prototype the component containerizer, which can encapsulate the reusable code fragments (cells) as RESTful services and containerize them as science portal components.The rest of the paper is organized as follows.Section 2 states the problem with a use case from the ecology and earth science domain, including the challenges and requirements, a review of the existing work regarding the topic of scaling scientific applications on distributed computing infrastructures, and a summary of the limitations of existing solutions.Section 3 provides the prototype design of our method and implementation details of our prototypes; in Section 4, we demonstrate the prototype from an example of processing massive Light Detection and Ranging (LiDAR) data and discuss the limitations of current work, and, finally, Section 5 concludes this paper and points out future research directions.

Problem Statement
Consider the example from the ecology and earth science domain, where researchers aim to monitor the changes of ecosystem structure over time or derive metrics of vegetation structure to model animal distributions and habitat suitability.For this purpose, they process raw data (3D point clouds) from countrywide, airborne Light Detection and Ranging (LiDAR) dataset (~16TB), using the Laserfarm workflow [11] to generate LiDAR-derived metrics related to ecosystem height, ecosystem cover, and ecosystem structural complexity.Two examples of statistical metrics that can be derived from LiDAR point clouds are shown in Figure 1, i.e., the 95 th percentile of normalized height (as a measure of ecosystem height) and pulse penetration ratio (as a measure of ecosystem cover).The 95 th percentile of normalized height quantifies the vegetation height of tallest plants in a given grid cell (e.g., trees in a forest patch) whereas the pulse penetration ratio describes vegetation openness as the ratio between the number of ground points relative to the total number of points within a given grid cell.By measuring them in two time periods from two country-wide LiDAR datasets, researchers can derive measures of ecosystem structural change.It is achieved by extracting statistical properties of LiDAR point clouds that are provided by Airborne Laser Scanning (ALS) surveys, usually in LAS/LAZ format.
Challenges.Researchers design algorithms and prototypes using a Jupyter environment such as a Jupyter notebook and conduct the scientific experiment on the local experimental platform or a small cluster.Those studies rely on appropriate algorithms, but even more on large-scale data, e.g., handling multi-terabyte datasets.However, the local compute and storage capacities might be inefficient in conducting large-scale scientific analysis, e.g., it is burdened by large-scale data inputs and the need for efficient and scalable computing.Besides, the notebook's pipeline steps or component modules can be presented as a workflow, as shown in Figure 2. Some modules as generic algorithms can be reused by different code and parallelized and scaled out on dedicated remote infrastructures, e.g., Cloud environments.Nevertheless, managing such a workflow in the Jupyter notebook is still lacking.Issues of the (un)expected overheads or performance bottlenecks caused by data transfer between containerized components in the NaaW are outside the scope of this paper; however, we indeed have considered this problem.When processing the large volume data with approximately 16TB, we are using a splitter module and a merger module to cope with this issue at present.The splitter module can partition the large data volume into multiple more minor data predefined by the user to remove the large data volume transfer between containerized components.The merger module is responsible for merging these distributed processed results.As a result, it does not involve very frequent transmissions of enormous data volume between containerized components; on the contrary, it improves the parallelism of data processing.The default workfl ow in the Jupyter notebook using the example of the Laserchicken software [25].LiDAR datasets (e.g., in LAS/LAZ/PLY format) are loaded and a set of target points defi ne the environment point cloud (EPC) and the target point cloud (TPC), respectively.The LiDAR point clouds are normalized, neighbors are computed and features (i.e., statistical properties of the point cloud) are subsequently extracted over neighbors and appended to the associated target points.This forms the enriched target point cloud (eTPC), which can be exported in several formats.Optionally, the EPC can be fi ltered before further processing [10].

Re lated Work
Many implemented Jupyter environment architectures are mainly based on HPC settings and cloud environments [15,16,18,19,20,21].For instance, the work of Milligan et al. [15,16] implemented the classic architectures on which researchers integrated the Jupyter platform with supercomputing resources to bridge the gap between exploratory data analysis and HPC, especially leverage Jupyter for interactive data-intensive supercomputing services.Their solutions offer HPC notebook service, a science portal for the GEMS platform for argo-informatics data sharing and analysis, and BinderHub services.The HPC notebooks service implemented at MSI (Minnesota Supercomputing Institute) permits the user to seamlessly run the interactive Jupyter notebook web application using normal batch-scheduled clustered computing resources.The BinderHub is a component for automatically launching Jupyter-ready Docker containers built by JupyterHub with repo2docker within a Kubernetes cluster using public cloud resources.Likely, Zonca et al. [19] proposed three deployment strategies for scaling up scientific applications on XSEDE resources at different levels of scalability, i.e., JupyterHub on HPC via the batch scheduler, JupyterHub on XSEDE Jetstream with Docker Swarm mode and with Kubernetes.The differences between deploying Docker in Swarm mode and Kubernetes mainly reflect the former providing notebooks with persistent storage and quota while the latter provides a fault-tolerant JupyterHub deployment with elasticity.

Scaling Notebooks as Re-confi gurable Cloud Workfl ows
copying or editing the notebook each time) to simplify the execution processes.Although the work of Henderson et al. [18] is close to NaaW since it also has the potential for scaling up different data inputs in the notebook and running these steps in parallel across multiple HPC nodes, it differs heavily from NaaW.Because it is based on NERSC HPC resources instead of a cloud environment with on-demand resource provisioning, it also performs limitations of the flexibility to reuse critical components in a notebook as part of a distributed workflow.
Yin et al. [17] proposed the CyberGIS-Jupyter framework for scalable geospatial analytics.The alternative solution adopts Jupyter notebooks instead of web GIS as the front-end interface to provide a consistent and agile playground for both developers and users.It uses JupyterHub and Docker Swarm as the CyberGIS-Jupyter server.Encapsulating CyberGIS capabilities within a pre-configured and containerized environment achieves on-demand resource provisioning through Openstack to elastically deploy and manage multiple virtual machine (VM) instances of the applications.The hybrid computing environment called ROGER integrated HPC and cloud resources offers the underlying infrastructure support for the reproducible deployment.Nevertheless, their approach still has several limitations related to the scientific workflow representation.For instance, their work did not emphasize pipeline or workflow.Also, the notebook performed like the whole job instead of a workflow-oriented application to be packaged out and submitted onto the CyberGIS-Jupyter cloud environment.
Most of the existing solutions focus on the execution of the notebook on a pre-configured infrastructure, e.g., HPC cluster or virtual machines provisioned in the cloud.In those solutions, there is no explicit workflow management in the notebook; users control the execution of the experiment steps via the cells in the notebook.The workflow logic and data flows are implicitly described in the order of the cells in the notebook, which may hamper the reconfiguration of the notebook at the cell level for different purposes.We aim to overcome those limits by providing extra Jupyter extensions to extract the notebook cells as reusable components, to compose new logic, and to automate the execution on remote infrastructure.Those extensions cover the key steps of workflow management.

Re quirements
To tackle the above challenges and gaps, we identify the following requirements for managing the workflow in notebooks and provide basic design ideas for our solution named NaaW.
• The workflow management process, e.g., the decomposition of a single notebook containing code fragments and workflow composition, must be flexible alongside a native Jupyter environment for scientific research.Users should not be restricted in their code fragments' selection and workflow design choices for their experiments.

Scaling Notebooks as Re-confi gurable Cloud Workfl ows
• The workflow building blocks, i.e., the reusable components from the notebooks, must be interoperable with self-defined workflow logic.The tool should enable users to select those reusable components to construct scientific pipelines and guarantee the dependency constraints between components.
• The approach must provide scalable solutions for large-scale scientific experimental analysis, especially for large datasets or complicated computation demands, and automatically execute the workflow on remote infrastructures such as the cloud.The scientific experimental process must be scalable and parallel to solve performance bottlenecks.
• The solutions should be embedded as part of the current Notebook ecosystem so that the scientists do not have to drop the advantages of the native Jupyter.

Pr oposed Jupyter Extensions
Based on the requirements, we proposed four Jupyter extensions.
Component Containerizer creates reusable workflow building blocks from the notebook cells.Technical details of this extension have been discussed in our earlier paper [22].This Jupyter extension allows users to select code cell(s) from Jupyter Notebooks effectively and generate reusable workflow building blocks (REST services) in a WYSIWYG manner.The extension can also containerize the services as self-contain deployable containers (i.e., Docker) and store them in a local catalog or remote Docker Hub.
Experiment Manager allows users to compose workflows using containerized notebook code (using the component containerizer).Users can compose dependencies among different blocks and construct distributed workflows by visually connecting the input and output parameters within the metadata description.This extension can save the generated workflow as workflow specification documents such as YAML or CWL syntax.For example, Figure 3 presents a straightforward vision of the experiment manager usage.Distributed Workflow Bus plans and schedules the workflow deployment and execution on remote infrastructures.The automation of the infrastructure services provisioning and deployment will be provided by the extension called remote infrastructure automator, to be discussed next.This extension enables users to submit the workflow specifications (created by the experiment manager), together with input data sources, resource budget (for using cloud services).The workflow bus first invokes the remote infrastructure automator to initialize the run-time infrastructure and then uses its built-in scheduling algorithms to schedule the workflow execution on the infrastructure, as shown in Figure 4. Currently, this function is built atop the Argo workflow engine, enabling developers to select built-in scheduler policies or even add their selfdefined scheduling strategies.Remote Infrastructure Automator plans the cloud infrastructure capacity and automates the resource provisioning and service deployment.This extension is extended from earlier work of Dynamic real-time infrastructure planner [24].

How do They Work Together
The four key components will be installed as Jupyter extension, as shown in Figure 5.The grey boxes present the four key components, i.e., the component containerizer, experiment manager, distributed workflow bus, and remote infrastructure automator.We also highlight the catalog that contains workflow building blocks and the dedicated remote infrastructure for scalable experiments.
Users prototype scientific pipeline steps using Jupyter environments (e.g., Notebooks) to conduct their experiments at a small scale, e.g., on a local computer or a small cluster.Based on the native Jupyter notebook front-end, users can use the component containerizer to encapsulate a cell as a REST service and containerize it (step 1), store the containerized components in a local catalog (step 2), and make further changes to the components (step 3).Using the experiment manager extension, the user can design a new workflow experiment (step 4) by selecting containerized components from the local catalog (step 5) and create a representation of the workflow (step 6).The workflow description will be executed by the distributed workflow bus (step 7) by first initializing virtual infrastructures (e.g., VMs or Kubernetes cluster) from providers (step 9) via the remote infrastructure automator extension.The runtime status and the results can be monitored by the workflow bus via a dashboard (step 10).

SYSTEM PROTOTYPE AND DEMONSTRATION
As discussed before, Figure 2 presents a sample workflow of the Laserchicken software from the ecology domain, which mainly consists of six functional modules, including the load, normalize, filter, compute neighbors, features, and export, each of them has different code fragments with additional input and output parameters.Practically, the whole process is placed and has to be run sequentially with a tight cell structure in the notebook, which strongly influences the performance and scalability of the experiment, e.g., reducing the parallelism and reusability of the code fragments.To validate our method, we demonstrate the critical components of workflow-oriented management in a Jupyter notebook environment using the above use case.
Figure 6 presents the interface of the component containerizer alongside a Jupyter notebook.The single notebook contains scientific pipeline steps with different component functions.Users can select the code fragments to encapsulate them as a component.And as shown in the left bar, it also automatically inspects the corresponding metadata, e.g., inputs, outputs, parameters, and package dependencies for each component encapsulation.And as workflow building blocks, they can be added to the catalog for further usage.
Figure 7 shows the function of experiment manager.The local catalog contains the metadata of encapsulated components.Users can select different components as reusable workflow building blocks to compose self-defined workflow logic, e.g., tasks and dependencies, and export the workflow.The desirable output of the experiment manager is the generated workflow specification document, which is one of the inputs of the distributed workflow bus module.
As shown in Figure 8, currently, the distributed workflow bus is implemented atop the Argo workflow engine for Kubernetes.It uses the default scheduler for the workflow planning and scheduling; the underlying infrastructure, such as virtual machines, still manually allocate virtual machines for launching the Kubernetes platform.Figure 9 shows the interface of the remote infrastructure automator (also called Cloud-cells).This Jupyter notebook extension allows the user to deploy dockers generated by Cloud-Cells to the cloud.But it has not yet been completely integrated into the NaaW method.

Disc ussion
In this paper, we discussed the key components in Notebook-as-a-Workflow (NaaW) solution.The four components are prototyped as extensions embedded in the Jupyter working environment of the scientists.This solution extends some of our previous work [22,24] and further extends to a workflow-oriented management approach to bridge the gap between the Jupyter environment and workflow management at scale.By demonstrating via a LiDAR processing example from the ecology and earth science domain, the proposed solution can achieve the workflow management process mentioned in the requirements.Using those Jupyter extensions, a scientist can 1) interactively encapsulate code fragments (one or more notebook cells) in the Jupyter notebook as reusable workflow components, 2) compose a workflow using those workflow components and customize the execution logic based on data volume and locations, 3) automate the cloud infrastructure based on the workflow requirements (data volume, and resource budgets), and 4) interactively execute the composed workflow on the cloud infrastructure to achieve the required scale.In the paper, we only focus on the key steps in workflow management, e.g., workflow component containerization from the notebook, workflow composition, infrastructure automation, and workflow execution.There are still several features that have not yet been implemented or can be improved.For instance, the experiment manager tool could not inspect the correctness of dependencies between two workflow building blocks.It needs an inspection before users save their generated workflow specifications and submit them to the distributed workflow bus module.In addition, the design and implementation of distributed workflow bus are not developed maturely now since it is still based on a third party, e.g., the Argo workflow engine.Many workflow scheduling algorithms are not well developed as well, e.g., we still use the default scheduler of the Argo workflow engine to deploy and execute the submitted workflows.The Scaling Notebooks as Re-confi gurable Cloud Workfl ows integration of distributed workflow bus and remote infrastructure automator does not work seamlessly.These two modules could not cooperate smoothly though both of them as Jupyter extensions can be built-in Jupyter environment.Some steps have to be completed manually.

CONC LUSION AND FUTURE WORK
In this work, we focus on managing the workflow in a Jupyter notebook architecture.We propose four core components to achieve our goal, i.e., the component containerizer, experiment manager, distributed workflow bus, and remote infrastructure automator, respectively.We 1) encapsulate the reusable cells as RESTful services and containerize them as portal components in the component containerizer module, 2) provide a composition tool for describing workflow logic of those reusable components, and 3) automate the execution on remote cloud infrastructure according to the distributed workflow bus and remote infrastructure automator.We validate the usability of the solution via a LiDAR use case from the ecology and earth science domain.This work is still developing and in continuous improvement.The missing features mentioned in the discussion section will be on our agenda for the next steps.

Figure 1 .
Figure 1.Two examples of statistical metrics derived from airborne LiDAR point clouds across the Netherlands to measure changes in ecosystem structure, including a visual comparison with images from Google Maps.(a, b) Changes in ecosystem height (measured from LiDAR point clouds with a metric called 95 th percentile of normalized height) to detect deforestation.The blue color indicates the decrease of vegetation height (i.e., 95 th percentile of normalized height) from 2011 to 2018, while the red color indicates the opposite scenario.(c,d) Changes in ecosystem cover (measured from LiDAR point clouds with a metric called pulse penetration ratio) to map the succession and re-growth of forests).The red color indicates the increase of pulse penetration ratio from 2011 to 2018, describing the vegetation cover has decreased during this time period, while the blue color indicates the opposite scenario.

Figure 3 .
Figure 3.A conceptual diagram of the experiment manager extension.

Figure 4 .
Figure 4.The data fl ow of the distributed workfl ow bus and remote infrastructure automator.

Figure 5 .Figure 6 .
Figure 5.The conceptual model of our proposed solution.

Figure 7 .Figure 8 .
Figure 7.The interface of the experiment manager modular.

as Re-confi gurable Cloud Workfl ows 2
Downloaded from http://direct.mit.edu/dint/article-pdf/4/2/409/2012400/dint_a_00140.pdf by UVA UNIVERSITEITSBIBLIOTHK SZ user on 21 October 2022 .we provide a workflow composition tool for presenting the workflow logic of those reusable components with visualization.3. we integrate the infrastructure automation tool to automate the workflow execution on remote cloud infrastructure.