Curriculum Development for FAIR Data Stewardship

Abstract The FAIR Guidelines attempt to make digital data Findable, Accessible, Interoperable, and Reusable (FAIR). To prepare FAIR data, a new data science discipline known as data stewardship is emerging and, as the FAIR Guidelines gain more acceptance, an increase in the demand for data stewards is expected. Consequently, there is a need to develop curricula to foster professional skills in data stewardship through effective knowledge communication. There have been a number of initiatives aimed at bridging the gap in FAIR data management training through both formal and informal programmes. This article describes the experience of developing a digital initiative for FAIR data management training under the Digital Innovations and Skills Hub (DISH) project. The FAIR Data Management course offers 6 short on-demand certificate modules over 12 weeks. The modules are divided into two sets: FAIR data and data science. The core subjects cover elementary topics in data science, regulatory frameworks, FAIR data management, intermediate to advanced topics in FAIR Data Point installation, and FAIR data in the management of healthcare and semantic data. Each week, participants are required to devote 7–8 hours of self-study to the modules, based on the resources provided. Once they have satisfied all requirements, students are certified as FAIR data scientists and qualified to serve as both FAIR data stewards and analysts. It is expected that in-depth and focused curriculum development with diverse participants will build a core of FAIR data scientists for Data Competence Centres and encourage the rapid adoption of the FAIR Guidelines for research and development.


INTRODUCTION
In 2019, the World Economic Forum estimated that, by 2025, an average of 463 exabytes of data (Tweets, email messages, Facebook posts, WhatsApp messages, clinical data, music files, etc.) will be created every day [1]. This data will come in different formats, such as images, text, or audio, and from different domains. In response, the big data landscape is redefining requirements for data curation infrastructure, which is evolving to meet the challenges [2]. By employing data analytics, the metadata of curated health data can provide insights into solving health problems, gearing the industry toward value-based healthcare and opening doors to remarkable advancements, while reducing costs. However, constraints such as the misrepresentation of data, privacy issues, siloed data, security, and data not being machine-readable can lead to false inferences being drawn from data analytics. While the FAIR Guidelines [3], which state that data should be Findable, Accessible, Interoperable and Reusable (FAIR), tend to mitigate some of these constraints, these principles are foreign to most of the stakeholders whose devices, infrastructures and research generate such data. Thus, there is a need to train data stewards using customised training to equip them with the skills required to implement the FAIR Guidelines. Accordingly, an appropriate curriculum needs to be designed, validated and deployed, which is the subject of this article.

Curriculum Development for FAIR Data Stewardship
The design of any curriculum has four critical components that address four questions:
· Why is instruction initiated?
· What needs to be taught to achieve the set intent and objectives?
· How can we connect all target learning outcomes?
· What has been realised and what other actions need to be taken in relation to the instructional programme, learners, and teachers?
Worldwide, these components are usually addressed differently depending on the philosophy of the domain curriculum and model on which a design is based [4]. The goal of curriculum development is to communicate knowledge effectively to learners. This article explores the frameworks implemented in data stewardship programmes, towards designing a curriculum for training data stewards, in an effort to equip them with the relevant skills.

Data Stewardship: Description, Roles and Goals
Data stewardship is a concept that is deeply rooted in the sciences and should be considered in any funded research. It relates to the procedure for gathering, sharing, and analysing data and reflects the values underpinning fair information practices [5]. Principally, data stewardship involves all activities related to research data management over the research lifecycle. It has the potential to improve research, as it improves data management approaches for the collection, storage, aggregation, and de-identification of data, as well as procedures for data release and use [6]. In 2020, Wildgaard [7] posited that the position of a data steward is trust-based. Data stewards are responsible for the administration, management and manipulation of data belonging to researchers or enterprises. However, the professionalisation of data stewardship can only progress with improved data steward education opportunities [7]. Therefore, as an activity that is part of performing creative research, data stewardship encompasses the design of all activities to do with (digital) data throughout the research project lifecycle, with the aim of optimising the usability, reusability, and reproducibility of the resulting data [8]. The study and practice of data stewardship are necessary for FAIR and open research. The European Open Science Cloud for Research Pilot Project [9] explains data stewardship as the shared responsibility of the professional groups involved in data management: data management and curation, data science and analytics, data services engineering, and domain research. Competences, skills groups, and organisational roles are defined around typical processes and stages in data management: planning and design, capture and processing, integration and analysis, evaluation and presentation, publishing and release, exposure and discovery, governance and assessment, scope and resources, advice and enabling.
Collins et al. [9] point out that transitioning to FAIR data stewardship requires education programmes for both data scientists and data stewards. In fact, both pedagogy and curricula are needed. Some of the popular existing curricular frameworks for digital curation and data science are EDISON [10], EOSCPilot [9] and DigCurV [11]. These curricular frameworks could be implemented as postgraduate degree programmes in universities [12] to increase the accessibility of professional data science and stewardship programmes.
Wildgaard et al. [12] explain that the major roles for a data steward are administrator, analyst, developer and agent of change. Like the role of the data system developer, the role of the data steward is to optimise the data through good project management, advise on FAIR Guidelines, create a data plan, facilitate collaboration and knowledge sharing to raise business intelligence, innovate, and develop procedures and guidelines. These authors also propose three models for data steward education. The first model is for students with bachelor's degrees; it spans one year for students with programming skills and two years for students without. The second model is for PhD students or equivalent from any university faculty. The third model is for students with professional studies or technical education and vocational training. Some of the other training options that could be explored for data stewardship as part of continuing professional development are summer schools, on-the-job training, workshops, training-of-trainers, and online learning [9]. FAIR-themed programmes, like workshops, conference sessions, lectures, webinars, hackathons, visiting scholar programmes and so forth, could also be adopted to enhance FAIR data stewardship. All of these methods have proven to be effective in training students from all disciplines in the foundational data skills they need to be professional data stewards. For example, CODATA-RDA [13] organised a short course programme in the form of a summer school in 2019 to upskill the research community for professional FAIR data stewardship. Some of the subjects taught were research data science, research data management, software and data carpentry, machine learning, visualisation and computational infrastructure.
Such training requires universities and other data-rich facilities to invest in Data Competence Centers (DCCs). In the FAIR Data Science environment, these are called Data Stewardship Competence Centers (DSCCs), which are established to embed professional, institution-wide research data stewardship and its related infrastructure, and which collaborate with the data processors in their institutions to enable better data management and comply with the FAIR Guidelines (GO FAIR). Rosenbaum [6] agrees that the majority of data stewards have good research data management and domain-specific knowledge, but notes that it would be beneficial to provide pedagogical training to impart the soft skills required to efficiently engage with researchers and meet their needs. Accordingly, this article proposes designing a digital skills curriculum for FAIR data stewardship. The proposed curriculum is divided into three main courses: computing and information technology, analytics, and FAIR data.

EDISON Data Science Framework
The EDISON Data Science Framework (EDSF) provides a basis for the definition of data science and enables the definition of other components related to data science education, training, organisational roles, and skills management, as well as professional certification. This framework contains five main components:

· Data Science Competence Framework (CF-DS) [10]
· Data Science Body of Knowledge (DS-BoK) [14]
· Data Science Model Curriculum (MC-DS) [15]
· Data Science Professional Profiles (DSPP) and occupations taxonomy [16]
· Data Science Taxonomy and Scientific Disciplines Classification
The CF-DS provides the overall basis for the EDSF. The core CF-DS competences and skills groups identified by the EDISON Community [10] as essential for data scientists in different workplaces include:
· Data Science Analytics (DSDA), which uses suitable statistical methods and predictive analytics (such as statistical analysis, machine learning, data mining, and business analytics) on presented data to deliver insights and discover new relations.
· Data Science Engineering (DSENG), which uses engineering principles to research, design, develop and implement new instruments and applications for data collection, analysis and management.
· Data Management and Governance (DSDM), which relates to the development and implementation of a data management approach (using techniques such as software and applications engineering, data warehousing, big data infrastructure and tools for data stewardship, curation, and preservation) for data collection, storage, preservation, and availability for further processing.
· Data Science Research Methods and Project Management (DSRMP), which relates to the research domain, and Data Science Business Process Management (DSBPM), which creates new understandings and capabilities by using scientific methods (such as hypothesis, test/artefact, and evaluation) or similar engineering methods to discover new approaches to create new knowledge and achieve research or organisational goals.
· Data Science Domain Knowledge (DSDK), which uses domain knowledge (scientific or business) to develop relevant data analytics applications and adopt general data science methods for domain-specific data types and presentations, data and process models, organisational roles and relations.
The DS-BoK defines the knowledge areas (KA) required for building a data science curriculum that supports identified data science competences. The DS-BoK is organised by knowledge area groups (KAG) that correspond to the CF-DS competence groups. These are Data Science Analytics, Data Science Engineering, Data Management, Research Methods and Project Management, and Business Analytics [14].
The MC-DS is built on the CF-DS and DS-BoK: learning outcomes are defined based on CF-DS competences, and learning units are mapped to knowledge units in the DS-BoK. Three mastery (or proficiency) levels are defined for each learning outcome to allow for flexible curriculum development and profiling for different data science professional profiles.
The DSPP is defined as an extension of the European Skills, Competences, Qualifications and Occupations (ESCO) occupations taxonomy, using the ESCO top classification groups. The definition of DSPP provides an important instrument for defining effective organisational structures and roles related to data science positions. It can also be used for building individual career paths and for establishing the corresponding transferability of competences and skills between organisations and sectors.
The Data Science Taxonomy and Scientific Disciplines Classification serves to maintain consistency and links between the four core components of EDSF (CF-DS, DS-BoK, MC-DS, and DSPP).

ESCO Framework and Platform
The ESCO classification identifies and categorises skills, competences, qualifications and occupations relevant for the European Union labour market, education and training. It systematically shows the relationships between the different concepts [17]. The ESCO Data Science Professional Profiles (DSPP) occupation hierarchy is: managers, professionals, technicians and associate professionals, and clerical support workers. The ESCO DSPP taxonomy can be extended to situations where proposed profile competences and organisational roles are similar to CEN Workshop Agreement (CWA) 16458 ICT profile definitions, such as to:
· Managers who are production and specialised services managers (data science/big data infrastructure managers), whose roles span DSP01-DSP03
· Professionals from three major groups:
  - Science and engineering professionals (data science professionals), whose roles span DSP04-DSP09
  - Information and communication technology (ICT) professionals (data science technology professionals), whose roles span DSP10-DSP13
  - Science and engineering professionals (database and network professionals), whose roles span DSP14-DSP16
· Technicians and associate professionals, such as science and engineering associate professionals (data science technology professionals), whose roles span DSP17-DSP19
· Clerical support workers, such as general and keyboard clerks (data handling and support workers), whose roles span DSP20-DSP22
Figure 1 illustrates the existing ESCO hierarchy and the proposed new data science classification groups and corresponding new data science related profiles. The table in this figure shows competence groups relevant to each profile by indicating competence relevance from 0 to 5 (0: not relevant, 5: very important). The profile definitions for specific roles for DSP01-DSP22 are detailed on the EDISON Community website [16]. For example, the profile for data steward is DSP10 under the hierarchy of data science technology professionals.
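The profile ranges listed above can be read as a simple lookup table. The following is a minimal sketch in Python, not part of the EDISON or ESCO specifications: the names `PROFILE_GROUPS` and `profile_group` are hypothetical, and the group labels are shortened from the text for readability.

```python
# Profile ID ranges as listed in the text; labels are abbreviated.
PROFILE_GROUPS = [
    (1, 3, "Managers (data science/big data infrastructure managers)"),
    (4, 9, "Science and engineering professionals (data science professionals)"),
    (10, 13, "ICT professionals (data science technology professionals)"),
    (14, 16, "Science and engineering professionals (database and network professionals)"),
    (17, 19, "Technicians and associate professionals"),
    (20, 22, "Clerical support workers"),
]


def profile_group(profile_id: str) -> str:
    """Return the classification group for a profile ID such as 'DSP10'."""
    if not profile_id.startswith("DSP"):
        raise ValueError(f"unknown profile: {profile_id}")
    number = int(profile_id[3:])
    for first, last, group in PROFILE_GROUPS:
        if first <= number <= last:
            return group
    raise ValueError(f"unknown profile: {profile_id}")


# The data steward profile (DSP10) falls in the ICT professionals group.
print(profile_group("DSP10"))
```

Keeping the ranges in one table means the lookup stays consistent with the listed hierarchy if the profile definitions are revised.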
When 'data steward' is mapped to the CF-DS competences and skills groups, the relevance level for DSDA, DSENG, DSRM and DSDK is 3. The data steward profile is most relevant to DSDM. Overall, the profile maps well to the CF-DS competence groups, with an average value of 3. The importance of the role of the data steward is recognised in the European Commission's High Level Expert Group report on European Open Science Cloud (October 2016) [18], which identifies the critical need for core data experts and data stewards in particular. The definition of data steward competences, and training in them, is an important component of the GO FAIR initiative [19,20], as well as the Horizon 2020 EOSCPilot project activity [21,8].

NUFFIC (the Dutch organisation for internationalisation in education) Digital Innovations and Skills Hub (DISH) is a distance education programme sponsored by the Dutch Ministry of Foreign Affairs under the Orange Knowledge Program in conjunction with 12 partners from different countries in East Africa. The project targets learners with limited opportunities, such as marginalised youth, including refugees and displaced persons from the Tigray region (Ethiopia), Garowe and Mogadishu (Somalia), Kassala and Khartoum (Sudan), Wau and Juba (South Sudan), and other conflict-affected areas in the East African region.

Course Curriculum: Topics and Description
Given the demography of the targeted learners, the training curriculum for this data stewardship specialisation programme is designed on the assumption that the students have little or no prior computer science skills. Thus, the training curriculum starts from a beginner's perspective and is divided into three courses of five to seven modules, with each course being a prerequisite for the next. The course details are given in Tables 2-4.

This course is designed to introduce learners to the role of an IT support specialist in an organisation. It intends to prepare them for an entry-level role at an IT help desk or in IT support. Learners are introduced to how to identify and verify installed software, and how to update and/or uninstall computer software. Learners are introduced to the hardware components of a computer system. This is followed by an explanation of how the components are arranged and interact within the system. In this module, learners are also introduced to how to resolve slow boot times, device failures, and other machine issues using the 'Task Manager', 'Device Manager', 'Windows Defender', and 'System Performance' tools. Other aspects covered are the roles performed by information technology (IT) help desks, such as ticketing systems and customer service. Information technology service management (ITSM) processes and components are explored too.

Week 11

- Overview of project management and related terms
- Phases and processes of project management
- Project methodologies
- Importance and advantages of project management
- Project management standards
- PRINCE2
- PMBOK
- Contemporary issues in project management
- Human resources and staffing
- IT project risk management
- IT project cost management
- Change management

The course content will be contextualised for business and agriculture, i.e., how Python programming can be used to build systems that make it easier for businesses and modern farms to operate efficiently. Examples will be based on different problems that occur within the daily operations of a small business and how to creatively solve these problems with programming. It will also cover examples of how programming can be applied in an agricultural context and give a big-picture overview of how technology powered by Python has been able to improve agricultural systems. At the end, students should understand how to frame business/process questions and how to solve these problems using Python.

Introduction to data and data science
- Definition, types and sources of data
- Lifecycle of data science
- Python lists and NumPy arrays
- NumPy use cases
Introduction to data analysis with Pandas
- Series and data frames
- Importing and exporting datasets, data cleaning and pre-processing
- Exploratory data analysis
- Computing descriptive statistics
- Combining and merging datasets
Introduction to data visualisation
- Data visualisation with Seaborn
- Plotting continuous data and categorical data with Seaborn

The course content will be contextualised for business and agriculture, i.e., how Python programming can be used to build systems that make it easier for businesses and modern farms to operate efficiently. Examples will be based on different problems that occur within the daily operations of a small business and how to creatively solve these problems with programming. It will also cover examples of how programming can be applied within an agricultural context and give a big-picture overview of how technology powered by Python has been able to improve agricultural systems. At the end, students should understand how to frame business/process questions and how to solve these problems using Python.
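To illustrate the kind of exercise such a module might contain, the following is a minimal sketch using Pandas that derives a revenue column, computes descriptive statistics, and merges two datasets. It is not taken from the course materials: the farm-shop data, column names and variable names are all hypothetical.

```python
import pandas as pd

# Hypothetical daily sales records for a small farm shop.
sales = pd.DataFrame(
    {
        "product": ["maize", "beans", "maize", "milk", "beans"],
        "quantity": [10, 4, 6, 12, 5],
        "unit_price": [2.0, 3.5, 2.0, 1.0, 3.5],
    }
)

# Hypothetical lookup table with one row per product.
products = pd.DataFrame(
    {
        "product": ["maize", "beans", "milk"],
        "category": ["grain", "legume", "dairy"],
    }
)

# Pre-processing: derive the revenue of each sale.
sales["revenue"] = sales["quantity"] * sales["unit_price"]

# Descriptive statistics (count, mean, std, min, quartiles, max).
summary = sales["revenue"].describe()

# Combine the two datasets on the shared "product" column.
merged = sales.merge(products, on="product", how="left")

# A business question: which category brings in the most revenue?
revenue_by_category = merged.groupby("category")["revenue"].sum()

print(summary)
print(revenue_by_category)
```

A left merge keeps every sale even if a product were missing from the lookup table, which makes gaps in the reference data visible rather than silently dropping rows.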

CS2.3 Introduction to Business Intelligence
Week 8
The module is aimed at equipping learners with the skills to mine data from a relational database, extract valuable information and create meaningful dashboards that can be used by business owners to make day-to-day decisions. In addition, the module will give an introduction to some of the open-source business intelligence software and how to quickly set up and use it.

Introduction to statistical thinking
- Population data vs sample data
- Parameter vs statistic
- Descriptive statistics
- Scale of measurement
- Inferential statistics
Introduction to machine learning
- Supervised learning methods: regression, classification
- Unsupervised learning methods: clustering, factor analysis, principal component analysis
- Training and evaluating models
- Regression metrics
- Classification metrics

This module builds on the previous introduction to data science. Learners will be taught how to make statistical inferences to draw clear conclusions from data. It also introduces machine learning, covering supervised and unsupervised learning. Learners should be able to create machine learning models and discover underlying clusters in a dataset.
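The distinction between population parameters and sample statistics listed among the topics above can be sketched with Python's standard `statistics` module. This is an illustrative example only, under the assumption of a made-up population (daily milk yields); none of the numbers come from the course.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: daily milk yield (litres) for 1,000 cows.
population = [random.gauss(20, 4) for _ in range(1000)]

# Parameters describe the whole population.
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)  # divides by N

# Statistics are computed from a sample and estimate the parameters.
sample = random.sample(population, 50)
x_bar = statistics.fmean(sample)
s = statistics.stdev(sample)  # divides by n - 1

print(f"population mean {mu:.2f}, sample mean {x_bar:.2f}")
print(f"population sd {sigma:.2f}, sample sd {s:.2f}")
```

The sample values will generally be close to, but not equal to, the population values; quantifying that gap is the starting point for inferential statistics.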

CS3.2 Regulatory Framework
Week 2
- Key concepts of data regulatory framework
- General Data Protection Regulation (GDPR) and its principles
- Data regulation specific to Sudan, South Sudan, Somalia and Ethiopia
- FAIR Guidelines

The emergence of the Internet as a global telecommunications network has had a huge impact on how we view and apply data protection and regulations. Before the massive expansion of the Internet, data was of minor interest and did not attract significant global attention. This module provides participants with an understanding of what a regulatory framework is and what it is used for. Learners will understand general data protection principles, national data regulations, and the basics of the FAIR Guidelines, and will be able to explain why we need the FAIR Guidelines and how they benefit their country.

Mode and Duration
This programme will span 36 weeks (12 weeks per course) with a total of 3 courses: Computer Science I (CS1), Computer Science II (CS2) and FAIR Management Principles (CS3). The core topics that pertain to data stewardship will be Introduction to Data Science I and II, Regulatory Framework, FAIR: Data Management, Data Point Installation, and Data for Health and Semantic Data.
The weekly activities summary for each course is as follows:
• Week 1: Registration and Orientation
• Weeks 2 to 11: Learning Activities and Interaction
• Week 12: Examination
Considering the possible locality of the target participants and the limited infrastructure available in such places, the distance education model will take a blended learning approach, in which online learning is combined with face-to-face interaction at partner universities. Each participant is expected to devote a minimum of 12 hours a week, of which 4 hours are for self-study of the provided learning resources, 4 hours for online activities and interactions, and 4 hours for assessments and assignments.

Expected Learning Outcomes, Activities and Assessments
In addition to registration in week 1, learners are required to participate in two short modules: Peace Building and Conflict Resolution (to expose them to the skills needed to coexist and resolve conflicts in order to maintain peace in their communities) and Trauma and Mental Health (to help them to cope with the violence and trauma that they might have experienced in the past). To enable them to access the enormous opportunities inherent in the IT world, the modules on Digital Technologies, Computer Networks, IT Service Management, and Project Management will be designed to teach a wide range of skills in digital technology, content creation, software installation, basic cyber security, IT productivity tools, hardware coupling and troubleshooting, maintenance of local area networks, and other relevant topics. Learning will be facilitated via a learning management system using activities such as video conferencing, chats, online forums and so forth for interaction between teachers and learners and also for peer-to-peer communication. Practical sessions will be organised for students to demonstrate the skills acquired. Quizzes and assignments will also be given to gauge outcomes and these will be graded. It is expected that the course will not only qualify learners for IT-related jobs, but that they will also be capacitated to perform exceedingly well in other areas using the skills acquired.
The Computer Science Level 2 (CS2) modules were developed to teach the learners intermediate skills through modules such as Computer Programming with Python, Introduction to Data Science, Business Intelligence, Digital Marketing, Front End Web Development with Angular, Docker and React (Web), and React (Native). Learners are expected to be able to write Python programs, as this is essential for data science. The Data Science and Business Intelligence modules will groom learners in the world of machine learning, data analytics and business analytics. Tech skills, which can provide a career path, are also taught. Marketable skills will be taught, such as skills in using digital marketing concepts to manage the digital platforms of business organisations and create digital advertising campaigns for small and medium scale businesses; skills in JavaScript for front end web development; and skills in Angular, Docker, and React to expose learners to software engineering. Facilitation, engagement, practical and assessment activities similar to those in CS1 will be used to teach, assess and encourage learners.
Computer Science Level 3 (CS3) modules are an extension of Computer Science Level 2 (CS2). Learners will be exposed to statistical thinking, supervised and unsupervised learning, and regression. Another interesting topic in FAIR Data is called FAIR Data Trains. The students will be exposed to FAIR Data for Health, which explores how linked health data drives research, better use and learning from data, and further contributions to patient care. In addition, learners will be taught the FAIR Guidelines for data management as well as FAIR Data Point installation, Docker installation, the creation of machine-readable metadata, catalogues, datasets, and distribution. Students will also be shown how to FAIRify existing datasets using linked data and semantics modelling. The main objective of the course at this level is to understand the role of a data scientist in the industry and become acquainted with different data presentation formats, understand basic statistical thinking, understand machine learning techniques (such as supervised and unsupervised learning), understand basic concepts such as (sensitive) personal data and FAIR Guidelines, apply the FAIR Guidelines, know what data management and a data management plan (DMP) are, know the content elements that make up a DMP, be able to develop a FAIR DMP, and learn tools and techniques for the FAIRification of data.
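As an illustration of what machine-readable metadata for a catalogue, dataset, and distribution can look like, the following is a minimal sketch that builds a JSON-LD document using terms from the W3C DCAT and Dublin Core vocabularies. It is not the output of a FAIR Data Point installation and does not claim to match the exact metadata profile such a service validates against; the base URL, identifiers and titles are hypothetical.

```python
import json

# Hypothetical base URL; a real FAIR Data Point assigns its own identifiers.
BASE = "https://example.org/fdp"

metadata = {
    "@context": {
        "dcat": "http://www.w3.org/ns/dcat#",
        "dct": "http://purl.org/dc/terms/",
    },
    "@id": f"{BASE}/catalog/health",
    "@type": "dcat:Catalog",
    "dct:title": "Example health data catalogue",
    # A catalogue lists its datasets.
    "dcat:dataset": [
        {
            "@id": f"{BASE}/dataset/clinic-visits",
            "@type": "dcat:Dataset",
            "dct:title": "Example clinic visits dataset",
            # A dataset lists the concrete forms in which it is available.
            "dcat:distribution": [
                {
                    "@id": f"{BASE}/distribution/clinic-visits-csv",
                    "@type": "dcat:Distribution",
                    "dct:title": "Clinic visits as CSV",
                    "dcat:mediaType": {
                        "@id": "https://www.iana.org/assignments/media-types/text/csv"
                    },
                    "dcat:downloadURL": {
                        "@id": f"{BASE}/files/clinic-visits.csv"
                    },
                }
            ],
        }
    ],
}

# Serialise to JSON-LD text that software agents can parse.
print(json.dumps(metadata, indent=2))
```

The nesting mirrors the catalogue, dataset and distribution layers mentioned above: each layer points to the next, so a machine can navigate from a catalogue down to a downloadable file.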

CONCLUSION AND FURTHER DEVELOPMENTS
This article reviewed existing curricula, such as the EDISON framework, for data science professionals. The presented profiles are defined based on the ESCO taxonomy and include the following groups: managers (DSP01-DSP03), professionals (DSP04-DSP09), professional data management/handling (DSP10-DSP13), professional (database) technical (DSP14-DSP16), professional technicians (DSP17-DSP19), and support and clerical workers (DSP20-DSP22). This framework defines data steward relevance and profile as DSP10. It is anticipated that all educational requirements of a data steward are met in the curriculum provided, which blends the skills involved in data stewardship and the FAIR Guidelines. A student who has satisfied all requirements will be certified as a FAIR data scientist and will be able to serve as both a FAIR data steward and analyst. In-depth and focused curriculum development with diverse participants will build a core of FAIR data scientists. This will encourage the rapid adoption of the FAIR Guidelines for data for research and development.