Pilot Study on the Intercalibration of a Categorisation System for FAIRer Digital Objects Related to Sensitive Data in the Life Sciences

Abstract
Sharing sensitive data is a specific challenge for research infrastructures in the field of life sciences. For that reason a toolbox has been developed, providing resources for researchers who wish to share and use sensitive data, to support the workflows for handling these kinds of digital objects. Common and community approved annotations are required to be compliant with FAIR principles (Findability, Accessibility, Interoperability, Reusability). The toolbox makes use of a tagging (categorisation) system, allowing consistent labelling and categorisation of digital objects, in terms relevant to data sharing tasks and activities. A pilot study was performed within the Horizon 2020 project EOSC-Life, in which two experts from each of 6 life sciences research infrastructures were recruited to independently assign tags to the same set of 10 to 25 resources per infrastructure related to sensitive data management and data sharing (110 resources in total). Summary statistics of agreement and observer variation per research infrastructure are provided. The pilot study has shown that experts were able to attribute tags, but in most cases with considerable observer variation between experts. In the context of CWFR (Canonical Workflow Frameworks for Research), this indicates the necessity for careful definition, evaluation and validation of parameters and processes related to workflow descriptions. The results from this pilot study were used to tackle this issue by revising the categorisation system and providing an updated version.


INTRODUCTION
The Horizon 2020 cluster project EOSC-Life brings together the 13 Life Science 'ESFRI' (European Strategy Forum on Research Infrastructures) research infrastructures (RIs) to create an open, digital and collaborative space for biological and medical research (https://www.eosc-life.eu/). Sharing sensitive data is a specific challenge within EOSC-Life. For that reason a toolbox has been developed, providing resources for researchers who wish to share and use sensitive data, to support the workflows for handling these kinds of digital objects (1). The sensitivity of the data normally stems from it being personal data, but can also be caused by intellectual property considerations, biohazard concerns, or because it falls within rules or protocols such as the Nagoya protocol (2). The toolbox does not create new content. Instead, it allows researchers to find existing resources that are relevant for sharing sensitive data across all participating research infrastructures (F in the FAIR Guiding Principles for scientific data management and stewardship, providing guidelines to improve the Findability, Accessibility, Interoperability, and Reuse of digital assets (3)). The toolbox provides links to recommendations, procedures, and best practices, as well as to software (e.g., tools and scripts supporting workflow) to support data sharing and reuse. It makes use of a tagging system, allowing consistent labelling and categorisation of digital objects, in terms most relevant to data sharing tasks and activities. The first version of the categorisation system was developed and its description published before executing the study (4). Components of the system were based upon existing classification systems (5)(6)(7)(8)(9)(10)(11). As a result of this pilot study, the categorisation system has been improved and simplified.
The work is relevant i) for developing ideas about tagging workflows within CWFR (Canonical Workflow Frameworks for Research) and ii) to the characterisation and use of FAIR digital objects (12). Handling of sensitive data and digital objects is a major issue and a challenging topic for many research infrastructures in the life sciences. The workflows around sensitive data are extremely complex and highly dependent on the regulatory environment. The toolbox is targeted at the discovery of resources dealing with sensitive data through a tagging system, which characterises the resources using cross-community approved categories. One of these categories provides the stage of digital objects in the data sharing life cycle. The toolbox also characterises, in broad terms, the data type investigated, contributing to the description of data collections as required in the context of CWFR. Because the work spans all life-science infrastructures, it also offers valuable input into the discussion of the applicability of generic workflow fragments across infrastructures and an inter-RI approved tagging workflow system. Thus, the toolbox and the underlying categorisation system will contribute to the further development of CWFR in the area of digital objects with sensitive data.

RELATED WORK
Various resources have been developed to support the management of sensitive data. These include codes of conduct, guidelines, recommendations, policies, and descriptions of best practice, but also computer tools and services (e.g., for de-identification of data). For identifying these resources, various databases, catalogues and repositories are available and can be used, often with their own internal tagging system. For specific areas dedicated portals have been developed and implemented, such as the ELSI (Ethical, Legal and Social Implications) Knowledge Base from BBMRI (Biobanking and Biomolecular Resources Research Infrastructure) (https://www.bbmri-eric.eu/elsi/knowledge-base/) or the RDA (Research Data Alliance) COVID-19 Recommendations and Guidelines for Data Sharing (https://www.rd-alliance.org/group/rda-covid19-rda-covid19-omics-rda-covid-19-epidemiology-rda-covid19-clinical-rda-covid19-0). The latter incorporates a tagging system developed to characterise documents, allowing better support for searching and filtering. What is still missing is a system providing guided access to resources dealing with sensitive data in general and spanning the full area of life sciences.

PROBLEM FORMULATION
A large part of the compliance with FAIR principles, and the creation of minimal metadata sets required for FAIR data, are driven by community level processes (17,18). Any toolbox being developed should be based on common and community approved metadata, and it should refer to standardised workflows and an integrated approach to arrive at community approval for FDO and CWFR concept requirements. The categorisation system under study is a crucial part of these common, appropriate and sufficient metadata (19). The objective of the pilot study was to evaluate the first version of the categorisation system by human experts from different life science RIs with respect to usefulness, reliability and consistency. From the results of the study an improved version of the categorisation system was provided and this will be used as a central component within the toolbox to support rapid and intuitive retrieval of resources by future users. Additionally, the study was intended to provide insights into handling discrepancies in developing and applying metadata, through a defined testing and community approval process.

APPROACH AND METHODOLOGY
The pilot study was performed within the EOSC-Life Work Package (WP) dedicated to sensitive data, supported by the partners of this WP and coordinated by the European Clinical Research Infrastructure Network (ECRIN) (https://ecrin.org/). The study protocol was registered before the study started on 8 December 2020 (16). The pilot study was organised using a strict methodology. Each involved infrastructure nominated two experts willing to perform the individual assessment of a number of typical resources around sensitive data to be included in the EOSC-Life toolbox. The experts from the involved infrastructures selected a given number of resources: 25 for each of ECRIN, BBMRI, and EATRIS (European Advanced Translational Research Infrastructure in Medicine), and at least 10 for each of EMBRC (European Marine Biological Resource Centre), ERINHA (European Research Infrastructure on Highly Pathogenic Agents), and Euro-Bioimaging. In all cases they spanned a wide range of resource types (e.g., legislation & regulations, position papers, policies & principles, background & explanatory material, tools). The experts selected the resource types that were relevant for their research infrastructure and then assessed these resources using the categorisation system independently of each other.
As a technologically agnostic first step, the evaluation was supported through the bibliographic tool Zotero Tags (https://www.zotero.org/). The categorisation system used in the pilot study consisted of eight dimensions: resource type, research field, research design, data type, stage in data sharing life cycle, geographical scope, specific topics or keywords and targeted group (see the short form of categorisation system in supporting material S1 and (4)). It was a requirement that all dimensions of the categorisation system were applied to each resource. Multiple tags per dimension were possible.
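As an illustration only, the requirement that every dimension be applied to each resource, with multiple tags allowed per dimension, can be expressed as a simple completeness check. The Python sketch below is not part of the study, which relied on Zotero tags rather than code; the dimension identifiers are an illustrative spelling of the eight dimensions listed above, and the tag values are hypothetical placeholders rather than entries from the categorisation system (see S1).

```python
# Identifiers mirror the eight dimensions named in the text; the
# snake_case spelling is an illustrative choice, not an official schema.
DIMENSIONS = (
    "resource_type",
    "research_field",
    "research_design",
    "data_type",
    "stage_in_data_sharing_life_cycle",
    "geographical_scope",
    "specific_topics",
    "targeted_group",
)


def missing_dimensions(tagging):
    """Return the dimensions for which a tagging record has no tag.

    `tagging` maps a dimension name to a set of tags. Several tags per
    dimension are allowed, but every dimension needs at least one.
    """
    return [d for d in DIMENSIONS if not tagging.get(d)]


# Hypothetical tagging of one resource (all tag values are made up).
example = {
    "resource_type": {"guideline"},
    "research_field": {"clinical research"},
    "research_design": {"not applicable"},
    "data_type": {"imaging data", "genomic data"},
    "stage_in_data_sharing_life_cycle": {"data transfer"},
    "geographical_scope": {"europe"},
    "specific_topics": {"informed consent"},
}

print(missing_dimensions(example))  # ['targeted_group']
```

A check of this kind would flag incomplete assessments before agreement statistics are computed.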
Summary statistics of agreements and disagreements between the two experts of each research infrastructure were generated. Separately for each research infrastructure and per category, cases where both experts agreed to assign a tag to a resource (yes-yes) and cases where one expert used a tag and the other did not (yes-no or no-yes) were counted. The agreement rate was calculated as the percentage of "yes-yes" cases among all tags that were used at least once. To measure interrater reliability between the assessments of experts from the same infrastructure, the kappa coefficient developed by Fleiss was applied (17). The statistical analysis was performed using the statistical software R version 4.0.2.
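The two measures described above can be sketched as follows. This is a minimal Python illustration, not the R code used in the study. It assumes that, for the kappa of a single resource, every tag available in the categorisation system is treated as a subject rated "yes" or "no" by the two experts; the text does not state which tags entered the calculation. All tag names are placeholders.

```python
def agreement_rate(tags_a, tags_b):
    """Percentage of yes-yes cases among tags used by at least one expert."""
    used = tags_a | tags_b
    if not used:
        return None  # no tag used at all: the rate is undefined
    return 100.0 * len(tags_a & tags_b) / len(used)


def fleiss_kappa(counts):
    """Fleiss' kappa for a table of rating counts.

    `counts[i][j]` is the number of raters who put subject i in category j;
    every row must sum to the same number of raters (at least 2).
    """
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])
    # Observed agreement, averaged over subjects.
    p_bar = sum(
        sum(c * (c - 1) for c in row) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_subjects
    # Chance agreement from the overall category proportions.
    p_e = sum(
        (sum(row[j] for row in counts) / (n_subjects * n_raters)) ** 2
        for j in range(n_categories)
    )
    if p_e == 1.0:
        return None  # all ratings fall in one category: kappa is undefined
    return (p_bar - p_e) / (1.0 - p_e)


def kappa_for_resource(all_tags, tags_a, tags_b):
    """Assumption: each available tag is a subject rated yes/no by two experts."""
    counts = []
    for tag in all_tags:
        yes = (tag in tags_a) + (tag in tags_b)
        counts.append([yes, 2 - yes])
    return fleiss_kappa(counts)


# Placeholder data: ten available tags, three assigned by each expert.
all_tags = [f"t{i}" for i in range(1, 11)]
expert_a = {"t1", "t2", "t3"}
expert_b = {"t2", "t3", "t4"}

print(agreement_rate(expert_a, expert_b))                          # 50.0
print(round(kappa_for_resource(all_tags, expert_a, expert_b), 3))  # 0.524
```

In this toy example the experts agree on two of the four tags used at least once, giving an agreement rate of 50%. The kappa value also depends on how many unused tags are counted as joint "no" ratings, which is why the choice of subjects is flagged as an assumption.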

EXPERIMENTS AND ANALYSIS
The agreement rates between experts per infrastructure and category are presented in the supporting material (see supplementary material S2: Final study report). The rate was highest for BBMRI, at 70.2%. For the remaining research infrastructures, the agreement rates were considerably lower, ranging from 33.2% (ECRIN) to 20.2% (EMBRC). The agreement rates also differed between the categories. Rates higher than 40% were achieved for category 2 (research field, 49.3%), category 6 (geographical scope, 44.5%) and category 1 (resource type, 41.8%). The lowest rates were measured for category 3 (research design, 18.3%) and category 5 (stage in data sharing life cycle, 25.1%). It is important to note that some experts assigned many more tags from specific categories than other experts, contributing significantly to the disagreement.
The findings were confirmed by the interobserver reliability analysis (Table 1). A high kappa value was found for BBMRI (median kappa for 25 resources: 0.84). The kappa values were much lower for the other research infrastructures, with median kappa values ranging from 0.44 for ECRIN to 0.22 for EMBRC. For BBMRI, only one assessment was lower than 0.5 and only 9 assessments were lower than 0.8. For ECRIN, the assessment of 6 out of 25 resources resulted in a kappa >= 0.50. This was the case for only 2 out of 12 resources for EMBRC, 2 out of 25 for EATRIS and 1 out of 10 for ERINHA. All other assessments resulted in kappa values below 0.50.
Several experts suggested new tags. These covered additional "resource types" that were not listed (e.g., webinar, template, Q&A), new "research fields" that were not included (e.g., law, ethics, sociology, genomic research), the answer "any" to "research design" and "data type", more generic terms additional to specific "stages of data sharing" (e.g., data transfer, data sharing, secondary processing of data), additional "specific topics" (e.g., code of conduct, ELSI, COVID-19, material transfer agreement) and additional "perspectives" (e.g., participant, resource provider).

The results from this pilot study were used to revise the categorisation system, generating an improved and simplified version 2 (Figure 1 and supplementary material S3: Short form of the categorisation system (Version 2)).

DISCUSSION
BBMRI had already performed major work related to the handling of sensitive data in the field of health research, so the high interobserver correspondence in the pilot study is not surprising. The BBMRI ELSI Knowledge Base had already been structured using resource categories that included topic, areas of interest and geographical scope. The resources assessed in the pilot study had been selected from this knowledge base and existing tags were used in the assessment. The resources selected by BBMRI are closely related to the BBMRI work and concentrate on ELSI aspects, covering biobanks, de-identification, ethical handling, GDPR, consultation, transfer agreements, informed consent, etc. In contrast, some of the research infrastructures deliberately tried to select a wide variety of resources, to create a more demanding test of the categorisation system (e.g., EATRIS).
With respect to the diverging results of the remaining five research infrastructures, the following mechanisms were identified as possible ways of improving the categorisation system, to achieve the goal of a canonical workflow for cross-disciplinary tagging in the life sciences:
• Training: The categorisation was discussed during several telephone conferences by the group and most experts involved in the assessment participated in the development of the system. Nevertheless, there was no systematic training on the application of the system before the pilot study.
• Standard definitions: In the development of the categorisation system, existing terminologies and classifications were taken into consideration (5)(6)(7)(8)(9)(10)(11). However, standardised definitions and a glossary were not provided. Several of the experts identified this as a deficit. The pilot study helped to show where most misunderstandings and polysemic biases exist.
• Elimination of ambiguity: For some of the categories it was not totally clear how they should be applied to the resources. For example, category 8 (perspective) could be considered from the developer's perspective (the person who developed the resource) or from the user's perspective (the person at whom the resource is targeted). Similarly with category 6, a geographical scope could refer to the developers (e.g., produced in one country) or to the target users (e.g., global).
• Guiding the number of tags assigned for a category: The number of tags assigned to resources for a specific category was not uniform between experts. Ways of reducing these discrepancies (e.g. by putting a maximum on tag numbers) should be explored.
• Filling gaps: For several categories, important tags were seen as missing from the pre-specified list, for example, missing resource types (category 1), research fields (category 2) and specific topics (category 7). These gaps were considered in detail during the creation of version 2 of the tagging categories.
• Elimination of meaningless tags: For a considerable number of resources and most of the infrastructures, "not applicable", "not specified" or "not clear" was assigned to category 3 (research design). It seems that specific study types are only relevant for some research infrastructures (e.g., ECRIN, EATRIS, EMBRC).
• Simplifying research stages: It proved difficult to allocate specific stages in the data sharing life cycle (category 5) to a number of resources despite using a widely used classification of stages in the data sharing life cycle. As a consequence, more generic tags were suggested that cover broader phases. This finding is relevant for the discussion on CWFR.
• Simplifying data type tags: For category 4 (data type), specific combinations of data types were selected for a considerable number of resources, indicating that more generic terms could improve applicability of the categorisation system. It may be worth exploring the idea of a hierarchy of tags, because in some cases both a generic and a specific term could apply. In addition, some specific data types are only relevant for some research infrastructures, in particular those dealing with human data. Again, this point is of relevance for CWFR.
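The idea of a hierarchy of tags, together with an upper limit on the number of tags per category, could look roughly as follows. This is a speculative Python sketch and not part of the categorisation system: the parent-child pairs, the tag names and the cap of three are all invented for illustration.

```python
# Hypothetical hierarchy: each specific tag points to its more generic parent.
PARENT = {
    "genomic data": "omics data",
    "proteomic data": "omics data",
    "omics data": "human data",
    "imaging data": "human data",
}

MAX_TAGS_PER_CATEGORY = 3  # illustrative cap, not a value from the study


def with_ancestors(tag):
    """Return the tag followed by all of its more generic ancestors."""
    chain = [tag]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def matches(query_tag, assigned_tags):
    """True if an assigned tag equals the query tag or is a descendant of it."""
    return any(query_tag in with_ancestors(t) for t in assigned_tags)


def within_cap(assigned_tags):
    """Check the illustrative upper limit on tags per category."""
    return len(assigned_tags) <= MAX_TAGS_PER_CATEGORY


assigned = {"genomic data"}
print(matches("omics data", assigned))    # True: a generic query finds the specific tag
print(matches("imaging data", assigned))  # False
print(within_cap(assigned))               # True
```

With such a structure a tagger would only need to assign the most specific applicable term, while a toolbox user searching with a generic term would still retrieve the resource.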

Based on the analysis and comments of the experts, the categorisation system was recalibrated and simplified. Suggestions for additional tags were taken into consideration, but there was also a general request to keep things as simple as possible. In order to reduce the considerable interobserver misalignment between the experts, the definitions of the tags were clarified and the distinction between tags was improved. The value of the original 8 tagging dimensions was re-evaluated and it was concluded that two could be dropped. Where possible, tags were combined to keep the number of available options as low as possible. The result is a categorisation system with 55 tagging values spread over 6 dimensions, as compared to the original 93 tags spread over 8 dimensions (Figure 1 and supplementary material S3: Short form of the categorisation system (version 2)). Nevertheless, evaluation of the categorisation system is still ongoing, and improvements are under discussion for future versions.

Relevance of the pilot study for CWFR and FDO (FAIR Digital Objects) typology
• Data annotation workflows constitute a basic element for improving reuse (R1.3 of Wilkinson criteria) and a first step in building rich community approved metadata (F2 and F4) and FAIR vocabularies linked to other metadata (I2 and I3; (3)).
• The pilot study is relevant to proposals for cross-disciplinary methods in life sciences to define FDO typology that could be reused by the new FDO Semantics WG (FDO-SEM, FAIR Digital Objects Forum, fairdo.org). These typologies will be necessary to evaluate what should be incorporated within FAIR metadata (F2 and R1).
• CWFR includes a discussion of the role of generic workflows and whether modelling across disciplines is possible (18). Developing the categorisation system for the toolbox has demonstrated an approach to tackling this problem in the context of the life-sciences infrastructures. This tagging system may be considered as a first level of interoperability for vocabulary typology (I2). The pilot study has demonstrated, however, considerable differences between research infrastructures, which have to be taken into consideration in the discussion on future generic workflows.
• Harmonisation around how data collections are being described is required in the context of CWFR (18). On a high level and across all life-sciences infrastructures, this issue has been investigated in the categorisation system by incorporating the dimension "data type", covering various specific data types about or from living humans. Again, issues were observed that indicated the necessity for clear definitions of parameters and clear rules on how to apply them.
• As the CWFR initiative wants to discuss ways to reduce the large gap between technology and principles on the one hand and data practices in the labs on the other hand, the toolbox and its categorisation system constitute a practical example of how to improve interoperability between a large panel of (life sciences) disciplines. This exercise is based on real resources / digital objects, but as with other FAIRification processes it needs successive iterations and further community approval (19).
• The toolbox improves the discoverability (F in FAIR) of digital objects linked to sensitive data and thus will strengthen the provision of FAIRer digital objects, because systematic application of a categorisation methodology will yield metadata enrichment (for F4 and R1.3 compliance). This is a critical aspect for the handling of sensitive data in life sciences and to arrive at the necessary standard solutions and workflows to comply with rules and regulations. The discussion around CWFR and FDO should be informed that this type of data is likely to need specific workflows to handle it. The toolbox will become a major resource to support this work by referencing relevant resources (F4).
• Consideration should also be given to the tagging workflow itself, and how it might be better standardised. Elements of this tagging process are: selection of resources to be tagged, assigning taggers to resources, performing the tagging (under standardised conditions), quality control and approval of the tagging, and monitoring and supervision of the process. The pilot study demonstrated major variation in the tagging of resources when independent taggers assess the same resource (inter-observer variation). The example of BBMRI has shown that with adequate measures this inter-observer variation can be considerably reduced. Another source of variance may occur if a tagger assesses the same resource at different time points. In order to arrive at a valid and reliable tagging workflow, both inter- and intra-observer variation have to be taken into account and reduced. For example, the workflow of the tagging process should include an independent assessment by a second reviewer and a final consensus by both experts. An intercalibration tool for future taggers (with a training set) could be provided based on the first set of digital objects. Recurrent training could help to make the tagging workflow more replicable, although the tagging system itself should be mostly self-explanatory, as end users of the toolbox have to use the same system for querying purposes.
• In summary, the work is of major relevance for CWFR and FDO. An FDO-enabled application is able to identify the type of object (through reference to the object's type), directly operate on the object (through the object's location reference) or get more information about the object (through reference to the object's metadata records). A prerequisite is good, appropriate and sufficient metadata for an FDO, in our case characterising digital objects related to managing sensitive data. The categorisation system developed in the pilot study could evolve into a "methodological resources in life sciences" FDO type and could thus be of major help by linking services to digital objects. So far, FDO types are not well described in the life sciences and work has only been started in the FDO forum.
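One possible reading of the tagging workflow discussed in the list above (independent assessment by two taggers, followed by a consensus step) is sketched below in Python. The function names and tag values are hypothetical, and the rule that only disputed tags are taken to the consensus step is an assumption made for illustration rather than a procedure defined in the pilot study.

```python
def compare_taggings(tags_first, tags_second):
    """Split two independent taggings into agreed and disputed tags."""
    return {
        "agreed": tags_first & tags_second,    # assigned by both taggers
        "disputed": tags_first ^ tags_second,  # assigned by only one tagger
    }


def final_tags(comparison, consensus_decisions):
    """Combine agreed tags with disputed tags accepted during consensus.

    `consensus_decisions` maps each disputed tag to True (keep) or False
    (drop). A disputed tag without a decision raises an error, so that no
    disagreement is silently skipped.
    """
    undecided = comparison["disputed"] - set(consensus_decisions)
    if undecided:
        raise ValueError(f"no consensus recorded for: {sorted(undecided)}")
    kept = {t for t in comparison["disputed"] if consensus_decisions[t]}
    return comparison["agreed"] | kept


# Hypothetical taggings of one resource by two independent taggers.
first = {"guideline", "informed consent", "europe"}
second = {"guideline", "europe", "GDPR"}

comparison = compare_taggings(first, second)
decisions = {"informed consent": True, "GDPR": False}  # made-up consensus outcome

print(sorted(final_tags(comparison, decisions)))
# ['europe', 'guideline', 'informed consent']
```

Recording the disputed tags separately would also give a simple quality-control indicator for monitoring the tagging process over time.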

CONCLUSIONS AND FUTURE WORK
Firstly, we demonstrated that an iterative process is necessary to align the variation between experts from different RIs and thus arrive at a relatively stable and generally applicable categorisation system across life sciences RIs. As the categorisation system is part of an essential minimal metadata set in this interdisciplinary study, this is of relevance for a "resources in life sciences" FDO and may also be significant for other FDO types (in particular a more generic "FDO resources" type), such as FDOs describing workflows at a canonical level, particularly in the CWFR. In this context an important way to promote FAIRness is to request that all canonical components should support the concept of FDO (https://fairdo.org/wg/fdo-cwfr/). Secondly, the results from our study show that it will not be simple to define widely agreed data types across research domains and that pragmatic approaches are needed, which may carry the risk of a proliferation of data types. Thirdly, the pilot study demonstrated that highly sophisticated algorithms will be required to develop reliable automatic methods to interpret the categories relevant for sensitive data workflows (and indeed in the CWFR). It may be difficult, even impossible, to find purely machine-actionable algorithms for categorising digital objects dealing with managing sensitive data.

Avoiding polysemic bias is one of the more difficult issues to resolve and is necessary across disciplines. The results from this pilot study have been used to tackle this issue by revising the categorisation system and providing an updated version 2 (Figure 1, supplementary material S3). This version contains definitions for each term of the categorisation system, though these need to be further evaluated and approved by all life-sciences infrastructures. Currently, the 110 resources from the pilot study are being re-tagged with the new version of the categorisation system. This initial set of digital objects will be used as the content for the first demonstrator of the toolbox. This is a starting point, but the categorisation system will need iterative improvement. The pilot study on tagging inter-calibration could serve as a model for a future CWFR cross-disciplinary workflow tagging system.

Lygature in The Netherlands dedicated to multi-stakeholder biomedical data infrastructure programs, such as TraIT (Translational Research IT), and the overarching national program for health data sharing, Health-RI. Internationally, Jan-Willem is involved in IMI/H2020 projects such as BIGPICTURE, FAIRplus, and Immune-Image, and participating in EOSC-Life on behalf of EATRIS. Before joining Lygature he held various research informatics positions in pharmaceutical industry. ORCID: 0000-0003-0327-638X
Steve Canham has twenty years' experience working with clinical research systems and data, after previous roles in healthcare and teaching. He now works for ECRIN on several data related H2020 projects, and has a particular interest in managing metadata to promote FAIRness of data. ORCID: 0000-0002-4409-8834

Maria Luisa Chiusano has a degree in Biology (Ph.D.) and is Professor of Molecular Biology (University Federico II of Naples) and Associated Scientist of the Stazione Zoologica (Dohrn). She is an expert in omics and biodata management, responsible for biocomputing and bioinformatics services in a joint effort between the two institutes. She is co-founder of the Italian Bioinformatics Society, councilor of the BIG DATA in HEALTH Italian Society, member of working groups for data sharing within EMBRC, EMSO-Italy, and ELIXIR-Italy, and member of the Steering Committee in national and international (including EU-funded) projects. ORCID: 0000-0002-6296-7132
Walter Dastrù is a research fellow at the University of Torino. He has more than 20 years' experience in MR contrast agents, MRI acquisition and processing of MRI images. In particular he contributed to developing and maintaining software for the analysis of DCE-MRI data.
She is experienced in the fields of cell biology, nanotechnology, toxicology, genetics, genomics, biomedical imaging and clinical research. Currently, she is coordinating ECRIN's data projects portfolio focusing on developing tools, good practices and guidelines that serve the European clinical research community. ORCID: 0000-0002-4221-7254