The rapid evolution of Large Language Models (LLMs) highlights the necessity for ethical considerations and data integrity in AI development, particularly emphasizing the role of FAIR (Findable, Accessible, Interoperable, Reusable) data principles. While these principles are crucial for ethical data stewardship, their specific application to LLM training data remains under-explored. This research gap is the focus of our study, which begins with an examination of existing literature to underline the importance of FAIR principles in managing data for LLM training. Building upon this, we propose a novel framework designed to integrate FAIR principles into the LLM development lifecycle. A key contribution of our work is a comprehensive checklist intended to guide researchers and developers in applying FAIR data principles consistently across the model development process. The utility and effectiveness of our framework are validated through a case study on creating a FAIR-compliant dataset aimed at detecting and mitigating biases in LLMs. We present this framework to the community as a tool to foster the creation of technologically advanced, ethically grounded, and socially responsible AI models.

In the era of artificial intelligence (AI), Language Models (LMs) play an important role in advancing diverse AI applications [1]. From virtual assistants to content generation, LMs have become indispensable in shaping both the future trajectory of academic research and their widespread practical applications. The impact of this transformation is further amplified by the evolution of Large Language Models (LLMs) [2], such as OpenAI's GPT series, Gemini, LLaMa, and Falcon. As of January 2024, LLM development has attracted $18.2 billion in funding and generated $2.1 billion in revenue [3].

The rapid success of these LLMs highlights the importance of diverse data for broadening their applicability across different domains. This idea is well supported in research [4, 5], which affirms the importance of high-quality data in training these models. However, training these large models on data from diverse sources raises complex challenges for ethical and responsible data practice in their implementation [5, 6].

The FAIR data principles, which stand for Findable, Accessible, Interoperable, and Reusable [7, 8], were initially established to improve the stewardship of scientific data. These principles can be applied to any model development lifecycle [9, 10] and have become increasingly recognized in responsible AI development [11]. Recently, the relevance of FAIR principles has been particularly highlighted in generative AI due to ethical challenges such as bias, privacy concerns, and the potential misuse of AI-generated content [12, 13]. This underlines the growing importance of ensuring that the data used to build LLMs is findable, accessible, interoperable, and reusable, adhering to ethical standards.

Seminal works in data science and management [7, 14-17] have explored FAIR principles in various domains. Their application in the training and development of LLMs, however, is a developing area of research. While recent advances in LLM studies [5, 6] have focused on aligning LLMs with ethical standards, including human values, the direct incorporation of FAIR principles in LLM training is less explicit. Challenges specific to LLMs, such as addressing data biases, toxicity, and stereotypes, and the need for model explainability and interpretability [18], further highlight the need for ethical and balanced training approaches. This emphasis on data ethics is in line with broader trends in AI ethics and data science [19, 20], which stress the necessity of integrating these principles into LLM development.

In this work, we shed some light on FAIR data principles and propose a model development lifecycle for LLM training that incorporates FAIR principles at each stage. One of the contributions of this work is the development of a FAIR-compliant dataset, designed to include a wide array of narratives from diverse sources. To emphasize the significance of FAIR data principles in the context of preparing this dataset for LLM training, it is crucial to understand that while this process may not directly equate to the complete “FAIRification” of LLMs, it represents a critical and foundational step towards it. The application of FAIR principles ensures that the data feeding into LLMs is of high quality and organized in a way that maximizes its utility, thereby enhancing the model's performance and reliability.

The central contribution of our work, however, extends beyond simple data collection and preparation for LLM training. We have placed a strong emphasis on rigorously aligning the LLM training dataset with FAIR principles. We acknowledge that adherence to FAIR principles (even at the strictest level) may not equate to absolute ethical compliance; however, it represents a crucial step in that direction. Our research establishes a foundational framework for further advanced studies in this field. The primary contributions of our work are:

1.1 Contributions

  1. An exploration of FAIR data principles in general AI research, including the provision of a comprehensive checklist for researchers and developers.

  2. Introduction of an innovative framework that integrates FAIR data principles throughout the LLM training lifecycle, ensuring ethical and effective application in various AI contexts.

  3. Demonstration of the practical benefits of a FAIR-compliant dataset through a case study that focuses on identifying and mitigating biases prior to training LLMs. Because the topic of bias is broad, our case study specifically addresses linguistic biases targeting protected groups.

The data used for pre-training LLMs is predominantly unstructured. However, as demonstrated in this case study, if it is accurately labeled and formatted—interoperability being one such key principle—it can then be leveraged for fine-tuning a wide array of downstream tasks. Furthermore, adherence to FAIR principles not only ensures data is handled correctly but also significantly boosts the credibility of the models.

2.1 Literature Selection Criteria

This study investigates articles published within the past five years, focusing on the rise of generative AI models like those in the GPT series. The study specifically targets English-language articles from leading AI and ML journals and conferences. We explored several databases and proceedings, covering high-quality journals from Elsevier, Springer, the Nature portfolio, and IEEE Transactions, and key conferences such as NeurIPS, ICML, and ACL. Recognizing the fast-paced evolution of LLM research, we included relevant preprints and seminal works for a comprehensive perspective. This search returned 135 papers, out of which we carefully chose about 75 that directly pertained to FAIR data principles in the context of training models. We noted a scarcity of work combining LLMs with FAIR data principles, a gap that our study aims to address. The search query for this study is:

(Findability OR data discovery OR metadata standards OR persistent identifiers) AND (Accessibility OR data access OR data sharing policies OR authentication and authorization mechanisms) AND (Interoperability OR data integration OR standardized data formats OR cross-domain data exchange) AND (Reusability OR data documentation OR data quality assurance OR long-term data preservation) AND (large language models OR LLMs OR AI models OR machine learning models) AND (training data management OR ethical data sourcing OR bias in AI datasets OR responsible data use in AI) AND (ethical considerations in AI OR AI ethics OR responsible AI OR ethical AI development) PUBLISHED FROM 2018 TO 2023 IN English IN Journals and Conferences (e.g., JAIR, JMLR, IEEE Transactions, NeurIPS, ICLR, ACL, Springer, Nature portfolio, Elsevier).

2.2 Comparative Analysis of Related Work

We have conducted a meta-review of key articles on FAIR data principles, categorizing them by domain and their relevance to LLMs. We also highlight how our work differs from these previous studies. The findings are summarized in Table 1.

Table 1.

Summary of Review Papers on FAIR Data Principles.

Reference | Summary | Domain Focus | LLM Relevance
[21] | Discusses data governance for FAIR principles in health data management. | Health Data | Not Applicable
[22] | Reviews initiatives for applying FAIR principles in managing health data. | Health Data | Not Applicable
[23] | Examines FAIR data practices' role in global mental health research. | Mental Health Research | Not Applicable
[24] | Analyzes FAIR principles in healthcare data, emphasizing cybersecurity. | Health Data | Partially (InstructGPT)
[25] | Addresses challenges in FAIR application and legal aspects in European rare disease databases for ML technologies. | Health Data | Not Applicable
[26] | Elaborates on FAIR principles in precision oncology research. | Health Data | Not Applicable
[16] | Offers a broad perspective on FAIR principles' implementation across various fields. | Multi-Domain | Not Applicable
Our Study | Evaluates the application of FAIR data principles in the creation and training of datasets for LLMs, encompassing a range of models including the BERT and GPT families as well as LLaMa-2-7b, highlighting diverse approaches in LLM development. | News Data | Yes (BERT and GPT families, LLaMa-2-7b)

Table 1 highlights a focus on health data in existing research on FAIR principles. Only one study incorporates an LLM application (a GPT-based model). Our work, however, stands out for its application of FAIR data principles to a range of LLMs, including pretrained LMs, the GPT series, and LLaMa-2 models. This approach broadens the scope and generalizability of our research.

The FAIR data principles significantly influence data management practices across various fields. These include physics [27], health [28], environmental science [13], pharmaceuticals [9], chemistry [29], computer science [30], research and development [31], and clinical studies [26]. These principles ensure data is well-organized, accessible, and reusable in different research domains and applications. In AI and Natural Language Processing (NLP), the implementation of interdisciplinary strategies [32] and structured approaches [33] is crucial to uphold ethical data practices [17] and ensure fairness in AI models [34]. Given the constraints of space, we present a concise overview of each FAIR principle, as depicted in Figure 1. Additionally, a compliance checklist is provided in Table 2.

Figure 1.

FAIR Data Principles: Key Aspects of Findability, Accessibility, Interoperability, and Reusability in Data Management.

Table 2.

Comprehensive Checklist for Ensuring FAIR Principles Compliance, Along with Associated Tools and Practices.

Principle: Findability
Features:
  F1: Rich and descriptive metadata
  F2: Standardized data indexing
  F3: Documented data sources
  F4: Advanced search functionalities
  F5: Persistent identifiers like DOIs
Compliance list:
  F1: Metadata includes titles, authors, abstracts, keywords, and affiliations.
  F2: Use of standardized taxonomy and ontology for indexing.
  F3: Clear documentation of data sources and collection methodologies.
  F4: Implementation of advanced search tools and interfaces.
  F5: Assignment of persistent identifiers such as DOIs.
Tools and Practices:
  Metadata Management: Apache Atlas, Collibra, ORCID
  Persistent Identifiers: Crossref for DOIs
  Data Indexing: Elasticsearch, Apache Solr, DSpace
  Search Interfaces: Algolia, Apache Lucene
  Data Repositories: NCBI, RE3data, CKAN, Dataverse, Zenodo, Figshare, EPrints, ResearchGate, Academia.edu

Principle: Accessibility
Features:
  A1: Clear data access protocols
  A2: Long-term data preservation
  A3: Open access platforms
  A4: Standardized APIs for data access
  A5: Data availability in accessible formats
Compliance list:
  A1: Detailed access instructions and authentication processes.
  A2: Use of reliable digital preservation services like CLOCKSS or Portico.
  A3: Data deposited in open repositories such as Figshare or Zenodo.
  A4: APIs conform to standards such as OpenAPI for ease of use.
  A5: Data provided in multiple formats (e.g., CSV, JSON, XML) to ensure usability.
Tools and Practices:
  API Tools: OpenAPI, GraphQL, RESTful interfaces
  Data Preservation: Archivematica, LOCKSS, Zenodo, Figshare
  Cloud Storage: Amazon S3, Google Cloud Storage, Microsoft Azure
  Ethical Access: OneTrust, TrustArc

Principle: Interoperability
Features:
  I1: Standard data formats
  I2: Common communication protocols
  I3: Data exchange standards
  I4: Tools for data transformation and mapping
Compliance list:
  I1: Data conforms to community-recognized standards (e.g., MIAME, Ecological Metadata Language).
  I2: Support for protocols such as OAI-PMH for metadata harvesting.
  I3: Use of frameworks like Schema.org for structured data.
  I4: Availability of services like XSLT or OpenRefine for data conversion and mapping.
Tools and Practices:
  Data Formats: JSON, XML, CSV, DICOM, GenBank
  Protocols: HTTP, SOAP, REST, gRPC
  Data Standards: RDF, HL7 FHIR, ISO/IEC standards
  Transformation Tools: XSLT, Talend, Informatica, Apache NiFi
  Ontology Systems: OWL, SPARQL, XOD
  Specialized Databases: IEDB

Principle: Reusability
Features:
  R1: Detailed metadata for context
  R2: Data curated for future use
  R3: Adherence to ethical standards
  R4: Licensing frameworks
  R5: Consideration of societal impacts
Compliance list:
  R1: Comprehensive metadata including experimental conditions, methodologies, and provenance.
  R2: Curating data with clear versioning and update records.
  R3: Compliance with GDPR and other privacy regulations.
  R4: Use of licenses like Creative Commons to clarify user rights.
  R5: Assessment and documentation of data impact on society and potential biases.
Tools and Practices:
  Metadata Standards: Dublin Core, DataCite, schema.org
  Data Curation: CKAN, DSpace, Omeka
  Ethical Frameworks: Responsible AI, OpenAI Ethics Guidelines, AI4ALL
  Licensing Tools: Creative Commons
  Data Provenance Tools: PROV-DM provenance tracking

3.1 Findability

The principle of Findability is important for ensuring that data and resources are not only discoverable but also readily accessible [35]. It involves the establishment of detailed metadata, the implementation of persistent identifiers, and the promotion of effective data indexing and search strategies. This principle significantly enhances the discoverability of data by both humans and machines, facilitating a smoother research process. Recent contributions to the enhancement of data findability within the FAIR framework include research on strategies for the FAIRification [33] of data to improve its findability, and studies [36] investigating models for the FAIR digital object framework with a focus on improving the findability of digital objects. Discussions [32] on the application of FAIR principles specifically within AI datasets highlight the growing recognition of these principles' importance. Additionally, considerations on the trustworthiness of data repositories underline findability's critical role in ensuring data quality and reliability in these environments [37].

3.2 Accessibility

Accessibility, as articulated by the FAIR data principles, emphasizes the straightforwardness of obtaining and utilizing data upon its discovery [8]. This aspect is integral to the FAIR framework and encompasses strategies for long-term data preservation, the establishment of ethical access protocols, and the assurance of data retrievability and usability across time. Key discussions [38] on the role of trust in data repositories, an essential component of accessibility, offer a detailed exploration of accessibility within the context of the FAIR principles. The application of FAIR principles to research software is also discussed in the literature [39], with a specific lens on enhancing accessibility. The support of the Fedora platform for FAIR data principles, especially in terms of accessibility, is examined in a related work [31]. Furthermore, the challenges and opportunities encountered in the implementation of FAIR principles, including accessibility, within the Brazilian data science landscape are also discussed [40]. Collectively, these contributions highlight the critical importance of embedding accessibility considerations into research software and platform support, and of addressing implementation hurdles.

3.3 Interoperability

Interoperability, a core aspect of the FAIR data principles, denotes the capability of different data systems to operate cohesively [41]. This principle necessitates the adoption of standardized data formats and protocols to facilitate straightforward data exchange and integration across heterogeneous systems. Discussions on the Immune Epitope Database (IEDB) emphasize its commitment to interoperability through the adoption of FAIR principles [42]. The development and support of interoperability in ontology creation, as facilitated by the eXtensible ontology development (XOD) principles and tools, are detailed in [43]. Furthermore, the examination of challenges and strategies related to the implementation of FAIR data principles, with a particular focus on interoperability in data science, is undertaken in [40]. Additionally, a framework along with metrics for assessing the FAIRness of data, emphasizing interoperability, are delineated in [44]. The objectives and efforts of the GO FAIR initiative, aimed at promoting the widespread adoption of FAIR principles, including interoperability, are elaborated in [45]. Collectively, these contributions highlight the role of interoperability in promoting responsible and efficient data management across diverse scientific and technological domains.

3.4 Reusability

Reusability, as one of the foundational aspects of the FAIR data principles, highlights the necessity for data to be stored and documented in a manner that facilitates future retrieval and reuse [46]. This principle is supported by the creation of comprehensive metadata, consideration of legal and ethical frameworks, and the assessment of potential societal impacts. Research focusing on a structured methodology for planning the FAIRification of data, particularly with reusability in mind, is presented in [33]. Furthermore, the development of a model for digital objects adhering to FAIR principles that prioritizes reusability is detailed in [36]. The application of FAIR principles to dataset reusability is discussed in [32]. Additionally, the case for workflows that prioritize reusability from the outset is argued in [47]. An exploration of FAIR principles, including detailed insights on reusability, is presented in [16]. Collectively, these contributions advocate for a data management approach that not only preserves data for future use but also ensures it remains ethically and effectively employable for training.

The evolution of LLMs introduces a complex spectrum of data management challenges, necessitating advanced strategies for efficient data organization and accessibility due to the engagement with extensive and intricate datasets [2]. The importance of high-quality data cannot be overstated; representative, well-curated datasets are essential for developing models that are ethically sound and broadly applicable [4]. Addressing privacy and ethical concerns is essential, requiring rigorous data governance to adhere to ethical norms and protect individual privacy [5]. Furthermore, the precision of data annotation and labeling is important for ensuring model reliability, calling for standardized and transparent practices [48, 49]. Balancing data accessibility with the protection of proprietary information is crucial, a task that must be aligned with legal and ethical standards [20]. Moreover, compliance with data protection laws is critical for legal conformity and for upholding the ethical integrity of AI technologies, which is fundamental to sustaining public trust [19]. A structured overview of these data management challenges and requirements in LLM development is provided in Figure 2, with a summary given below.

Figure 2.

Data Management Challenges in Large Language Models.


Understanding the Context: Improving algorithms and reasoning is essential for enhancing contextual awareness in LLMs [50]. The capability of these models to understand and interpret complex contexts significantly contributes to their practical utility. Developing algorithms that grasp linguistic nuances, cultural contexts, and the implications of language use is increasingly necessary.

Accuracy and Reliability: The tendency of LLMs to sometimes produce incorrect information highlights the need for effective fact-checking mechanisms [51]. To increase the reliability of these models, it is vital to implement verification algorithms and data validation techniques that ensure their outputs are accurate and trustworthy.

Ethical and Fair Use: Mitigating biases in LLM outputs is a significant concern [5, 11, 52]. Developing algorithms for detecting and correcting biases in training data and model outputs is essential for fair use [53]. Efforts must focus on incorporating diverse perspectives to ensure ethical use and fairness in model responses.

Interactivity and Personalization: The general nature of responses from current LLMs indicates the importance of developing adaptive learning algorithms [54]. Such algorithms should learn from user interactions, preferences, and feedback to provide personalized responses [55]. The aim is to create models that tailor their responses to individual user needs, improving personalization and user experience.

Language and Cultural Sensitivity: Addressing the underrepresentation of low-resource languages is crucial for increasing the inclusivity of LLMs [6]. It is important to enhance language model diversity and cultural dataset representation [56]. Expanding datasets to include more languages and ensuring models respect cultural contexts are key steps toward global inclusivity.

Efficiency and Scalability: The significant computational demands of LLM development present a notable challenge [2]. Developing optimization strategies to reduce computational requirements without compromising performance is necessary [57]. This may involve new model architectures, data processing methods, and hardware efficiencies, facilitating more scalable LLM deployment.

Research and Development Directions: Engaging in collaborations with academic institutions, continuous model training, and adopting user-centric design philosophies are crucial for the ongoing enhancement of LLMs. These efforts are aimed at ensuring models meet the diverse and changing needs of their users.

Mapping FAIR principles to Data Management Challenges in LLMs

The challenges in managing data for LLMs are complex, yet the FAIR data principles offer targeted solutions that can alleviate some of these issues. For instance, Findability enhances the process of identifying pertinent data within vast datasets through the use of enriched metadata and persistent identifiers, streamlining data retrieval and analysis. Interoperability facilitates the seamless integration of varying data formats and systems, important for LLM training. This principle also aids in developing algorithms that are more contextually aware by allowing for the combination of different data types and sources, thereby leading to more nuanced and accurate model outputs.

Accessibility ensures data is readily available and access is controlled appropriately, allowing for the responsible sharing of data. This principle highlights the fine line between promoting open access to data to foster innovation and collaboration, and the necessity to safeguard sensitive information and intellectual property. Reusability amplifies the longevity and utility of data beyond its original application. By adhering to legal and regulatory standards, reusability ensures that data remains applicable, reliable, and ethically sound for future endeavors. A detailed mapping of these data management challenges in LLMs to the FAIR data principles is provided in S2 Appendix B.

Rationale for Integrating FAIR Data Principles in LLM Development

Our research is dedicated to enhancing the training of LLMs by embedding FAIR data principles directly into their training methodologies. The main objective is to develop training datasets for LLMs that are designed to comply with FAIR principles. To realize this, we have devised a framework that interweaves FAIR principles throughout the entire model development lifecycle of LLMs, as depicted in Figure 3. In the following sections, we will elaborate on each phase of this proposed LLM lifecycle, connecting each step specifically to our case study which concentrates on the creation of a FAIR-compliant dataset.

Figure 3.

FAIR principles integrated into the LLM lifecycle.


Case Study: Developing a FAIR-Compliant Dataset

To address the risk of LLMs perpetuating societal biases, our case study focuses on developing a dataset with a conscientious design that proactively identifies biases within the data prior to the training of these models. Adopting FAIR principles in preparing data for LLM training is essential, though it does not amount to a full “FAIRification” of LLMs. This step is crucial for enhancing data quality and model performance, paving the way towards more efficient and ethical AI development. The dataset and model card are available in our BiasScan collection (see the data availability statement).

This case study identifies “bias” in an NLP dataset as a linguistic tendency leading to the unfair representation of specific groups, manifesting as linguistic bias, stereotypes, toxicity, or misinformation [58-62]. Such biases often stem from data collection, processing, and usage methods, potentially causing LLMs to replicate or amplify these biases [63]. Our analysis covers various bias dimensions, illustrated in Figure 4, encompassing ageism, occupational jargon, and political rhetoric, among others. This dataset exemplifies how to construct diverse training datasets for LLMs in adherence to FAIR principles.

Figure 4.

Biases across Multiple Dimensions Explored in this Study.


Data Collection and Curation for LLM Development. FAIR Principles Achieved: Findability, Accessibility, Interoperability, and Reusability.

In line with the FAIR data principles, our dataset is structured to enhance its Findability, featuring comprehensive metadata to facilitate easy discovery. This dataset, sourced from a variety of channels such as news feeds and websites between January and May 2023, encompasses over 50,000 filtered entries. For data curation, we utilized various feeds and hashtags, including #MediaBias, #SocialJustice, #GenderEquality, #RacialInjustice, #CulturalDiversity, #AgeismAwareness, #ReligiousTolerance, and #EconomicDisparity, to ensure a wide representation of social issues. This diversity of news articles not only broadens accessibility but also enriches the dataset's relevance to current social discourses.

The detailed metadata includes crucial information such as dataset title, description, authors, date of creation, version, and keywords that reflect the dataset's scope and purpose. These keywords (e.g., LLMs, Training, Biases, News Media, NLP) are strategically chosen to enhance the dataset's retrieval efficiency across various research portals. Additionally, the metadata specifies the data type, whether textual, numerical, or otherwise, further aligning with the Interoperability principle by facilitating data integration across different systems.
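To make this concrete, the sketch below shows what such a metadata record could look like in machine-readable form; the field names and values are illustrative placeholders, not the dataset's actual schema.

```python
import json

# A minimal, hypothetical metadata record illustrating the Findability
# fields discussed above; keys follow common metadata conventions rather
# than the dataset's actual schema.
metadata = {
    "title": "FAIR-Compliant News Dataset for Bias Detection",
    "description": "Filtered news entries annotated for bias, toxicity, "
                   "stereotyping, and harm.",
    "authors": ["<author names>"],
    "date_created": "2023-05-31",
    "version": "1.0",
    "keywords": ["LLMs", "Training", "Biases", "News Media", "NLP"],
    "data_type": "textual",
    "license": "CC BY-NC 4.0",
}

# Serializing to JSON keeps the record machine-readable (Interoperability).
print(json.dumps(metadata, indent=2))
```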

We apply the Gunning Fog readability test to determine the level of education required to comprehend our dataset's texts. As illustrated in Figure 5, the distribution of Gunning Fog Index scores in our dataset is approximately normal, with a mean score of 7.79. This indicates that the majority of our texts are suitable for readers with at least an eighth-grade education level. Such readability analysis is instrumental in aligning our dataset with the FAIR principle of Accessibility, ensuring that the content is comprehensible to the intended audience and thus enhancing the usability of the dataset in training LLMs.
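For illustration, the following minimal sketch computes Gunning Fog scores with the third-party textstat package (one of several libraries implementing this index; we do not claim it is the exact tool used in the study), on placeholder texts:

```python
# Readability sketch using textstat (pip install textstat); the sample
# texts are placeholders, not entries from the actual dataset.
import textstat

texts = [
    "The council approved the new housing budget after a lengthy debate.",
    "Quantitative easing alters the term structure of interest rates.",
]

scores = [textstat.gunning_fog(t) for t in texts]
mean_score = sum(scores) / len(scores)
print(f"Gunning Fog scores: {scores}, mean: {mean_score:.2f}")
```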

Figure 5.

Histogram of the Gunning Fog Index on the FAIR-Compliant Dataset. The x-axis denotes the Gunning Fog Index scores, reflecting text complexity, and the y-axis represents the number of samples with each score.


Annotations: The process of annotating labels and debiasing text initially utilized GPT-3.5 for an automated preliminary screening, efficiently identifying potential bias, toxicity, stereotyping, and harm. Recent works [64] indicate that AI can excel in labeling tasks, outperforming traditional crowd-sourced methods. In our case, we add an extra layer of human review to combine quality with speed. This facilitated the early detection of content requiring human review. Following this, a team of 15 experts and students across various disciplines crafted comprehensive annotation guidelines emphasizing accuracy and sensitivity. These guidelines assist annotators in conducting a thorough review to mitigate any biases, ensuring content fairness and respectfulness.
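A preliminary screening pass of this kind could be implemented roughly as follows; this is a sketch using the OpenAI Python SDK, with a hypothetical prompt and label set rather than the study's actual prompts:

```python
# Illustrative sketch of an automated pre-screening pass; the prompt
# wording and label set are hypothetical, not the study's actual prompts.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def prescreen(text: str) -> str:
    """Ask the model to flag potential issues for human review."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Label the text as one of: bias, toxicity, "
                        "stereotype, harm, or none. Reply with the label only."},
            {"role": "user", "content": text},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

flag = prescreen("Example sentence to screen.")
if flag != "none":
    print(f"Flagged for human review: {flag}")
```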

Annotators engage in a nuanced review process, identifying and revising instances of bias, toxicity, stereotyping, and harm based on a wide range of attributes, guided by principles of neutrality, respect, and inclusivity. This detailed annotation effort is supported by a commitment to ongoing feedback, expert reviews, and inter-annotator agreement (IAA) checks, ensuring high-quality annotations evidenced by a Cohen's Kappa score above 0.75. Examples of IAA agreements across main bias dimensions are provided in S3 Appendix C, illustrating the robustness of our annotation methodology.
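As an illustration, an IAA check of this kind can be computed with scikit-learn's cohen_kappa_score; the label sequences below are toy values, not our annotations:

```python
# Minimal inter-annotator agreement sketch using scikit-learn; the two
# label sequences are toy data, not the study's annotations.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["bias", "none", "toxic", "bias", "none", "bias"]
annotator_b = ["bias", "none", "toxic", "none", "none", "bias"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # scores above 0.75 would meet the bar
```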

The robustness of our dataset is established through a dual-stage quality assessment: automated analysis for initial screening of 10,000 statistically significant entries, and expert review of 500 stratified random samples for deeper insights. This method ensures the dataset's accuracy, completeness, and consistency, as detailed in Table 3.

Table 3.

Data Quality Analysis Summary.

Aspect | Automated Analysis | Expert Review
Accuracy | 96.5% | Contextually accurate
Completeness | 93.0% | High with occasional updates
Consistency | 98.0% | Uniform across samples

For Interoperability, the data is formatted for seamless integration into various ML tasks, such as binary and multilabel classification, question answering (QA), and debiased language generation (debiasing), as detailed in Table 4. This enables researchers to use the data in diverse analytical contexts, facilitating cross-domain research and development. The dataset schema and examples of the dataset formats are given in S4 Appendix D.

Table 4.

Specialized Data Formats for Interoperability.

Data Format | Model Type
Classification Format | Binary Classifier
CoNLL Format | Multi-label Token Classification
SQuAD Format | Question Answering System
Counterfactual Formatting | Debiasing Model
Sentiment Analysis Format | Sentiment Classifier
Toxicity Classification Format | Toxicity Classifier

Furthermore, we have stored our dataset in repositories such as Hugging Face, Zenodo, and Figshare, which not only adhere to metadata standards but also ensure its long-term Accessibility. Overall, our data collection and curation strategy aligns with the FAIR principles and enhances the practical utility of our dataset for LLM development.
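To illustrate the Interoperability formats listed in Table 4, the sketch below emits a single annotated record in two task-specific formats; the field names and record content are hypothetical, not the dataset's actual schema:

```python
# Hypothetical sketch of emitting one annotated record in two of the
# task-specific formats from Table 4; fields are illustrative only.
record = {
    "text": "Older employees struggle to keep up with new technology.",
    "label": "biased",
    "bias_span": "Older employees struggle",
    "debiased_text": "Some employees may need support adopting new technology.",
}

# Classification format: (text, label) pairs for a binary bias classifier.
classification_example = {"text": record["text"], "label": record["label"]}

# SQuAD-style format: question answering over the biased span.
squad_example = {
    "context": record["text"],
    "question": "Which phrase expresses a bias?",
    "answers": {
        "text": [record["bias_span"]],
        "answer_start": [record["text"].find(record["bias_span"])],
    },
}

print(classification_example)
print(squad_example)
```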

Model Training and Algorithm Development. FAIR Principles Achieved: Interoperability and Reusability.

The model training and algorithm development phase is crucial for adhering to FAIR principles, particularly Interoperability and Reusability. We strategically deploy various LLMs, fine-tuned on a statistically significant sample of 10,000 entries from our main dataset, for tasks ranging from sentiment analysis and QA to debiasing (language generation), and develop them with a modular design for enhanced Reusability across projects.

Interoperability is achieved by using common frameworks like TensorFlow and PyTorch and standardizing data formats for inputs and outputs. Every task performed by LLMs, such as classification and debiasing, is executed following the preparation of specific training formats tailored to each task. This ensures seamless integration of our models with various systems and datasets. We further enhance this with the inclusion of model cards for each LLM, providing essential information such as model purpose, performance, and usage guidelines, thus aiding in understanding and adoption.
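A minimal fine-tuning sketch in the spirit of this setup, using the Hugging Face transformers library, might look as follows; the base model, placeholder dataset, and hyperparameters are illustrative assumptions, not the study's settings:

```python
# Fine-tuning sketch with Hugging Face transformers; the placeholder
# dataset (IMDB) and hyperparameters are illustrative, not the study's.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name,
                                                           num_labels=2)

# Placeholder dataset with "text" and "label" columns; swap in the
# FAIR-compliant dataset's classification format here.
dataset = load_dataset("imdb", split="train[:1%]")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

tokenized = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bias-classifier",
                           num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=tokenized,
)
trainer.train()
```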

We prepare API documentation that provides details for every aspect of training and development, including parameter settings and algorithm modifications. This transparency facilitates scientific validation and progress [16]. We also foster community collaboration by sharing our findings, models, and tools.

Model Evaluation and Validation. FAIR Principles Achieved: Reusability and Accessibility.

In our evaluation and validation phase, we prioritize the FAIR principles of Reusability and Accessibility to ensure that our models and findings can be widely utilized and understood. Transparency in reporting and the provision of accessible documentation are key aspects of our methodology. All results, including bias analysis and model benchmarking, are made publicly available, allowing for comprehensive observation and application by the wider research community. An example of this approach is presented in our detailed bias analysis (Figure 6) and the benchmarking table (Table 5), which illustrate our commitment to these principles.
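For reference, benchmarking metrics of the kind reported in Table 5 can be computed with scikit-learn as sketched below; the predictions are toy values, not our models' outputs:

```python
# Toy sketch of computing the metrics reported in Table 5; the labels
# and predictions are illustrative values, not actual model outputs.
from sklearn.metrics import accuracy_score, f1_score, precision_score

y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 1]

print(f"Accuracy:  {accuracy_score(y_true, y_pred):.0%}")
print(f"Precision: {precision_score(y_true, y_pred):.0%}")
print(f"F1-Score:  {f1_score(y_true, y_pred):.0%}")
```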

Figure 6.

Heatmap Visualization: the prevalence and intensity of different types of biases, such as ageism, gender, and political, across various classifications like bias, non-biased, toxic, and sentiment within a dataset.

Table 5.

Comprehensive Analysis of Language Model Performance and Impact of Debiasing: The first part of the table displays the performance metrics of various language models, with higher scores indicating better performance. The second part of the table examines the impact of debiasing, where lower scores post-debiasing are preferable.

Model (Type) | Accuracy | Precision | F1-Score
Binary Bias Classifier (BERT-large-uncased) | 92% | 89% | 90%
NER Model (RoBERTa-large-uncased) | 95% | 93% | 94%
QA System (BERT-base-uncased) | 88% | 87% | 86%
Sentiment Classifier (BERT-large-uncased) | 90% | 91% | 90%
Toxicity Classifier (BERT-large-uncased) | 91% | 90% | 91%
LLaMa2 (7b-chat-hf) | 93% | 94% | 93%

Metric | Pre-Debiasing | Post-Debiasing
Toxicity Score | 75% | 24%
Bias Classification Score | 80% | 55%
Sentiment Score | -0.3 | 0.2

In Table 5, we present performance metrics across various tasks, demonstrating the effectiveness of our classifiers in toxicity detection, bias classification, sentiment analysis, multi-label token classification, and QA. Additionally, we integrated a debiasing process using LLaMa-2-7b-chat [65], re-evaluating our models on the same test set post-debiasing. This allowed us to assess the impact of debiasing on reducing toxicity and bias, and its influence on sentiment analysis. The debiasing process notably improved scores, indicating the efficacy of the debiasing intervention.
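A debiasing rewrite pass of this kind could be sketched with the transformers pipeline as follows; the prompt template is hypothetical, and the LLaMa-2-7b-chat model is gated behind Meta's license:

```python
# Illustrative debiasing rewrite sketch via the Hugging Face pipeline;
# the prompt template is hypothetical, and the model requires accepting
# Meta's license on the Hub before it can be downloaded.
from transformers import pipeline

generator = pipeline("text-generation",
                     model="meta-llama/Llama-2-7b-chat-hf")

biased_text = "Example biased sentence to rewrite."
prompt = ("Rewrite the following text to remove bias and toxicity while "
          f"preserving its meaning:\n{biased_text}\nRewritten text:")

result = generator(prompt, max_new_tokens=100, do_sample=False)
print(result[0]["generated_text"])
```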

Overall, the evaluation and validation phase for these models emphasizes transparent reporting and the public availability of models and detailed results, aligning with the FAIR principles of Reusability and Accessibility.

Deployment and Ongoing Monitoring. FAIR Principles Achieved: Accessibility, Reusability, and Findability.

In the post-deployment phase, we emphasize the FAIR principles of Accessibility, Reusability, and Findability. Our approach includes providing comprehensive documentation and clear usage guidelines to enhance the models' Findability and Accessibility. We employ version control systems to ensure easy tracking and retrieval of the latest model versions, supporting both Findability and Reusability.

To facilitate user engagement and model effectiveness, we offer extensive training materials and support, making the models more accessible to a diverse user base. Regular reporting of performance metrics offers insights into the models' ongoing effectiveness and areas for improvement, aiding in their reusability. We also provide integration support with various tools and platforms, ensuring that our models can be easily adopted in different technological contexts. These practices collectively ensure that our deployed models adhere to the FAIR principles, maintaining their utility and efficacy in a dynamic, user-centric environment.
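As one concrete practice, pinning a dataset revision when loading from the Hugging Face Hub supports this kind of version tracking; the repository identifier and tag below are placeholders, not the actual BiasScan artifacts:

```python
# Version-pinned retrieval supporting Findability and Reusability;
# `load_dataset` accepts a `revision` argument for pinning a version.
# The repository id and tag here are hypothetical placeholders.
from datasets import load_dataset

dataset = load_dataset("newsmediabias/fair-news-bias", revision="v1.0")
print(dataset)
```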

Community Engagement and Collaborative Development. FAIR Principles Achieved: Findability, Accessibility, Interoperability, and Reusability.

In our efforts to foster community engagement and collaborative development, we align with the FAIR principles effectively. We provide open-source development to facilitate collaborative innovation. Our datasets are provided in accessible formats such as JSON and CSV and are indexed with rich metadata. This approach ensures that our LLMs are developed in environments conducive to collaborative progress. The datasets are disseminated under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license [66], aligning our work with the FAIR principles of data stewardship.

Upholding FAIR Principles. We provide a comprehensive analysis in Table 6, showcasing our commitment to upholding the FAIR data principles to enhance the dataset's usefulness for both present and future research endeavors.

Table 6.

FAIR Compliance in Dataset and LLM Development.

Principle: Findability (Met: ✓)
Compliance criteria:
  F1: Rich, descriptive metadata
  F2: Standardized data indexing
  F3: Documented data sources
  F4: Persistent identifiers for data and models
Remarks (implementation details):
  F1: Metadata includes descriptions for data and models.
  F2: Data and models indexed with relevant keywords.
  F3: Clear documentation of data sources and model development.
  F4: Unique identifiers for data and models.

Principle: Accessibility (Met: ✓)
Compliance criteria:
  A1: Clearly defined access protocols
  A2: Long-term data and model preservation
  A3: Data and models available in accessible formats
Remarks (implementation details):
  A1: Open-access repository for data and models.
  A2: Long-term preservation of data and models.
  A3: Data in CSV, JSON, XML; models accessible via APIs.

Principle: Interoperability (Met: ✓)
Compliance criteria:
  I1: Use of standard data and model formats
  I2: Adherence to data and model exchange standards
Remarks (implementation details):
  I1: Data in standard formats; models compatible with common ML frameworks.
  I2: Adheres to standard data and model exchange protocols.

Principle: Reusability (Met: ✓)
Compliance criteria:
  R1: Detailed metadata for data and models
  R2: Adherence to ethical and transparency standards
  R3: Licensing information for data and model usage
Remarks (implementation details):
  R1: Metadata includes usage guidelines for data and models.
  R2: Focus on ethical data use and transparent model training.
  R3: Data licensed under CC BY-NC 4.0; models with open-source licenses.

6.1 Limitations of the FAIR Data Principles in Addressing Data Challenges

The FAIR data principles are a significant step towards enhancing the openness and efficiency of data usage in research and related fields. However, their effectiveness is constrained by several factors. Primarily, while the FAIR principles improve data accessibility and usability, they do not automatically guarantee data quality or validity. The absence of mechanisms for ensuring accuracy, completeness, or reliability could lead to the spread of low-quality data, negatively impacting research and decision-making. Additionally, FAIR principles, though encouraging data sharing, may not adequately address the ethical and privacy concerns associated with sensitive data like health records or personal information, which could result in privacy violations.

Furthermore, the implementation of FAIR principles requires considerable resources, including necessary infrastructure and expertise. This poses a significant challenge, especially for smaller institutions or those in developing countries, potentially exacerbating the digital divide. The broad application of FAIR principles may not be suitable for all scientific disciplines, as different fields might require tailored data management approaches not fully encompassed by these general guidelines. The effectiveness of these principles also depends heavily on the awareness and training of data handlers, where a lack of such training can be a major barrier to adoption. Moreover, the absence of universally accepted standards for evaluating 'FAIRness' leads to an inconsistent application of these principles.

The diversity of data formats creates technical challenges in achieving seamless interoperability, and the emphasis on data sharing might overshadow other vital aspects of data stewardship, such as privacy considerations. Finally, the principles can conflict with intellectual property rights and commercial interests. The concept of free data access and reuse might challenge proprietary research, leading to resistance from certain sectors. This necessitates compliance with various legal and regulatory frameworks, adding another layer of complexity to the application of FAIR principles.

6.2 Future Directions

Strategies for Mitigating Limitations and Enhancing Data Utility

In addressing the limitations of FAIR principles and enhancing data utility, a multifaceted strategy is essential. This involves refining data quality through rigorous validation processes, ensuring accuracy and reliability. Simultaneously, balancing data accessibility with ethical and privacy considerations is crucial; this requires proper anonymization techniques and strict privacy protocols. Emphasizing data preservation and sustainability alongside sharing broadens the scope of stewardship.

Limitation of the Study and Future Perspectives

Our FAIR-compliant dataset, designed to identify and mitigate known biases, may inadvertently overlook emerging biases, thereby emphasizing the need for ongoing revision and vigilant monitoring [63]. Concurrently, scaling the dataset to match the complexity of advanced LLMs, while adhering to ethical standards, emerges as a substantial challenge. This endeavor is compounded by the imperative to mitigate biases in model interpretations, bolster interpretability, and ensure robust data privacy and security [67, 68]. Therefore, future research directions should consider creating dynamic mechanisms for dataset updates [69], advancing bias mitigation techniques, aligning datasets with evolving LLM technologies, exploring scalable dataset maintenance solutions, and broadening the spectrum of LLM applications. Integral to this journey is the development of comprehensive ethical guidelines and the fostering of collaborative research, both of which are pivotal for the responsible evolution of LLM technology.

Our study incorporates FAIR data principles into LLM training and development, improving data management and model training. This approach has yielded a versatile dataset that demonstrates the process of integrating FAIR principles into dataset construction and that can be used for LLM training. While our dataset and case study offer guidance, they do not fully tackle the ‘FAIRification’ of LLMs. Advancements in LLM research must focus on data and the applicability of FAIR principles. This includes updating datasets to capture emerging trends, enhancing bias detection, adapting to novel LLM architectures, improving model interpretability, and formulating ethical guidelines. Our efforts contribute to the responsible advancement of AI, aiming to forge more ethical and efficient AI tools that serve diverse communities.

This research was conducted by the first and corresponding author, Dr. Shaina Raza (shaina. [email protected]). Dr. Shaina Raza is responsible for formulating the research questions, collecting and analyzing the data, and drafting the manuscript. Dr. Deval Pandya and Dr. Chen Ding contributed to shaping the research questions and played a crucial role in designing and refining the manuscript. Shardul Ghuge helped in building models on the data, and Dr. Elham Dolatabadi assisted in the critical analysis of the models' results and in the detailed review. All authors have read and approved the manuscript.

The data sets generated during and/or analyzed during the current study are available in the BiasScan (https://huggingface.co/collections/newsmediabias/biasscan-659d681ed7a5bc9d98cde11b).

Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute.

‘LLMs’ here includes both ‘LMs’ and ‘LLMs’, with ‘LLMs’ representing more advanced versions.

For hyperlinks to the technologies and APIs referenced in the table, refer to S1 Appendix A.

[1]
Jiang
,
Z.
,
Xu
,
F. F.
,
Araki
,
J.
,
Neubig
,
G.
:
How can we know what language models know?
Transactions of the Association for Computational Linguistics
8
,
423
438
(
2020
).
[2]
Zhao
,
W. X.
,
Zhou
,
K.
,
Li
,
J.
,
Tang
,
T.
,
Wang
,
X.
,
Hou
,
Y.
,
Min
,
Y.
,
Zhang
,
B.
,
Zhang
,
J.
,
Dong
,
Z.
, et al
.:
A survey of large language models
. arXiv preprint arXiv: 2303. 18223 (
2023
).
[3]
TrendFeedr
:
Large Language Model (LLM) Trends
. https://trendfeedr.com/blog/large-language-model-llm-trends/. Accessed: 2024-01-01 (
2024
).
[4]
Bender
,
E. M.
,
Gebru
,
T.
,
McMillan-Major
,
A.
,
Shmitchell
,
S.
:
On the dangers of stochastic parrots: Can language models be too big?
In:
Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency
, pp.
610-623
(
2021
).
[5]
Wang
,
Y.
,
Zhong
,
W.
,
Li
,
L.
,
Mi
,
F.
,
Zeng
,
X.
,
Huang
,
W.
,
Shang
,
L.
,
Jiang
,
X.
,
Liu
,
Q.
:
Aligning large language models with human: A survey
. arXiv preprint arXiv: 2307. 12966 (
2023
).
[6]
Chang
,
Y.
,
Wang
,
X.
,
Wang
,
J.
,
Wu
,
Y.
,
Zhu
,
K.
,
Chen
,
H.
,
Yang
,
L.
,
Yi
,
X.
,
Wang
,
C.
,
Wang
,
Y.
, et al
.:
A survey on evaluation of large language models
. arXiv preprint arXiv: 2307. 03109 (
2023
).
[7]
Dunning
,
A.
,
De Smaele
,
M.
,
Böhmer
,
J.
:
Are the fair data principles fair?
International Journal of digital curation
12
(
2
),
177
195
(
1970
).
[8]
Boeckhout
,
M.
,
Zielhuis
,
G. A.
,
Bredenoord
,
A. L.
:
The fair guiding principles for data stewardship: fair enough?
European Journal of Human Genetics
26
(
7
),
931
936
(
2018
).
[9]
Wise
,
J.
,
Barron
,
A. G.
,
Splendiani
,
A.
,
Balali-Mood
,
B.
,
Vasant
,
D.
,
Little
,
E.
,
Mellino
,
G.
,
Harrow
,
I.
,
Smith
,
I.
,
Taubert
,
J.
et al
:
Implementation and relevance of FAIR data principles in biopharmaceutical r& d
.
Drug Discovery Today
24
(
4
),
933
938
(
2019
).
[10]
Chen
,
X.
,
Jagerhorn
,
M.
:
Implementing fair work flows along the research lifecycle
.
Procedia Computer Science
211
,
83
92
(
2022
).
[11]
Deshpande
,
A.
,
Sharp
,
H.
:
Responsible ai systems: Who are the stakeholders?
In:
Proceedings of the 2022 AAAI/ACM Conference on AI
,
Ethics, and Society
, pp.
227
236
(
2022
).
[12]
Ethics
,
O. S. C.
:
Home
(
2024
). Accessed: https://openai.com/policies/supplier-code.
[13]
Partescano
,
E.
,
Molina Jack
,
M. E.
,
Vinci
,
M.
,
Cociancich
,
A.
,
Altenburger
,
A.
,
Giorgetti
,
A.
,
Galgani
,
F.
:
Data quality and fair principles applied to marine litter data in europe
.
Marine Pollution Bulletin
168
,
112965
(
2021
) https://doi.org/10.1016/J.MARPOLBUL.2021.112965.
[14]
Wilkinson
,
M. D.
,
Dumontier
,
M.
,
Aalbersberg
,
I. J.
,
Appleton
,
G.
,
Axton
,
M.
,
Baak
,
A.
,
Blomberg
,
N.
,
Boiten
,
J.-W.
,
Silva Santos
,
L. B.
,
Bourne
,
P. E.
, et al
.:
The fair guiding principles for scientific data management and stewardship
.
Scientific Data
3
(
1
),
1
9
(
2016
).
[15]
Hasnain
,
A.
,
Rebholz-Schuhmann
,
D.
:
Assessing fair data principles against the 5-star open data principles
. In:
The Semantic Web: ESWC 2018 Satellite Events: ESWC 2018 Satellite Events, Heraklion, Crete, Greece, June 3-7, 2018, Revised Selected Papers 15
, pp.
469
477
(
2018
).
Springer
.
[16]
Jacobsen
,
A.
,
Azevedo
,
R.M.
,
Juty
,
N.
,
Batista
,
D.
,
Coles
,
S.
,
Cornet
,
R.
et al
.:
FAIR principles: Interpretations and implementation considerations
.
Data Intelligence
2
(
1-2
),
10
29
(
2020
) doi: https://doi.org/10.1162/dint_r_00024.
[17]
Shmueli
,
B.
,
Fell
,
J.
,
Ray
,
S.
,
Ku
,
L. -W.
:
Beyond fair pay: Ethical implications of nlp crowdsourcing
. In:
North American Chapter of the Association for Computational Linguistics
(
2021
). https://doi.org/10.18653/V1/2021.NAACL-MAIN.295.
[18]
Singh
,
C.
,
Askari
,
A.
,
Caruana
,
R.
,
Gao
,
J.
:
Augmenting interpretable models with large language models during training
.
Nature Communications
14
(
1
),
7913
(
2023
).
[19]
Raji
,
I. D.
,
Bender
,
E. M.
,
Paullada
,
A.
,
Denton
,
E.
,
Hanna
,
A.
:
AI and the everything in the whole wide world benchmark
. arXiv preprint arXiv: 2111. 15366 (
2021
).
[20]
Jobin
,
A.
,
Ienca
,
M.
,
Vayena
,
E.
:
The global landscape of ai ethics guidelines
.
Nature Machine Intelligence
1
(
9
),
389
399
(
2019
).
[21]
Alvarez-Romero
,
C.
,
Rodríguez-Mejias
,
S.
,
Parra-Calderón
,
C.
:
Desiderata for the data governance and FAIR principles adoption in health data hubs
.
Study in Health Technology and Informatics.
305
(
164-167
), (
2023
) doi: 10.3233/SHTI230452.
[22] Inau, E.T., Sack, J., Waltemath, D., Zeleke, A.A.: Initiatives, concepts, and implementation practices of FAIR (findable, accessible, interoperable, and reusable) data principles in health data stewardship practice: protocol for a scoping review. JMIR Research Protocols 10(2), 22505 (2021).
[23] Sadeh, Y., Denejkina, A., Karyotaki, E., Lenferink, L.I., Kassam-Adams, N.: Opportunities for improving data sharing and FAIR data practices to advance global mental health. Cambridge Prisms: Global Mental Health 10, 14 (2023).
[24] Stanciu, A.: Data management plan for healthcare: Following FAIR principles and addressing cybersecurity aspects. A systematic review using InstructGPT. medRxiv, 2023-04 (2023).
[25] Raycheva, R., Kostadinov, K., Mitova, E., Bogoeva, N., Iskrov, G., Štefanov, G., Štefanov, R.: Challenges in mapping European rare disease databases, relevant for ML-based screening technologies in terms of organizational, FAIR and legal principles: scoping review. Frontiers in Public Health 11 (2023).
[26] Vesteghem, C., Brøndum, R.F., Sønderkær, M., Sommer, M., Schmitz, A., Bødker, J.S., Dybkær, K., El-Galaly, T.C., Bøgsted, M.: Implementing the FAIR Data Principles in precision oncology: review of supporting initiatives. Briefings in Bioinformatics 21(3), 936-945 (2019). https://doi.org/10.1093/bib/bbz044.
[27] Dungkek, K.: FAIR principles for data and AI models in high energy physics research and education. arXiv (2022). https://doi.org/10.48550/arxiv.2211.15021.
[28] Inau, E.T., Sack, J., Waltemath, D., Zeleke, A.A.: Initiatives, concepts, and implementation practices of the findable, accessible, interoperable, and reusable data principles in health data stewardship: Scoping review. Journal of Medical Internet Research 25, 45013 (2023).
[29] Jeliazkova, N., Kochev, N., Tancheva, G.: FAIR data model for chemical substances: Development challenges, management strategies, and applications. In: Data Integrity and Data Governance. IntechOpen (2023). doi:10.5772/intechopen.110248.
[30] Axton, M., Baak, A., Blomberg, N., Boiten, J.-W., Silva Santos, L.B., Bourne, P.E., Bouwman, J., Brookes, A.J., Clark, T., et al.: The FAIR guiding principles for scientific data management and stewardship. Scientific Data 3, 160018 (2016).
[31] Wilcox, D.: Supporting FAIR data principles with Fedora. LIBER Quarterly: The Journal of the Association of European Research Libraries 28(1), 1-8 (2018).
[32] Huerta, E.A., Blaiszik, B., Brinson, L.C., Bouchard, K.E., Diaz, D., Doglioni, C., Duarte, J.M., Emani, M., Foster, I., Fox, G., Harris, P., Heinrich, L., Jha, S., Katz, D.S., Kindratenko, V., Kirkpatrick, C.R., Lassila-Perini, K., Madduri, R.K., Neubauer, M.S., Psomopoulos, F.E., Roy, A., Rübel, O., Zhao, Z., Zhu, R.: FAIR for AI: An interdisciplinary and international community building perspective. Scientific Data 10(1) (2023). https://doi.org/10.1038/s41597-023-02298-6.
[33] Bernabé, C., Sales, T.P., Schultes, E., Ulzen, N., Jacobsen, A., Silva Santos, L.O.B., Mons, B., Roos, M.: A goal-oriented method for FAIRification planning. In: CEUR Workshop Proceedings (2023). doi:10.21203/rs.3.rs3092538/v1.
[34] Bateni, A., Chan, M., Eitel-Porter, R.J.: AI fairness: from principles to practice. arXiv (2022). https://doi.org/10.48550/arXiv.2207.09833.
[35] Findlay, M., Seah, J.: An ecosystem approach to ethical AI and data use: experimental reflections. In: 2020 IEEE/ITU International Conference on Artificial Intelligence for Good (AI4G), pp. 192-197 (2020). IEEE.
[36] Santos, L.O.B.d.S., Sales, T.P., Fonseca, C.M., Guizzardi, G.: Towards a conceptual model for the FAIR Digital Object Framework. arXiv preprint arXiv:2302.11894 (2023).
[37] Götz, A.: The FAIR principles: Trusting in FAIR data repositories. Open Access Government (2023).
[38] Wang, M., Savard, D.: The FAIR principles and research data management. Research Data Management in the Canadian Context (2023).
[39] Lamprecht, A.-L., Garcia, L., Kuzak, M., Martinez, C., Arcila, R., Martin Del Pico, E., Dominguez Del Angel, V., Van De Sandt, S., Ison, J., Martinez, P.A., et al.: Towards FAIR principles for research software. Data Science 3(1), 37-59 (2020).
[40] Sales, L., Henning, P., Veiga, V., Costa, M.M., Sayão, L.F., Silva Santos, L.O.B., Pires, L.F.: GO FAIR Brazil: a challenge for Brazilian data science. Data Intelligence 2(1-2), 238-245 (2020).
[41] Silva Santos, L.B., Wilkinson, M.D., Kuzniar, A., Kaliyaperumal, R., Thompson, M., Dumontier, M., Burger, K.: FAIR data points supporting big data interoperability. In: Enterprise Interoperability in the Digitized and Networked Factory of the Future, pp. 270-279. ISTE, London (2016).
[42] Vita, R., Overton, J.A., Mungall, C.J., Sette, A., Peters, B.: FAIR principles and the IEDB: short-term improvements and a long-term vision of OBO Foundry-mediated machine-actionable interoperability. Database 2018, 105 (2018).
[43] He, Y., Xiang, Z., Zheng, J., Lin, Y., Overton, J.A., Ong, E.: The eXtensible Ontology Development (XOD) principles and tool implementation to support ontology interoperability. Journal of Biomedical Semantics 9, 1-10 (2018).
[44] Wilkinson, M.D., Sansone, S.-A., Schultes, E., Doorn, P., Silva Santos, L.O., Dumontier, M.: A design framework and exemplar metrics for FAIRness. Scientific Data 5(1), 1-4 (2018).
[45] Schultes, E.A., Strawn, G.O., Mons, B., et al.: Ready, Set, GO FAIR: Accelerating convergence to an Internet of FAIR data and services. DAMDID/RCDL 19, 23 (2018).
[46] Anguswamy, R., Frakes, W.B.: A study of reusability, complexity, and reuse design principles. In: Proceedings of the ACM-IEEE International Symposium on Empirical Software Engineering and Measurement, pp. 161-164 (2012).
[47] Wolf, M., Logan, J., Mehta, K., Jacobson, D., Cashman, M., Walker, A.M., Eisenhauer, G., Widener, P., Cliff, A.: Reusability first: Toward FAIR workflows. In: 2021 IEEE International Conference on Cluster Computing (CLUSTER), pp. 444-455 (2021). IEEE.
[48] Raza, S., Schwartz, B.: Constructing a disease database and using natural language processing to capture and standardize free text clinical information. Scientific Reports 13(1), 8591 (2023).
[49] Monarch, R.M.: Human-in-the-Loop Machine Learning: Active Learning and Annotation for Human-centered AI. Simon and Schuster (2021).
[50] Xi, Z., Chen, W., Guo, X., He, W., Ding, Y., Hong, B., Zhang, M., Wang, J., Jin, S., Zhou, E., et al.: The rise and potential of large language model based agents: A survey. arXiv preprint arXiv:2309.07864 (2023).
[51] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Computing Surveys 55(12), 1-38 (2023).
[52] Raza, S., Reji, D.J., Ding, C.: Dbias: detecting biases and ensuring fairness in news articles. International Journal of Data Science and Analytics, 1-21 (2022).
[53] Raza, S., Pour, P.O., Bashir, S.R.: Fairness in machine learning meets with equity in healthcare. arXiv preprint arXiv:2305.07041 (2023).
[54] He, Z., Xie, Z., Jha, R., Steck, H., Liang, D., Feng, Y., Majumder, B.P., Kallus, N., McAuley, J.: Large language models as zero-shot conversational recommenders. In: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pp. 720-730 (2023).
[55] Porsdam Mann, S., Earp, B.D., Møller, N., Vynn, S., Savulescu, J.: Autogen: A personalized large language model for academic enhancement - ethics and proof of principle. The American Journal of Bioethics 23(10), 28-41 (2023).
[56] Ranathunga, S., Lee, E.-S.A., Prifti Skenduli, M., Shekhar, R., Alam, M., Kaur, R.: Neural machine translation for low-resource languages: A survey. ACM Computing Surveys 55(11), 1-37 (2023).
[57] Bai, H., Hou, L., Shang, L., Jiang, X., King, I., Lyu, M.R.: Towards efficient post-training quantization of pre-trained language models. Advances in Neural Information Processing Systems 35, 1405-1418 (2022).
[58] Ntoutsi, E., Fafalios, P., Gadiraju, U., Iosifidis, V., Nejdl, W., Vidal, M.-E., Ruggieri, S., Turini, F., Papadopoulos, S., Krasanakis, E., et al.: Bias in data-driven artificial intelligence systems: An introductory survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 10(3), 1356 (2020).
[59] Raza, S., Garg, M., Reji, D.J., Bashir, S.R., Ding, C.: Nbias: A natural language processing framework for bias identification in text. Expert Systems with Applications 237, 121542 (2024).
[60] Nadeem, M., Bethke, A., Reddy, S.: StereoSet: Measuring stereotypical bias in pretrained language models. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP) (2021).
[61] Barikeri, S., Lauscher, A., Vulić, I., Glavaš, G.: RedditBias: A real-world resource for bias evaluation and debiasing of conversational language models. arXiv preprint arXiv:2106.03521 (2021).
[62] Raza, S., Ding, C.: Fake news detection based on news content and social contexts: a transformer-based approach. International Journal of Data Science and Analytics 13(4), 335-362 (2022).
[63] May, C., Wang, A., Bordia, S., Bowman, S.R., Rudinger, R.: On measuring social biases in sentence encoders. NAACL HLT 2019, 622-628 (2019).
[64] Gilardi, F., Alizadeh, M., Kubli, M.: ChatGPT outperforms crowd workers for text-annotation tasks. Proceedings of the National Academy of Sciences 120(30), 2305016120 (2023).
[65] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023).
[66] Creative Commons: Creative Commons Attribution-NonCommercial 4.0 International License. https://creativecommons.org/licenses/by-nc/4.0/. Accessed 2023-12-10 (2023).
[67] Zhao, H., Chen, H., Yang, F., Liu, N., Deng, H., Cai, H., Wang, S., Yin, D., Du, M.: Explainability for large language models: A survey. arXiv preprint arXiv:2309.01029 (2023).
[68] Chen, Y., Arunasalam, A., Celik, Z.B.: Can large language models provide security & privacy advice? Measuring the ability of LLMs to refute misconceptions. In: Proceedings of the 39th Annual Computer Security Applications Conference, pp. 366-378 (2023).
[69] Wilson, M., Petty, J., Frank, R.: How abstract is linguistic generalization in large language models? Experiments with argument structure. Transactions of the Association for Computational Linguistics 11, 1377-1395 (2023).

APPENDICES

A. Acronyms and Full Forms

Academia.edu, Academia.edu; AI4ALL, AI4ALL; Algolia, Algolia; Amazon S3, Amazon Simple Storage Service; Apache Atlas, Apache Atlas; Apache Lucene, Apache Lucene; Apache NiFi, Apache NiFi; Apache Solr, Apache Solr; Archivematica, Archivematica; CKAN, Comprehensive Knowledge Archive Network; CLOCKSS, Controlled LOCKSS; Collibra, Collibra; Creative Commons, Creative Commons; Crossref, Crossref; DSpace, DSpace; Dublin Core, Dublin Core Metadata Initiative; Data Provenance Tools, Data Provenance Tools; DataCite, DataCite; Dataverse, The Dataverse Project; EPrints, EPrints; Ecoinformatics, Ecoinformatics; Elasticsearch, Elasticsearch; FGED, Functional Genomics Data Society; Figshare, Figshare; GDPR, General Data Protection Regulation; Google Cloud Storage, Google Cloud Storage; GraphQL, GraphQL; HL7 FHIR, Health Level Seven Fast Healthcare Interoperability Resources; IEDB, Immune Epitope Database; ISO/IEC, International Organization for Standardization/International Electrotechnical Commission; LOCKSS, Lots Of Copies Keep Stuff Safe; NCBI, National Center for Biotechnology Information; OAI-PMH, Open Archives Initiative Protocol for Metadata Harvesting; Omeka, Omeka; OneTrust, OneTrust; OpenAI Ethics Guidelines, OpenAI Ethics Guidelines; OpenAPI, OpenAPI Initiative; OpenRefine, OpenRefine; ORCID, Open Researcher and Contributor ID; OWL, Web Ontology Language; Portico, Portico; PROV-DM, Provenance Data Model; RDF, Resource Description Framework; RE3data, Registry of Research Data Repositories; ResearchGate, ResearchGate; Responsible AI, Responsible AI; REST, Representational State Transfer; schema.org, Schema.org; SOAP, Simple Object Access Protocol; SPARQL, SPARQL Protocol and RDF Query Language; Talend, Talend; TrustArc, TrustArc; XOD, eXtensible Ontology Development; XSLT, Extensible Stylesheet Language Transformations; Zenodo, Zenodo; gRPC, gRPC Remote Procedure Calls.

B. Data Management Challenges in LLMs and Corresponding FAIR Principles

Table 1.

Data Management Challenges in LLMs and Corresponding FAIR Principles.

Data Management Challenge | FAIR Principle Addressed
Vast and Complex Datasets | Findability through detailed metadata and persistent identifiers to enhance data discovery.
Data Quality and Bias | Accessibility with ethical access protocols to provide high-quality, unbiased data.
Privacy and Ethical Concerns | Reusability with clear legal and ethical documentation to uphold privacy and ethics.
Data Annotation and Labeling | Interoperability through standardized data formats for reliable annotation and labeling.
Data Accessibility and Sharing | Accessibility to ensure a balance between open data sharing and protection of proprietary information.
Legal and Regulatory Compliance | Reusability to align data management with legal standards for future use.
Contextual Awareness | Interoperability for enhanced NLP algorithms to accurately capture and interpret context.
Accuracy and Reliability | Accessibility and Reusability to ensure mechanisms for improved fact-checking and consistent data quality.
Ethical and Fair Use | Accessibility with advanced bias detection algorithms to promote fairness.
Interactivity and Personalization | Findability and Accessibility for adaptive learning from user interactions for personalized experiences.
Language and Cultural Sensitivity | Interoperability to support expanded language models and cultural datasets for inclusivity.
Technical and Scalability | Reusability for efficient processing strategies that facilitate the scalable use of computational resources.

C. Inter-Annotator Agreement

Figure 1 depicts the level of concordance among expert annotators across the different dimensions of bias, with the strongest agreement observed for ‘Socioeconomic Bias’.

Figure 1. Expert Agreement Across Bias Dimensions. The bar graph quantifies the concordance between the domain experts' evaluations and the model's predictions.

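The appendix does not state which agreement statistic underlies Figure 1. Purely as an illustration, per-dimension agreement between an expert's labels and the model's predictions could be computed with Cohen's kappa; the dimension names and label arrays below are made-up placeholders, not values from the study:

from sklearn.metrics import cohen_kappa_score

# Hypothetical bias labels (1 = biased, 0 = not biased) per bias dimension.
# In practice these would come from the annotated dataset of Appendix D.
expert_labels = {
    "Socioeconomic Bias": [1, 0, 1, 1, 0, 1],
    "Professional Integrity": [1, 1, 0, 0, 0, 1],
}
model_predictions = {
    "Socioeconomic Bias": [1, 0, 1, 1, 0, 0],
    "Professional Integrity": [1, 0, 0, 1, 0, 1],
}

# Cohen's kappa per dimension: chance-corrected agreement between
# the expert annotations and the model's predictions.
for dim in expert_labels:
    kappa = cohen_kappa_score(expert_labels[dim], model_predictions[dim])
    print(f"{dim}: kappa = {kappa:.2f}")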

D. Dataset Schema and Formats

The dataset has the following attributes: Text, Dimension, Biased Words, Aspect, Bias Label, Sentiment, Toxic, Identity Mention, and Debiased Text. An example record:

Text: Lawyers are always manipulative and cannot be trusted.

Dimension: Professional Integrity

Biased Words: always, manipulative, cannot be trusted

Aspect: Trustworthiness of lawyers

Bias Label: BIASED

Sentiment: Negative

Toxic: Yes

Identity Mention: Lawyers

Debiased Text: Trustworthiness is an individual trait and varies among professionals, including lawyers.
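For illustration only, this record could be serialized as a single JSON object. The sketch below is an assumption about one reasonable on-disk representation (key names mirror the attribute list; the list structure for Biased Words and the boolean for Toxic are choices of this sketch, not the dataset's published schema):

import json

# Hypothetical serialization of the example record above.
record = {
    "Text": "Lawyers are always manipulative and cannot be trusted.",
    "Dimension": "Professional Integrity",
    "Biased Words": ["always", "manipulative", "cannot be trusted"],
    "Aspect": "Trustworthiness of lawyers",
    "Bias Label": "BIASED",
    "Sentiment": "Negative",
    "Toxic": True,
    "Identity Mention": "Lawyers",
    "Debiased Text": "Trustworthiness is an individual trait and varies among professionals, including lawyers.",
}

print(json.dumps(record, indent=2))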

Table 2 provides an example of each data format, demonstrating how the dataset attributes map onto diverse machine learning task formats:

Table 2.

Examples of Dataset Formats.

Format | Example
Classification | {Text: Politicians are often corrupt., Bias Label: BIASED, Sentiment: Negative}
CoNLL | {Text: Doctors are always caring., Biased Words: [always, caring], Identity Mention: Doctors}
SQuAD | {Text: Why are lawyers untrustworthy?, Aspect: Trustworthiness, Debiased Text: Trustworthiness varies individually.}
Counterfactual | {Text: Young people are irresponsible with money., Biased Words: [irresponsible], Debiased Text: Financial habits vary by individual.}
Sentiment/Toxicity | {Text: Politicians are dishonest., Sentiment: Negative, Toxic: Yes}
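As a sketch of how such format projections could be produced programmatically, the helpers below are illustrative assumptions rather than part of any released tooling; in particular, the B-BIAS/O tagging scheme in the CoNLL projection is an assumed convention for marking biased words:

def to_classification(record):
    # Project a full record onto the classification format of Table 2.
    return {
        "Text": record["Text"],
        "Bias Label": record["Bias Label"],
        "Sentiment": record["Sentiment"],
    }

def to_conll(record):
    # Project a record onto CoNLL-style token tags: tokens that appear
    # in a biased phrase are tagged B-BIAS, all others O.
    biased_tokens = {
        word
        for phrase in record["Biased Words"]
        for word in phrase.split()
    }
    tokens = record["Text"].rstrip(".").split()
    return [(tok, "B-BIAS" if tok in biased_tokens else "O") for tok in tokens]

# Hypothetical record assembled from the Table 2 examples for demonstration.
example = {
    "Text": "Doctors are always caring.",
    "Biased Words": ["always", "caring"],
    "Bias Label": "BIASED",
    "Sentiment": "Negative",
}
print(to_classification(example))
print(to_conll(example))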
This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. For a full description of the license, please visit https://creativecommons.org/licenses/by/4.0/legalcode.