ABSTRACT
The rapid evolution of Large Language Models (LLMs) highlights the necessity for ethical considerations and data integrity in AI development, particularly emphasizing the role of FAIR (Findable, Accessible, Interoperable, Reusable) data principles. While these principles are crucial for ethical data stewardship, their specific application to LLM training data remains under-explored. This research gap is the focus of our study, which begins with an examination of existing literature to underline the importance of FAIR principles in managing data for LLM training. Building upon this, we propose a novel framework designed to integrate FAIR principles into the LLM development lifecycle. A contribution of our work is a comprehensive checklist intended to guide researchers and developers in applying FAIR data principles consistently across the model development process. The utility and effectiveness of our framework are validated through a case study on creating a FAIR-compliant dataset aimed at detecting and mitigating biases in LLMs. We present this framework to the community as a tool to foster the creation of technologically advanced, ethically grounded, and socially responsible AI models.
1. INTRODUCTION
In the era of artificial intelligence (AI), Language Models (LMs) play an important role in advancing diverse AI applications [1]. From virtual assistants to content generation, LMs have become indispensable in shaping both the future trajectory of academic research and a wide range of practical applications. The impact of this transformation is further amplified by the evolution of Large Language Models (LLMs) [2], such as OpenAI's GPT series, Gemini, LLaMa, and Falcon. As of January 2024, LLM development has attracted $18.2 billion in funding and generated $2.1 billion in revenue [3].
The rapid success of these LLMs highlights the importance of diverse data for broadening their applicability across different domains. This idea is well-supported in research [4, 5], which affirms the importance of high-quality data in training these models. However, training these large models on data from diverse sources raises complex challenges of ethical and responsible data practice in their implementation [5, 6].
The FAIR data principles, which stand for Findable, Accessible, Interoperable, and Reusable [7, 8], were initially established to improve the stewardship of scientific data. These principles can be applied to any model development lifecycle [9, 10] and have become increasingly recognized in responsible AI development [11]. Recently, the relevance of FAIR principles has been particularly highlighted in generative AI due to ethical challenges such as bias, privacy concerns, and the potential misuse of AI-generated content [12, 13]. This underscores the growing importance of ensuring that the data used in building these LLMs is findable, accessible, interoperable, and reusable, adhering to ethical standards.
Seminal works in data science and management [7, 14-17] have explored FAIR principles in various domains. Their application in the training and development of LLMs, however, is a developing area of research. While recent advances in LLM studies [5, 6] have focused on aligning LLMs with ethical standards including human values, the direct incorporation of FAIR principles in LLM training is less explicit. Challenges specific to LLMs, such as addressing data biases, toxicity, and stereotypes, and the need for model explainability and interpretability [18], further highlight the need for ethical and balanced training approaches. This emphasis on data ethics is in line with broader trends in AI ethics and data science [19, 20], which stress the necessity of integrating these principles into LLM development.
In this work, we shed some light on FAIR data principles and propose a model development lifecycle for LLM training that incorporates FAIR principles at each stage. One of the contributions of this work is the development of a FAIR-compliant dataset, designed to include a wide array of narratives from diverse sources. To emphasize the significance of FAIR data principles in the context of preparing this dataset for LLM training, it is crucial to understand that while this process may not directly equate to the complete “FAIRification” of LLMs, it represents a critical and foundational step towards it. The application of FAIR principles ensures that the data feeding into LLMs is of high quality and organized in a way that maximizes its utility, thereby enhancing the model's performance and reliability.
The central contribution of our work, however, extends beyond simple data collection and preparation for LLM training. We have placed a strong emphasis on rigorously aligning the LLM training dataset with FAIR principles. We acknowledge that adherence to FAIR principles (even at the strictest level) may not equate to absolute ethical compliance; however, it represents a crucial step in that direction. Our research establishes a foundational framework for further advanced studies in this field. The primary contributions of our work are:
1.1 Contributions
An exploration of FAIR data principles in general AI research, including the provision of a comprehensive checklist for researchers and developers.
Introduction of an innovative framework that integrates FAIR data principles throughout the LLM training lifecycle, ensuring ethical and effective application in various AI contexts.
Demonstration of the practical benefits of a FAIR-compliant dataset through a case study focused on identifying and mitigating biases prior to training LLMs. While the topic of bias is broad-ranging, our case study specifically addresses linguistic biases targeting protected groups.
The data used for pre-training LLMs is predominantly unstructured. However, as demonstrated in this case study, if it is accurately labeled and formatted—interoperability being one such key principle—it can then be leveraged for fine-tuning a wide array of downstream tasks. Furthermore, adherence to FAIR principles not only ensures data is handled correctly but also significantly boosts the credibility of the models.
2. METHODOLOGY
2.1 Literature Selection Criteria
This study investigates articles published within the past five years, focusing on the rise of generative AI models like those in the GPT series. The study specifically targets English-language articles from leading AI and ML journals and conferences. We explored several databases and proceedings from high-quality venues, including journals from Elsevier, Springer, the Nature portfolio, and IEEE Transactions, and key conferences such as NeurIPS, ICML, and ACL. Recognizing the fast-paced evolution of LLM research, we included relevant preprints and seminal works for a comprehensive perspective. This search resulted in 135 papers, of which we carefully chose about 75 that directly pertained to FAIR data principles in the context of training models. We noted a scarcity of work combining LLMs with FAIR data principles, a gap that our study aims to address. The search query for this study is:
(Findability OR data discovery OR metadata standards OR persistent identifiers) AND (Accessibility OR data access OR data sharing policies OR authentication and authorization mechanisms) AND (Interoperability OR data integration OR standardized data formats OR cross-domain data exchange) AND (Reusability OR data documentation OR data quality assurance OR long-term data preservation) AND (large language models OR LLMs OR AI models OR machine learning models) AND (training data management OR ethical data sourcing OR bias in AI datasets OR responsible data use in AI) AND (ethical considerations in AI OR AI ethics OR responsible AI OR ethical AI development) PUBLISHED FROM 2018 TO 2023 IN English IN Journals and Conferences (e.g., JAIR, JMLR, IEEE Transactions, NeurIPS, ICLR, ACL, Springer, Nature portfolio, Elsevier).
2.2 Comparative Analysis of Related Work
We conducted a meta-review of key articles on FAIR data principles, categorizing them by domain and their relevance to LLMs. We also highlight how our work differs from previous studies. The findings are summarized in Table 1.
| Reference | Summary | Domain Focus | LLM Relevance |
|---|---|---|---|
| [21] | Discusses data governance for FAIR principles in health data management. | Health Data | Not Applicable |
| [22] | Reviews initiatives for applying FAIR principles in managing health data. | Health Data | Not Applicable |
| [23] | Examines FAIR data practices' role in global mental health research. | Mental Health Research | Not Applicable |
| [24] | Analyzes FAIR principles in healthcare data, emphasizing cybersecurity. | Health Data | Partially (Instruct-GPT) |
| [25] | Addresses challenges in FAIR application and legal aspects in European rare disease databases for ML technologies. | Health Data | Not Applicable |
| [26] | Elaborates on FAIR principles in precision oncology research. | Health Data | Not Applicable |
| [16] | Offers a broad perspective on FAIR principles' implementation across various fields. | Multi-Domain | Not Applicable |
| Our Study | Evaluates the application of FAIR data principles in the creation and training of datasets for LLMs. This study encompasses a range of models including the BERT and GPT families, as well as the LLaMa-2-7b model, highlighting diverse approaches in LLM development. | News Data | Yes (BERT, GPT families, LLaMa-2-7b) |
Table 1 highlights a focus on health data in existing research on FAIR principles. Only one study incorporates an LLM application (a GPT-based model). Our work, however, stands out for its application of FAIR data principles to a range of LLMs, including pretrained LMs, the GPT series, and LLaMa-2 models. This approach broadens the scope and generalizability of our research.
3. FAIR DATA PRINCIPLES: THEORETICAL BACKGROUND AND SIGNIFICANCE
The FAIR data principles significantly influence data management practices across various fields. These include physics [27], health [28], environmental science [13], pharmaceuticals [9], chemistry [29], computer science [30], research and development [31], and clinical studies [26]. These principles ensure data is well-organized, accessible, and reusable in different research domains and applications. In AI and Natural Language Processing (NLP), the implementation of interdisciplinary strategies [32] and structured approaches [33] is crucial to uphold ethical data practices [17] and ensure fairness in AI models [34]. Given the constraints of space, we present a concise overview of each FAIR principle, as depicted in Figure 1. Additionally, a compliance checklist is provided in Table 2.
| Principle | Features | Compliance list | Tools and Practices |
|---|---|---|---|
| Findability | F1: Rich and descriptive metadata. F2: Standardized data indexing. F3: Documented data sources. F4: Advanced search functionalities. F5: Persistent identifiers like DOIs. | F1: Metadata includes titles, authors, abstracts, keywords, and affiliations. F2: Use of standardized taxonomy and ontology for indexing. F3: Clear documentation of data sources and collection methodologies. F4: Implementation of advanced search tools and interfaces. F5: Assignment of persistent identifiers such as DOIs. | Metadata Management: Apache Atlas, Collibra, ORCID. Persistent Identifiers: Crossref for DOIs. Data Indexing: Elasticsearch, Apache Solr, DSpace. Search Interfaces: Algolia, Apache Lucene. Data Repositories: NCBI, RE3data, CKAN, Dataverse, Zenodo, Figshare, EPrints, ResearchGate, Academia.edu |
| Accessibility | A1: Clear data access protocols. A2: Long-term data preservation. A3: Open access platforms. A4: Standardized APIs for data access. A5: Data availability in accessible formats. | A1: Detailed access instructions and authentication processes. A2: Use of reliable digital preservation services like CLOCKSS or Portico. A3: Data deposited in open repositories such as Figshare or Zenodo. A4: APIs conform to standards such as OpenAPI for ease of use. A5: Data provided in multiple formats (e.g., CSV, JSON, XML) to ensure usability. | API Tools: OpenAPI, GraphQL, RESTful interfaces. Data Preservation: Archivematica, LOCKSS, Zenodo, Figshare. Cloud Storage: Amazon S3, Google Cloud Storage, Microsoft Azure. Ethical Access: OneTrust, TrustArc |
| Interoperability | I1: Standard data formats. I2: Common communication protocols. I3: Data exchange standards. I4: Tools for data transformation and mapping. | I1: Data conforms to community-recognized standards (e.g., MIAME, Ecological Metadata Language). I2: Support for protocols such as OAI-PMH for metadata harvesting. I3: Use of frameworks like Schema.org for structured data. I4: Availability of services like XSLT or OpenRefine for data conversion and mapping. | Data Formats: JSON, XML, CSV, DICOM, GenBank. Protocols: HTTP, SOAP, REST, gRPC. Data Standards: RDF, HL7 FHIR, ISO/IEC standards. Transformation Tools: XSLT, Talend, Informatica, Apache NiFi. Ontology Systems: OWL, SPARQL, XOD. Specialized Databases: IEDB |
| Reusability | R1: Detailed metadata for context. R2: Data curated for future use. R3: Adherence to ethical standards. R4: Licensing frameworks. R5: Consideration of societal impacts. | R1: Comprehensive metadata including experimental conditions, methodologies, and provenance. R2: Curating data with clear versioning and update records. R3: Compliance with GDPR and other privacy regulations. R4: Use of licenses like Creative Commons to clarify user rights. R5: Assessment and documentation of data impact on society and potential biases. | Metadata Standards: Dublin Core, DataCite, Schema.org. Data Curation: CKAN, DSpace, Omeka. Ethical Frameworks: Responsible AI, OpenAI Ethics Guidelines, AI4ALL. Licensing Tools: Creative Commons. Data Provenance Tools: PROV-DM provenance tracking |
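As a concrete illustration, the compliance items in Table 2 can be encoded as a simple programmatic checklist. The sketch below is a minimal, hypothetical Python example; the record field names and the specific checks are our own assumptions for illustration, not part of any standard FAIR tooling.

```python
# Minimal sketch of an automated FAIR compliance check for a dataset record.
# Field names (title, doi, formats, ...) are illustrative assumptions.

def check_fair_compliance(record: dict) -> dict:
    """Return a dict mapping selected checklist items to pass/fail booleans."""
    return {
        # Findability: rich metadata (F1) and a persistent identifier (F5)
        "F1_metadata": all(k in record for k in ("title", "authors", "abstract", "keywords")),
        "F5_persistent_id": record.get("doi", "").startswith("10."),
        # Accessibility: data offered in at least one open, standard format (A5)
        "A5_formats": bool(set(record.get("formats", [])) & {"csv", "json", "xml"}),
        # Interoperability: a community-recognized schema is declared (I3)
        "I3_schema": "schema" in record,
        # Reusability: an explicit license is attached (R4)
        "R4_license": bool(record.get("license")),
    }

record = {
    "title": "Bias-annotated news dataset",
    "authors": ["A. Researcher"],
    "abstract": "News articles annotated for linguistic bias.",
    "keywords": ["LLMs", "bias", "news"],
    "doi": "10.1234/example",
    "formats": ["csv", "json"],
    "schema": "schema.org/Dataset",
    "license": "CC-BY-4.0",
}
print(check_fair_compliance(record))
```

In practice such a checker would be one gate in a data-release pipeline, run before a dataset version is deposited in a repository.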
3.1 Findability
The principle of Findability is important for ensuring that data and resources are not only discoverable but also readily accessible [35]. It involves the establishment of detailed metadata, the implementation of persistent identifiers, and the promotion of effective data indexing and search strategies. This principle significantly enhances the discoverability of data by both humans and machines, facilitating a smoother research process. Recent contributions to the enhancement of data findability within the FAIR framework include research on strategies for the FAIRification [33] of data to improve its findability, and studies [36] investigating models for the FAIR digital object framework with a focus on improving the findability of digital objects. Discussions [32] on the application of FAIR principles specifically within AI datasets highlight the growing recognition of these principles' importance. Additionally, considerations on the trustworthiness of data repositories underline findability's critical role in ensuring data quality and reliability in these environments [37].
3.2 Accessibility
Accessibility, as articulated by the FAIR data principles, emphasizes the straightforwardness of obtaining and utilizing data upon its discovery [8]. This aspect is integral to the FAIR framework and encompasses strategies for long-term data preservation, the establishment of ethical access protocols, and the assurance of data retrievability and usability across time. Key discussions [38] on the role of trust in data repositories, an essential component of accessibility, offer a detailed exploration of accessibility within the context of the FAIR principles. The application of FAIR principles to research software is also discussed in the literature [39], with a specific lens on enhancing accessibility. The support of the Fedora platform for FAIR data principles, especially in terms of accessibility, is examined in a related work [31]. Furthermore, the challenges and opportunities encountered in the implementation of FAIR principles, including accessibility, within the Brazilian data science landscape are also discussed [40]. Collectively, these contributions highlight the critical importance of embedding accessibility considerations into research software and platform support, and of addressing implementation hurdles.
3.3 Interoperability
Interoperability, a core aspect of the FAIR data principles, denotes the capability of different data systems to operate cohesively [41]. This principle necessitates the adoption of standardized data formats and protocols to facilitate straightforward data exchange and integration across heterogeneous systems. Discussions on the Immune Epitope Database (IEDB) emphasize its commitment to interoperability through the adoption of FAIR principles [42]. The development and support of interoperability in ontology creation, as facilitated by the eXtensible ontology development (XOD) principles and tools, are detailed in [43]. Furthermore, the examination of challenges and strategies related to the implementation of FAIR data principles, with a particular focus on interoperability in data science, is undertaken in [40]. Additionally, a framework along with metrics for assessing the FAIRness of data, emphasizing interoperability, are delineated in [44]. The objectives and efforts of the GO FAIR initiative, aimed at promoting the widespread adoption of FAIR principles, including interoperability, are elaborated in [45]. Collectively, these contributions highlight the role of interoperability in promoting responsible and efficient data management across diverse scientific and technological domains.
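To make the interoperability requirement concrete, the sketch below converts the same records between two of the standard formats listed in Table 2 (CSV and JSON) using only the Python standard library; the column names are illustrative, not a prescribed schema.

```python
import csv
import io
import json

# Hypothetical CSV export of annotated entries; column names are
# illustrative only.
csv_text = """id,text,label
1,Example sentence one.,neutral
2,Example sentence two.,biased
"""

# CSV -> list of dicts -> JSON: the same records become consumable by any
# system that speaks either format, which is the essence of interoperability.
records = list(csv.DictReader(io.StringIO(csv_text)))
json_text = json.dumps(records, indent=2)
print(json_text)
```

Real pipelines would additionally validate each record against a declared schema (e.g., a Schema.org type) before exchange.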
3.4 Reusability
Reusability, as one of the foundational aspects of the FAIR data principles, highlights the necessity for data to be stored and documented in a manner that facilitates future retrieval and reuse [46]. This principle is supported by the creation of comprehensive metadata, consideration of legal and ethical frameworks, and the assessment of potential societal impacts. Research focusing on a structured methodology for planning the FAIRification of data, particularly with reusability in mind, is presented in [33]. Furthermore, the development of a model for digital objects adhering to FAIR principles that prioritizes reusability is detailed in [36]. The exploration of FAIR principles for dataset reusability is discussed in [32]. Additionally, the case for workflows that prioritize reusability from the outset is argued in [47]. An exploration of FAIR principles, including detailed insights on reusability, is presented in [16]. Collectively, these contributions advocate for a data management approach that not only preserves data for future use but also ensures it remains ethically and effectively employable for training.
4. DATA MANAGEMENT CHALLENGES IN LARGE LANGUAGE MODELS
The evolution of LLMs introduces a complex spectrum of data management challenges, necessitating advanced strategies for efficient data organization and accessibility due to the engagement with extensive and intricate datasets [2]. The importance of high-quality data cannot be overstated; representative, well-curated datasets are essential for developing models that are ethically sound and broadly applicable [4]. Addressing privacy and ethical concerns is essential, requiring rigorous data governance to adhere to ethical norms and protect individual privacy [5]. Furthermore, the precision of data annotation and labeling is important to ensure model reliability, calling for standardized and transparent practices [48, 49]. Balancing data accessibility with the protection of proprietary information is crucial, a task that must be aligned with legal and ethical standards [20]. Moreover, compliance with data protection laws is critical for legal conformity and upholding the ethical integrity of AI technologies, which is fundamental to sustaining public trust [19]. A structured overview of these data management challenges and requirements in LLM development is provided in Figure 2, with a summary given below.
Understanding the Context: Improving algorithms and reasoning is essential for enhancing contextual awareness in LLMs [50]. The capability of these models to understand and interpret complex contexts significantly contributes to their practical utility. Developing algorithms that grasp linguistic nuances, cultural contexts, and the implications of language use is increasingly necessary.
Accuracy and Reliability: The tendency of LLMs to sometimes produce incorrect information highlights the need for effective fact-checking mechanisms [51]. To increase the reliability of these models, it is vital to implement verification algorithms and data validation techniques that ensure their outputs are accurate and trustworthy.
Ethical and Fair Use: Mitigating biases in LLM outputs is a significant concern [5, 11, 52]. Developing algorithms for detecting and correcting biases in training data and model outputs is essential for fair use [53]. Efforts must focus on incorporating diverse perspectives to ensure ethical use and fairness in model responses.
Interactivity and Personalization: The general nature of responses from current LLMs indicates the importance of developing adaptive learning algorithms [54]. Such algorithms should learn from user interactions, preferences, and feedback to provide personalized responses [55]. The aim is to create models that tailor their responses to individual user needs, improving personalization and user experience.
Language and Cultural Sensitivity: Addressing the underrepresentation of low-resource languages is crucial for increasing the inclusivity of LLMs [6]. It is important to enhance language model diversity and cultural dataset representation [56]. Expanding datasets to include more languages and ensuring models respect cultural contexts are key steps toward global inclusivity.
Efficiency and Scalability: The significant computational demands of LLM development present a notable challenge [2]. Developing optimization strategies to reduce computational requirements without compromising performance is necessary [57]. This may involve new model architectures, data processing methods, and hardware efficiencies, facilitating more scalable LLM deployment.
Research and Development Directions: Engaging in collaborations with academic institutions, continuous model training, and adopting user-centric design philosophies are crucial for the ongoing enhancement of LLMs. These efforts are aimed at ensuring models meet the diverse and changing needs of their users.
Mapping FAIR principles to Data Management Challenges in LLMs
The challenges in managing data for LLMs are complex, yet the FAIR data principles offer targeted solutions that can alleviate some of these issues. For instance, Findability enhances the process of identifying pertinent data within vast datasets through the use of enriched metadata and persistent identifiers, streamlining data retrieval and analysis. Interoperability facilitates the seamless integration of varying data formats and systems, important for LLM training. This principle also aids in developing algorithms that are more contextually aware by allowing for the combination of different data types and sources, thereby leading to more nuanced and accurate model outputs.
Accessibility ensures data is readily available and access is controlled appropriately, allowing for the responsible sharing of data. This principle balances the promotion of open access to data, which fosters innovation and collaboration, against the necessity of safeguarding sensitive information and intellectual property. Reusability amplifies the longevity and utility of data beyond its original application. By adhering to legal and regulatory standards, reusability ensures that data remains applicable, reliable, and ethically sound for future endeavors. A detailed mapping of these data management challenges in LLMs to the FAIR data principles is provided in S2 Appendix B.
5. FRAMEWORK FOR FAIR DATA PRINCIPLES INTEGRATION IN LLM DEVELOPMENT
Rationale for Integrating FAIR Data Principles in LLM Development
Our research is dedicated to enhancing the training of LLMs by embedding FAIR data principles directly into their training methodologies. The main objective is to develop training datasets for LLMs that are designed to comply with FAIR principles. To realize this, we have devised a framework that interweaves FAIR principles throughout the entire model development lifecycle of LLMs, as depicted in Figure 3. In the following sections, we will elaborate on each phase of this proposed LLM lifecycle, connecting each step specifically to our case study which concentrates on the creation of a FAIR-compliant dataset.
Case Study: Developing a FAIR-Compliant Dataset
To address the risk of LLMs perpetuating societal biases, our case study focuses on developing a dataset with a conscientious design that proactively identifies biases within the data prior to the training of these models. Adopting FAIR principles in preparing data for LLMs training is essential, though not a full “FAIRification” of LLMs. This step is crucial for enhancing data quality and model performance, paving the way towards more efficient and ethical AI development. The dataset and model card are available here.
This case study identifies “bias” in an NLP dataset as a linguistic tendency leading to the unfair representation of specific groups, manifesting as linguistic bias, stereotypes, toxicity, or misinformation [58-62]. Such biases often stem from data collection, processing, and usage methods, potentially causing LLMs to replicate or amplify these biases [63]. Our analysis covers various bias dimensions, illustrated in Figure 4, encompassing ageism, occupational jargon, and political rhetoric, among others. This dataset exemplifies how to construct diverse training datasets for LLMs in adherence to FAIR principles.
Data Collection and Curation for LLM Development. FAIR Principles Achieved: Findability, Accessibility, Interoperability, and Reusability.
In line with the FAIR data principles, our dataset is structured to enhance its Findability, featuring comprehensive metadata to facilitate easy discovery. This dataset, sourced from a variety of channels such as news feeds and websites between January and May 2023, encompasses over 50,000 filtered entries. For data curation, we utilized various feeds and hashtags, including #MediaBias, #SocialJustice, #GenderEquality, #RacialInjustice, #CulturalDiversity, #AgeismAwareness, #ReligiousTolerance, and #EconomicDisparity, to ensure a wide representation of social issues. This diversity of news articles not only broadens accessibility but also enriches the dataset's relevance to current social discourses.
The detailed metadata includes crucial information such as the dataset title, description, authors, date of creation, version, and keywords that reflect the dataset's scope and purpose. These keywords (e.g., LLMs, Training, Biases, News Media, NLP) are strategically chosen to enhance the dataset's retrieval efficiency across various research portals. Additionally, the metadata specifies the data type, whether textual, numerical, or otherwise, further aligning with the Interoperability principle by facilitating data integration across different systems.
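A minimal machine-readable version of such a metadata record might look as follows. The field names and values here are our own illustrative choices, loosely following Dublin Core / DataCite conventions rather than reproducing the dataset's actual record.

```python
import json

# Illustrative dataset-level metadata record; all values are examples only.
metadata = {
    "title": "FAIR-compliant news dataset for bias detection",
    "description": "News entries annotated for linguistic bias, "
                   "toxicity, stereotyping, and harm.",
    "creators": ["Doe, Jane"],          # hypothetical author
    "created": "2023-05-31",
    "version": "1.0.0",
    "keywords": ["LLMs", "Training", "Biases", "News Media", "NLP"],
    "resource_type": "textual",
    "formats": ["csv", "json"],
}

# Serializing to JSON makes the record harvestable by search portals
# and metadata aggregators.
print(json.dumps(metadata, indent=2))
```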
We applied the Gunning Fog readability test to determine the level of education required to comprehend the texts in our dataset. As illustrated in Figure 5, the distribution of Gunning Fog Index scores in our dataset is approximately normal with a mean score of 7.79. This indicates that the majority of our texts are suitable for readers with at least an eighth-grade education level. Such readability analysis is instrumental in aligning our dataset with the FAIR principle of Accessibility, ensuring that the content is comprehensible to the intended audience and thus enhancing the usability of the dataset in training LLMs.
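The Gunning Fog Index is computed as 0.4 × (average sentence length + percentage of words with three or more syllables). The sketch below implements this formula with a simple vowel-group heuristic for syllable counting, so scores are approximate rather than exact.

```python
import re

def count_syllables(word: str) -> int:
    # Rough heuristic: count groups of consecutive vowels (incl. "y").
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def gunning_fog(text: str) -> float:
    """Gunning Fog Index: 0.4 * (words/sentence + 100 * complex/words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    complex_words = [w for w in words if count_syllables(w) >= 3]
    return 0.4 * (len(words) / len(sentences)
                  + 100 * len(complex_words) / len(words))

sample = "The cat sat on the mat. It was a sunny day."
print(round(gunning_fog(sample), 2))
```

A corpus-level mean, as reported in Figure 5, is then just the average of this score over all entries.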
Annotations. The process of annotating labels and debiasing text initially utilized GPT-3.5 for an automated preliminary screening, efficiently identifying potential bias, toxicity, stereotyping, and harm. Recent work [64] indicates that AI can excel in labeling tasks, outperforming traditional crowd-sourced methods; in our case, we add an extra layer of human review to balance quality with speed. The automated screening facilitated the early detection of content requiring human review. Following this, a team of 15 experts and students across various disciplines crafted comprehensive annotation guidelines emphasizing accuracy and sensitivity. These guidelines assist annotators in conducting a thorough review to mitigate biases, ensuring content fairness and respectfulness.
Annotators engage in a nuanced review process, identifying and revising instances of bias, toxicity, stereotyping, and harm across a wide range of attributes, guided by principles of neutrality, respect, and inclusivity. This detailed annotation effort is supported by a commitment to ongoing feedback, expert reviews, and inter-annotator agreement (IAA) checks, ensuring high-quality annotations evidenced by a Cohen's Kappa score above 0.75. Examples of IAA scores across the main bias dimensions are provided in S3 Appendix C, illustrating the robustness of our annotation methodology.
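The IAA check can be sketched as a standard Cohen's Kappa computation for two annotators over nominal labels; this is a generic implementation for illustration, not the paper's exact tooling:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's Kappa for two annotators' label sequences a and b."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    # Observed agreement: fraction of items both annotators label alike.
    po = sum(x == y for x, y in zip(a, b)) / n
    # Expected chance agreement from each annotator's label frequencies.
    ca, cb = Counter(a), Counter(b)
    pe = sum((ca[lbl] / n) * (cb[lbl] / n) for lbl in set(ca) | set(cb))
    return (po - pe) / (1 - pe)
```

Scores above 0.75, as reported here, are conventionally read as substantial agreement.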
The robustness of our dataset is established through a dual-stage quality assessment: automated analysis for an initial screening of 10,000 statistically significant entries, and expert review of 500 stratified random samples for deeper insights. This method ensures the dataset's accuracy, completeness, and consistency, as detailed in Table 3.
| Aspect | Automated Analysis | Expert Review |
|---|---|---|
| Accuracy | 96.5% | Contextually accurate |
| Completeness | 93.0% | High with occasional updates |
| Consistency | 98.0% | Uniform across samples |
For Interoperability, the data is formatted for seamless integration into various ML tasks, such as binary and multi-label classification, question answering (QA), and debiased language generation (debiasing), as detailed in Table 4. This enables researchers to use the data in diverse analytical contexts, facilitating cross-domain research and development. The dataset schema and examples of dataset formats are given in S4 Appendix D.
| Data Format | Model Type |
|---|---|
| Classification Format | Binary Classifier |
| CoNLL Format | Multi-label Token Classification |
| SQuAD Format | Question Answering System |
| Counterfactual Formatting | Debiasing Model |
| Sentiment Analysis Format | Sentiment Classifier |
| Toxicity Classification Format | Toxicity Classifier |
Furthermore, we have stored our dataset in repositories such as Huggingface, Zenodo, and Figshare, which not only adhere to metadata standards but also ensure its long-term Accessibility. Overall, our data collection and curation strategy aligns with the FAIR principles and enhances the practical utility of our dataset for LLM development.
Model Training and Algorithm Development. FAIR Principles Achieved: Interoperability and Reusability.
The model training and algorithm development phase is crucial for adhering to FAIR principles, particularly Interoperability and Reusability. We strategically deploy various LLMs, fine-tuned on a statistically significant sample of 10,000 entries from our main dataset, for tasks ranging from sentiment analysis and QA to debiasing (language generation), and develop them with a modular design for enhanced Reusability across projects.
Interoperability is achieved by using common frameworks like TensorFlow and PyTorch and standardizing data formats for inputs and outputs. Every task performed by LLMs, such as classification and debiasing, is executed following the preparation of specific training formats tailored to each task. This ensures seamless integration of our models with various systems and datasets. We further enhance this with the inclusion of model cards for each LLM, providing essential information such as model purpose, performance, and usage guidelines, thus aiding in understanding and adoption.
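A model card of the kind described can itself be kept as structured data alongside the model weights. The field names and layout below are an illustrative sketch rather than the paper's exact template; the performance figures are taken from the binary bias classifier row of Table 5:

```python
import json

# Illustrative model card for one fine-tuned classifier. Field names are
# a plausible sketch; the metrics match Table 5's binary bias classifier.
model_card = {
    "model_name": "binary-bias-classifier",
    "base_model": "bert-large-uncased",
    "purpose": "Detect biased vs. non-biased statements in news text",
    "framework": "PyTorch",
    "input_format": {"Text": "str"},
    "output_format": {"Bias Label": ["BIASED", "NON-BIASED"]},
    "performance": {"accuracy": 0.92, "precision": 0.89, "f1": 0.90},
    "usage_guidelines": "Not intended for automated moderation without "
                        "human review.",
}

print(json.dumps(model_card, indent=2))
```

Shipping such a record with every model gives downstream users the purpose, performance, and usage information the text above calls for, in a form other systems can parse.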
We prepare API documentation that provides details for every aspect of training and development, including parameter settings and algorithm modifications. This transparency facilitates scientific validation and progress [16]. We also foster community collaboration by sharing our findings, models, and tools.
Model Evaluation and Validation. FAIR Principles Achieved: Reusability and Accessibility.
In our evaluation and validation phase, we prioritize the FAIR principles of Reusability and Accessibility to ensure that our models and findings can be widely utilized and understood. Transparency in reporting and the provision of accessible documentation are key aspects of our methodology. All results, including bias analysis and model benchmarking, are made publicly available, allowing for comprehensive observation and application by the wider research community. An example of this approach is presented in our detailed bias analysis (Figure 6) and the benchmarking table (Table 5), which illustrate our commitment to these principles.
| Model (Type) | Accuracy | Precision | F1-Score |
|---|---|---|---|
| Binary Bias Classifier (BERT-large-uncased) | 92% | 89% | 90% |
| NER Model (RoBERTa-large-uncased) | 95% | 93% | 94% |
| QA System (BERT-base-uncased) | 88% | 87% | 86% |
| Sentiment Classifier (BERT-large-uncased) | 90% | 91% | 90% |
| Toxicity Classifier (BERT-large-uncased) | 91% | 90% | 91% |
| LLaMa2 (7b-chat-hf) | 93% | 94% | 93% |
| Metric | Pre-Debiasing | Post-Debiasing |
|---|---|---|
| Toxicity Score | 75% | 24% |
| Bias Classification Score | 80% | 55% |
| Sentiment Score | -0.3 | 0.2 |
In Table 5, we present performance metrics across various tasks, demonstrating the effectiveness of our classifiers in toxicity detection, bias classification, sentiment analysis, multi-label token classification, and QA. Additionally, we integrated a debiasing process using LLaMa2 7B Chat [65], re-evaluating our models on the same test set post-debiasing. This allowed us to assess the impact of debiasing on reducing toxicity and bias, and its influence on sentiment. The debiasing process reduced the toxicity score from 75% to 24% and the bias classification score from 80% to 55%, and shifted mean sentiment from -0.3 to 0.2, indicating the efficacy of the intervention.
Overall, the evaluation and validation phase emphasizes transparent reporting and the public availability of models and detailed results, aligning with the FAIR principles of Reusability and Accessibility.
Deployment and Ongoing Monitoring. FAIR Principles Achieved: Accessibility, Reusability, and Findability.
In the post-deployment phase, we emphasize the FAIR principles of Accessibility, Reusability, and Findability. Our approach includes providing comprehensive documentation and clear usage guidelines to enhance the models' Findability and Accessibility. We employ version control systems to ensure easy tracking and retrieval of the latest model versions, supporting both Findability and Reusability.
To facilitate user engagement and model effectiveness, we offer extensive training materials and support, making the models more accessible to a diverse user base. Regular reporting of performance metrics offers insights into the models' ongoing effectiveness and areas for improvement, aiding in their Reusability. We also provide integration support with various tools and platforms, ensuring that our models can be easily adopted in different technological contexts. These practices collectively ensure that our deployed models adhere to the FAIR principles, maintaining their utility and efficacy in a dynamic, user-centric environment.
Community Engagement and Collaborative Development. FAIR Principles Achieved: Findability, Accessibility, Interoperability, and Reusability.
In our efforts to foster community engagement and collaborative development, we align with the FAIR principles effectively. We embrace open-source development to facilitate collaborative innovation. Our datasets are provided in accessible formats such as JSON and CSV and are indexed with rich metadata. This approach ensures that our LLMs are developed in environments conducive to collaborative progress. The datasets are disseminated under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license [66], aligning our work with the FAIR principles of data stewardship.
Upholding FAIR Principles. We provide a comprehensive analysis in Table 6, showcasing our commitment to upholding the FAIR data principles to enhance the dataset's usefulness for both present and future research endeavors.
| Principle | FAIR Compliance Criteria | Met | Remarks (Implementation Details) |
|---|---|---|---|
| Findability | F1: Rich, descriptive metadata. F2: Standardized data indexing. F3: Documented data sources. F4: Persistent identifiers for data and models | √ | F1: Metadata includes descriptions for data and models. F2: Data and models indexed with relevant keywords. F3: Clear documentation of data sources and model development. F4: Unique identifiers for data and models. |
| Accessibility | A1: Clearly defined access protocols. A2: Long-term data and model preservation. A3: Data and models available in accessible formats | √ | A1: Open-access repository for data and models. A2: Long-term preservation of data and models. A3: Data in CSV, JSON, XML; models accessible via APIs. |
| Interoperability | I1: Use of standard data and model formats. I2: Adherence to data and model exchange standards | √ | I1: Data in standard formats; models compatible with common ML frameworks. I2: Adheres to standard data and model exchange protocols. |
| Reusability | R1: Detailed metadata for data and models. R2: Adherence to ethical and transparency standards. R3: Licensing information for data and model usage | √ | R1: Metadata includes usage guidelines for data and models. R2: Focus on ethical data use and transparent model training. R3: Data licensed under CC BY-NC 4.0; models with open-source licenses. |
6. DISCUSSION
6.1 Limitations of the FAIR Data Principles in Addressing Data Challenges
The FAIR data principles are a significant step towards enhancing the openness and efficiency of data usage in research and related fields. However, their effectiveness is constrained by several factors. Primarily, while the FAIR principles improve data accessibility and usability, they do not automatically guarantee data quality or validity. The absence of mechanisms for ensuring accuracy, completeness, or reliability could lead to the spread of low-quality data, negatively impacting research and decision-making. Additionally, FAIR principles, though encouraging data sharing, may not adequately address the ethical and privacy concerns associated with sensitive data like health records or personal information, which could result in privacy violations.
Furthermore, the implementation of FAIR principles requires considerable resources, including necessary infrastructure and expertise. This poses a significant challenge, especially for smaller institutions or those in developing countries, potentially exacerbating the digital divide. The broad application of FAIR principles may not be suitable for all scientific disciplines, as different fields might require tailored data management approaches not fully encompassed by these general guidelines. The effectiveness of these principles also depends heavily on the awareness and training of data handlers, where a lack of such training can be a major barrier to adoption. Moreover, the absence of universally accepted standards for evaluating 'FAIRness' leads to an inconsistent application of these principles.
The diversity of data formats creates technical challenges in achieving seamless interoperability, and the emphasis on data sharing might overshadow other vital aspects of data stewardship, such as privacy considerations. Finally, the principles can conflict with intellectual property rights and commercial interests: the concept of free data access and reuse might challenge proprietary research, leading to resistance from certain sectors. This necessitates compliance with various legal and regulatory frameworks, adding another layer of complexity to the application of FAIR principles.
6.2 Future Directions
Strategies for Mitigating Limitations and Enhancing Data Utility
In addressing the limitations of FAIR principles and enhancing data utility, a multifaceted strategy is essential. This involves refining data quality through rigorous validation processes, ensuring accuracy and reliability. Simultaneously, balancing data accessibility with ethical and privacy considerations is crucial; this requires proper anonymization techniques and strict privacy protocols. Emphasizing data preservation and sustainability alongside sharing broadens the scope of stewardship.
Limitation of the Study and Future Perspectives
Our FAIR-compliant dataset, designed to identify and mitigate known biases, may inadvertently overlook emerging biases, thereby emphasizing the need for ongoing revision and vigilant monitoring [63]. Concurrently, scaling the dataset to match the complexity of advanced LLMs, while adhering to ethical standards, emerges as a substantial challenge. This endeavor is compounded by the imperative to mitigate biases in model interpretations, bolster interpretability, and ensure robust data privacy and security [67, 68]. Therefore, future research directions should consider creating dynamic mechanisms for dataset updates [69], advancing bias mitigation techniques, aligning datasets with evolving LLM technologies, exploring scalable dataset maintenance solutions, and broadening the spectrum of LLM applications. Integral to this journey is the development of comprehensive ethical guidelines and the fostering of collaborative research, both of which are pivotal for the responsible evolution of LLM technology.
7. CONCLUSION
Our study incorporates FAIR data principles into LLM training and development, improving data management and model training. This approach has yielded a versatile dataset that demonstrates the process of integrating FAIR principles into dataset construction and that can be used for LLM training. While our dataset and case study offer guidance, they do not fully tackle the 'FAIRification' of LLMs. Advancements in LLM research must focus on data and the applicability of FAIR principles, including updating datasets to capture emerging trends, enhancing bias detection, adapting to novel LLM architectures, improving model interpretability, and formulating ethical guidelines. Our efforts contribute to the responsible advancement of AI, aiming to forge more ethical and efficient AI tools that serve diverse communities.
AUTHOR CONTRIBUTIONS
This research was conducted by the first and corresponding author, Dr. Shaina Raza (shaina.[email protected]). Dr. Shaina Raza is responsible for formulating the research questions, collecting and analyzing the data, and drafting the manuscript. Dr. Deval Pandya and Dr. Chen Ding contributed to shaping the research questions and played a crucial role in designing and refining the manuscript. Shardul Ghuge helped in building models on the data, and Dr. Elham Dolatabadi assisted in the critical analysis of the models' results and detailed review. All authors have read and approved the manuscript.
DATA AVAILABILITY STATEMENT
The data sets generated during and/or analyzed during the current study are available in the BiasScan (https://huggingface.co/collections/newsmediabias/biasscan-659d681ed7a5bc9d98cde11b).
ACKNOWLEDGEMENTS
Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute.
‘LLMs’ here includes both ‘LMs’ and ‘LLMs’, with ‘LLMs’ representing more advanced versions.
For hyperlinks to the technologies and APIs referenced in the table, refer to S1 Appendix A.
REFERENCES
APPENDICES
A. Acronyms and Full Forms
Academia.edu, Academia.edu; AI4ALL, AI4ALL; Algolia, Algolia; Amazon S3, Amazon Simple Storage Service; Apache Atlas, Apache Atlas; Apache Lucene, Apache Lucene; Apache NiFi, Apache NiFi; Apache Solr, Apache Solr; Archivematica, Archivematica; CKAN, Comprehensive Knowledge Archive Network; Clockss, Controlled LOCKSS; Collibra, Collibra; Creative Commons, Creative Commons; Crossref, Crossref; DSpace, DSpace; Dublin Core, Dublin Core Metadata Initiative; Data Provenance Tools, Data Provenance Tools; DataCite, DataCite; Dataverse, Microsoft Dataverse; EPrints, EPrints; Ecoinformatics, Ecoinformatics; Elasticsearch, Elasticsearch; FGED, Functional Genomics Data Society; Figshare, Figshare; GDPR, General Data Protection Regulation; Google Cloud Storage, Google Cloud Storage; GraphQL, GraphQL; HL7 FHIR, Health Level Seven Fast Healthcare Interoperability Resources; IEDB, Immune Epitope Database; ISO/IEC, International Organization for Standardization/International Electrotechnical Commission; LOCKSS, Lots of Copies Keep Stuff Safe; NCBI, National Center for Biotechnology Information; OAI-PMH, Open Archives Initiative Protocol for Metadata Harvesting; Omeka, Omeka; OneTrust, OneTrust; OpenAI Ethics Guidelines, OpenAI Ethics Guidelines; OpenAPI, OpenAPI Initiative; OpenRefine, OpenRefine; ORCID, Open Researcher and Contributor ID; OWL, Web Ontology Language; Portico, Portico; PROV-DM, Provenance Data Model; RDF, Resource Description Framework; RE3data, Registry of Research Data Repositories; ResearchGate, ResearchGate; Responsible AI, Responsible AI; REST, Representational State Transfer; schema.org, Schema.org; SOAP, Simple Object Access Protocol; SPARQL, SPARQL Protocol and RDF Query Language; Talend, Talend; TrustArc, TrustArc; XOD, eXtensible ontology development; XSLT, Extensible Stylesheet Language Transformations; Zenodo, Zenodo; gRPC, gRPC Remote Procedure Calls.
B. Data Management Challenges in LLMs and Corresponding FAIR Principles
| Data Management Challenge | FAIR Principle Addressed |
|---|---|
| Vast and Complex Datasets | Findability through detailed metadata and persistent identifiers to enhance data discovery. |
| Data Quality and Bias | Accessibility with ethical access protocols to provide high-quality, unbiased data. |
| Privacy and Ethical Concerns | Reusability with clear legal and ethical documentation to uphold privacy and ethics. |
| Data Annotation and Labeling | Interoperability through standardized data formats for reliable annotation and labeling. |
| Data Accessibility and Sharing | Accessibility to ensure a balance between open data sharing and protection of proprietary information. |
| Legal and Regulatory Compliance | Reusability to align data management with legal standards for future use. |
| Contextual Awareness | Interoperability for enhanced NLP algorithms to accurately capture and interpret context. |
| Accuracy and Reliability | Accessibility and Reusability to ensure mechanisms for improved fact-checking and consistent data quality. |
| Ethical and Fair Use | Accessibility with advanced bias detection algorithms to promote fairness. |
| Interactivity and Personalization | Findability and Accessibility for adaptive learning from user interactions for personalized experiences. |
| Language and Cultural Sensitivity | Interoperability to support expanded language models and cultural datasets for inclusivity. |
| Technical and Scalability | Reusability for efficient processing strategies that facilitate the scalable use of computational resources. |
C. Inter-Annotator Agreement
Figure 1 depicts the consensus among experts on different dimensions of bias, with the most significant agreement observed in the area of ‘Socioeconomic Bias’. This figure highlights the level of concordance among annotators regarding the various aspects of bias.
D. Dataset Schema and Formats
The dataset has multiple attributes: Text, Dimension, Biased Words, Aspect, Bias Label, Sentiment, Toxic, Identity Mention, and Debiased Text. An example record:
Text: Lawyers are always manipulative and cannot be trusted.
Dimension: Professional Integrity
Biased Words: always, manipulative, cannot be trusted
Aspect: Trustworthiness of lawyers
Bias Label: BIASED
Sentiment: Negative
Toxic: Yes
Identity Mention: Lawyers
Debiased Text: Trustworthiness is an individual trait and varies among professionals, including lawyers.
We provide examples for each data format, demonstrating the use of dataset attributes in diverse machine learning models:
| Format | Example |
|---|---|
| Classification | {Text: Politicians are often corrupt., Bias Label: BIASED, Sentiment: Negative} |
| CoNLL | {Text: Doctors are always caring., Biased Words: [always, caring], Identity Mention: Doctors} |
| SQuAD | {Text: Why are lawyers untrustworthy?, Aspect: Trustworthiness, Debiased Text: Trustworthiness varies individually.} |
| Counterfactual | {Text: Young people are irresponsible with money., Biased Words: [irresponsible], Debiased Text: Financial habits vary by individual.} |
| Sentiment/Toxicity | {Text: Politicians are dishonest., Sentiment: Negative, Toxic: Yes} |
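To illustrate, the example record above can be mapped into two of the task-specific training formats from Table 4. The helper functions and exact field layouts below are an illustrative sketch, not the dataset's release tooling:

```python
# The Appendix D example record, using the dataset's attribute names.
record = {
    "Text": "Lawyers are always manipulative and cannot be trusted.",
    "Dimension": "Professional Integrity",
    "Biased Words": ["always", "manipulative", "cannot be trusted"],
    "Aspect": "Trustworthiness of lawyers",
    "Bias Label": "BIASED",
    "Sentiment": "Negative",
    "Toxic": "Yes",
    "Identity Mention": "Lawyers",
    "Debiased Text": "Trustworthiness is an individual trait and varies "
                     "among professionals, including lawyers.",
}

def to_classification(rec):
    # Binary bias classifier input: text plus label (and sentiment).
    return {"Text": rec["Text"],
            "Bias Label": rec["Bias Label"],
            "Sentiment": rec["Sentiment"]}

def to_counterfactual(rec):
    # Debiasing model input: biased text paired with its rewrite.
    return {"Text": rec["Text"],
            "Biased Words": rec["Biased Words"],
            "Debiased Text": rec["Debiased Text"]}

print(to_classification(record))
print(to_counterfactual(record))
```

Analogous projections produce the CoNLL, SQuAD, and sentiment/toxicity variants, which is what lets one annotated corpus serve all the model types in Table 4.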