Abstract
The sparsity of labeled data is an obstacle to the development of Relation Extraction (RE) models and the completion of databases in various biomedical areas. Although of high interest for drug discovery, the literature on natural products, reporting the identification of potential bioactive compounds from organisms, is a concrete example of such an overlooked topic. To mark the start of this new task, we created the first curated evaluation dataset and extracted literature items from the LOTUS database to build training sets. To this end, we developed a new sampler, inspired by diversity metrics in ecology, named the Greedy Maximum Entropy sampler (https://github.com/idiap/gme-sampler). The strategic optimization of both the balance and the diversity of the selected items in the evaluation set is important given the resource-intensive nature of manual curation. After quantifying the noise in the training set, in the form of discrepancies between the text of input abstracts and the expected output labels, we explored different strategies accordingly. Framing the task as end-to-end Relation Extraction, we evaluated the performance of standard fine-tuning (BioGPT, GPT-2, and Seq2rel) and of few-shot learning with open Large Language Models (LLMs) (LLaMA 7B-65B). Beyond their evaluation in few-shot settings, we explored the potential of open LLMs as synthetic data generators and propose a new workflow for this purpose. All evaluated models exhibited substantial improvements when fine-tuned on synthetic abstracts rather than on the original noisy data. We provide our best-performing (F1-score = 59.0) BioGPT-Large model for end-to-end RE of natural products relationships, along with all the training and evaluation datasets. See more details at https://github.com/idiap/abroad-re.
1 Introduction
The biomedical literature constitutes a vast but still underexploited reservoir of knowledge, the growth of which reflects the expansion of topics and areas of application. However, the diversity and morphological richness of bio-entities, and the complexity of the relationships expressed between them, contrast with the sparsity of the available labeled data. While some domains can already benefit from efficient extraction models (e.g., chemical–disease relationships) for database completion, less popular domains, like the literature on natural products (NP), are often overlooked. NPs are chemical compounds produced by living organisms (plants, bacteria, fungi, etc.), exhibiting a wide range of structures and functions and offering a vast reservoir of potential therapeutic molecules. The isolation and identification of NPs are primarily reported in the scientific literature and also disseminated in different public databases (e.g., COCONUT Sorokina et al. 2021; KNApSAcK Shinbo et al. 2006). Recently, the LOTUS initiative (Rutz et al. 2022) has successfully established an Open and FAIR standard resource for natural products chemistry through a rigorous harmonization of a heterogeneous set of databases. However, the extent of the NP landscape is not reflected by the content of the databases, which are incomplete and exhibit an imbalanced coverage toward model organisms (e.g., A. thaliana). While a significant portion of the existing literature remains unannotated, there is also a continuous surge of new publications reporting novel relationships that could contribute to filling this gap.
Enriching such knowledge bases requires jointly performing Named Entity Recognition (NER) and Relation Extraction (RE). In this case, NER is defined as a sub-task that consists of identifying the boundaries and classifying the type of named entities (e.g., an organism “Isaria sinclairii” and a chemical “fingolimod”1). The subsequent RE step is the semantic classification of the relations between two (or more) entities. To complete NP databases, the objective is to extract the “produces” or “is isolated from” relationships between organisms and chemicals. Note that other types of relationships can also be expressed, such as “inhibits the growth of”. Traditional deep learning models exhibiting SOTA performance on NER and RE (separately or in so-called end-to-end models) rely on a large set of labeled data (Luo et al. 2022b; Giorgi, Bader, and Wang 2022; Wang et al. 2020). However, while datasets like Linneaus (Gerner, Nenadic, and Bergman 2010) have been successfully applied to organism recognition, existing chemical NER datasets, such as CHEMDNER (Krallinger et al. 2015), do not provide sufficient coverage of the NP literature and do not adequately capture its morphological specificities. Along with the typically long systematic names of metabolites (e.g., 3’-[gamma-hydroxymethyl-(E)-gamma-methylallyl]-2,4,2’,4’ - tetrahydroxychalcone 11’-O-coumarate2), many chemical mentions take the form of co-joined enumerations, in which entities appear as non-continuous strings such as “cystodiones A-D” or “wortmannins C and D”; these mentions are particularly frequent. They must be correctly identified and expanded to recover the full list of entities, which also adds complexity to the decoding process. Finally, to the best of our knowledge, no datasets are available for the subsequent RE step (Luo et al. 2022a). The aforementioned constraints are frequently encountered in BioNLP when venturing beyond the well-studied chemical–disease associations or protein–protein interactions.
Meanwhile, the abundance of unlabeled textual data has been instrumental in driving recent breakthroughs in representation learning (Wysocki et al. 2023) and the development of the foundational models (e.g., GPT and LLaMA model families). The zero/few-shot learning capabilities of Large Language Models (LLMs) (Kojima et al. 2023; Brown et al. 2020) make them serious candidates for performing a task with only a handful of examples. Moreover, conversation (chatbots) and instruction-tuned models (Zhang et al. 2023) also represent a promising opportunity for synthetic data generation to alleviate the main problem, namely, the lack of labeled data within the target distribution. Indeed, beyond the sophistication of model architectures, data availability and quality are limiting factors for the extraction performance, but often neglected (Sambasivan et al. 2021).
In order to address these scarcity constraints, we propose an end-to-end generative extraction paradigm with two novel methodological contributions. Firstly, we introduce a diversity-optimized sampling strategy that minimizes the number of items to select for the parsimonious creation of evaluation gold standards and training sets. This component reduces the popularity biases and the associated imbalance toward entities that are over-represented in the literature (e.g., model organisms and recurring substances), providing a systematic (entropy-based) method to maximize diversity and measure the utility of new annotations. Secondly, we use the generative expressivity of models fine-tuned on conversations and instructions to create within-distribution synthetic data that supports the construction of end-to-end joint NER-RE extraction models. In this framework, the diversity-sampled entities and associated relations are linguistically embedded within synthetically generated text. The overall framework is depicted in Figure 1.
More formally, this article aims to investigate the following research hypotheses (RHs) as supporting mechanisms for addressing these limitations, using NP as a validation domain:
RH1: Diversity-optimized sampling provides a valuable selection of items to build training and evaluation datasets for RE.
RH2: In a practical scenario with noisy labels, LLMs can be more beneficial as a synthetic data generator than unsupervised predictors.
1.1 Related Work
Biomedical RE (Shahab 2017; Zhao et al. 2020, 2023) encompasses various subtypes depending on the considered bio-entities, with drug–drug interactions (Zhang, Leng, and Liu 2020), chemical–disease relationships (Li et al. 2016), gene–disease associations (Su et al. 2021), and protein–protein interactions (Ahmed et al. 2019) among the most popular. Investigating the overlooked NP relationships necessitated the exploration of several interconnected sub-tasks, including the selection and partitioning of a dataset, the generation of synthetic data, and the assessment of various end-to-end RE strategies. This section provides a review of the closely related works that align with these three development axes.
1.1.1 Splitting Datasets and Impact of Diversity
Data selection and partitioning methods can significantly impact the generalization performance of supervised models. Xu and Goodacre (2018) evaluated various splitting techniques, including K-S (Kennard and Stone 1969) and SPXY (Galvao et al. 2005), and emphasized the importance of maintaining a balance between training and test sets for a reliable evaluation of models. Like the recently proposed SPlit (Joseph and Vakayil 2022) method, these approaches aim to select a representative subset of the data, leveraging different distance metrics. Unlike the Euclidean or energy-based distances used in aforementioned methods, the Greedy Maximum Entropy (GME)-sampler uses an entropy-based metric to capture diversity and select representative evaluation and training sets. Although these distance-based methods share a common objective, they were initially designed to work with continuous variables, rather than categorical variables, such as large sets of organisms and chemicals. Moreover, to the best of our knowledge, no method has been specifically developed to sample documents reporting N-ary relations for the purpose of building NER/RE datasets. The GME-sampler represents a first attempt to address this gap. Regarding diversity, Yu, Khadivi, and Xu (2022) investigated various diversity-based metrics for selecting training data, and demonstrated their positive impact on the performance of NER models. Additionally, other works have highlighted the significance of effective data selection over a naive increase of the dataset size for training (Axelrod, He, and Gao 2011; Fan et al. 2017; Feng et al. 2018).
1.1.2 Synthetic Data Generation
Training neural RE models strongly relies on a substantial and diverse set of training data. However, annotating large datasets with experts is time-consuming and costly. To overcome this limitation, many studies have explored approaches such as Data Augmentation (DA) (Hu et al. 2023; Feng et al. 2021; Pellicer, Ferreira, and Costa 2023) and Distant Supervision (DS) (Smirnova and Cudré-Mauroux 2018; Mintz et al. 2009), which expand the dataset size by creating new training examples from existing ones, or by assigning pseudo-labels to external, unlabeled data. In the biomedical domain, the ChemProt RE challenge (Yoon et al. 2023; Iinuma, Miwa, and Sasaki 2022) and protein–protein interaction extraction (Su et al. 2019) have recently benefited from the application of these methods. Synthetic data generation (SDG) goes beyond DA or DS by creating fully synthetic datasets, namely, paired input text and output labels. A significant body of influential works has leveraged the generative capabilities of LLMs to propose different SDG strategies in zero-shot (Ye et al. 2022; Gao et al. 2023; Schick and Schütze 2021; He et al. 2022; Wang et al. 2021; Smith et al. 2024; Meng et al. 2022; Kumar, Choudhary, and Cho 2020) and few-shot settings (Bonifacio et al. 2022; Dai et al. 2023; Meng et al. 2023; Chen et al. 2022a; Yoo et al. 2021), or by fine-tuning (Anaby-Tavor et al. 2020; Papanikolaou and Pierleoni 2020; Hartvigsen et al. 2022). Similarly to this work, Josifoski et al. (2023) also proposed to reverse the task and used LLMs from OpenAI to generate plausible input text based on expected output triplets from Wikidata. Tang et al. (2023) compared the performance of an LLM (ChatGPT) in directly extracting information from unstructured clinical text to its potential use as a synthetic data generator for DA. Veselovsky et al. (2023) evaluated various prompting strategies to improve diversity and alignment between synthetic and real-world data distributions for sarcasm detection. Yang et al. (2020) combined synthetic data generation with a diversity-augmentation component for common sense reasoning. Aggarwal, Jin, and Ahmad (2023) applied SDG to biomedical NER, while Xu et al. (2023) used a two-stage training procedure on synthetic and golden data, notably for extracting protein interactions with the ChemProt dataset. In contrast, this work proposes to leverage open LLMs to generate synthetic abstracts based on a list of verbalized main findings. The diversity of the generations is increased and guided by the entropy-based sampling of the seed articles which originally report these findings, as well as by a set of crafted patterns of expression.
1.1.3 End-to-End Relation Extraction
Kambar, Esmaeilzadeh, and Heidari (2022) classify various strategies and highlight the potential of end-to-end (or joint) NER and RE methods to overcome limitations of traditional pipeline approaches. In the biomedical domain, Li et al. (2017) proposed a Bi-LSTM for drug adverse effect extraction, while Esmail Zadeh Nojoo Kambar, Esmaeilzadeh, and Taghva (2022) introduced a graph neural network for chemical–protein interactions. Recent approaches frame the task as a generative “text-to-text” process, using sequence-to-sequence models and linearizing the expected relations into a text string to be decoded from the input. Seq2Rel (Giorgi, Bader, and Wang 2022) and REBEL (Huguet Cabot and Navigli 2021) proposed different linearization schemas, and Zhang et al. (2020) and Zeng et al. (2019) notably assessed the biases caused by the forced order of relationships during training. Hou et al. (2022) trained a sequence-to-sequence model for drug–target interaction extraction, and Zeng et al. (2018) introduced a copy mechanism. Additionally, Eberts and Ulges (2021) used four task-specific sub-components, and Paolini et al. (2021) utilized a translation mechanism. Finally, BioGPT (Luo et al. 2022b) demonstrated SOTA performance on several biomedical datasets using an autoregressive approach, providing the input text as context.
With the aim of providing an end-to-end RE model to help expand NP databases, we started by building training and evaluation datasets. Inspired by diversity metrics used in ecology, we first proposed the Greedy-Maximum-Entropy sampler (GME-sampler) to extract a diversity-optimized sample from the LOTUS database. By manually annotating the top-diverse items, we proposed the first evaluation dataset for this task, which can serve as a benchmark for future developments in this area. Following a descriptive analysis of the remaining data and a quantification of the noise present in the form of discrepancies between raw input text and annotated (standardized) labels, we evaluated various modeling approaches. First, we compared the performance of standard fine-tuning techniques on the available noisy data to the few-shot learning capabilities of open LLMs. Leveraging the generative capabilities of an LLM (Vicuna-13B), we then proposed a novel synthetic abstract generation pipeline and demonstrated the significant performance improvements (on average 24.7% in F1-score) brought by these new training data to the evaluated models. In line with these results, we have made available our best-performing BioGPT-Large model (F1-score = 59.0) and the ≈ 25,000 synthetic abstracts on which it has been trained. A schematic overview of the different strategies explored in this work is presented in Figure 2. The main contributions of this work can be summarized as follows:
A diversity-optimized sampler (GME-sampler) for building diverse and balanced datasets for RE (see https://github.com/idiap/gme-sampler).
The first curated evaluation dataset for RE between organisms and NP (see https://zenodo.org/records/8422007).
An evaluation of different strategies for RE with noisy labels.
A framework for synthetic data generation via chatbot or instruction-tuned models and the produced training datasets (see https://github.com/idiap/abroad-re and https://zenodo.org/records/8422294).
A set of ready-to-use BioGPT fine-tuned models (see https://huggingface.co/mdelmas/BioGPT-Large-Natural-Products-RE-Diversity-synt-v1.0).
2 Proposed Approach
This section describes the methodology used in this work. We start by describing our first contribution, the GME-sampler, in Section 2.1. The few-shot learning and fine-tuning strategies evaluated for the RE task are then described in Sections 2.2.1 and 2.2.2. The synthetic abstract generation procedure is described in Section 2.3. Finally, the evaluation procedure, experimental setup, and implementation details are provided in Appendix A.
2.1 Greedy Maximum Entropy Sampling (GME)
The objective is to extract a sample S of documents from an initial set D with an optimized diversity of mentioned organisms and chemicals: S ⊂ D, |S| = l and |D| = L. The initial set D corresponds to the LOTUS dataset, in which each document d reports a set of relations between organism(s) and isolated natural product(s): R_d = {r_1, r_2, …, r_{n_d}}, where n_d is the number of reported relations in d. A relation r_k = (o_i, c_j) involves the organism o_i and the chemical c_j. The sets of organisms and chemicals are denoted as O and C, respectively.
The method can also be seen as a ranking procedure, where a sample is obtained by selecting the top n ranked items. The selection of an appropriate sample size l is also a critical, but often overlooked, factor. By monitoring H_S(O) and H_S(C) during the iterative construction of S (until l = L), it is possible to determine the step l at which diversity starts to deteriorate and sampling should be stopped, that is, when newly added documents provide relationships for entities that are already frequently reported in S. The GME-sampler, initially designed for the purpose of extracting data from LOTUS, has also been implemented as a standalone library. It is proposed as a method to build samples of documents reporting N-ary relations with optimized diversity, and can be applied in alternative contexts (e.g., Pharmacogenomics: Variant–Drug–Adverse event). See the code available at https://github.com/idiap/gme-sampler.
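To make the ranking procedure concrete, the minimal Python sketch below implements a greedy maximum-entropy selection. It assumes the selection criterion is the sum of the Shannon entropies over organisms and chemicals in the growing sample; the released GME-sampler may combine the entropies differently and is considerably more optimized.

```python
import math
from collections import Counter

def shannon_entropy(counts):
    """Shannon entropy of a frequency distribution given as a Counter."""
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values() if c > 0)

def gme_rank(documents, sample_size):
    """Greedily rank documents so that each newly selected document maximizes
    the diversity of organisms and chemicals in the growing sample S.

    `documents` maps a document id to its list of (organism, chemical) relations.
    Returns the ranked document ids with the diversity score reached at each step.
    """
    remaining = dict(documents)
    ranked, org_counts, chem_counts = [], Counter(), Counter()
    while remaining and len(ranked) < sample_size:
        best_doc, best_score = None, -math.inf
        for doc_id, relations in remaining.items():
            orgs = org_counts + Counter(o for o, _ in relations)
            chems = chem_counts + Counter(c for _, c in relations)
            # Assumed selection criterion: H_S(O) + H_S(C) of the candidate sample.
            score = shannon_entropy(orgs) + shannon_entropy(chems)
            if score > best_score:
                best_doc, best_score = doc_id, score
        relations = remaining.pop(best_doc)
        org_counts.update(o for o, _ in relations)
        chem_counts.update(c for _, c in relations)
        ranked.append((best_doc, best_score))
    return ranked
```

Monitoring the scores returned at each step directly yields the entropy curves used later to choose the sample size.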
2.2 Different Strategies for Relation Extraction
2.2.1 Few-shot In-context Learning with Open LLM
In few-shot settings, the model is prompted with K input–completion example pairs and one final input, with the objective of accurately generating the completion for the final input (Brown et al. 2020). Considering the limited size of the context-window (2,048 tokens), we carefully selected K = 5 archetypical parts of diverse abstracts that exemplify various patterns and specificities of reporting NP relationships. More details in Appendix A.1.
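As a rough illustration of how such a prompt can be assembled (the exact wording used in our experiments is given in Figure A.1), the sketch below concatenates the K input–completion pairs before the final input; the "Abstract:"/"Relations:" field labels are illustrative placeholders, not the prompt actually used.

```python
def build_few_shot_prompt(examples, query_abstract, instruction):
    """Assemble a K-shot prompt: an instruction, K (abstract, linearized relations)
    pairs, and the final abstract for which the completion must be generated."""
    parts = [instruction.strip()]
    for abstract, relations in examples:  # K = 5 pairs in our experiments
        parts.append(f"Abstract: {abstract}\nRelations: {relations}")
    parts.append(f"Abstract: {query_abstract}\nRelations:")
    return "\n\n".join(parts)
```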
2.2.2 Fine-tuning
2.3 Synthetic Abstract Generation
A general overview of the synthetic abstract generation is provided in Figure 3. The goal is to leverage the generative capabilities of instruction- and conversation-tuned models to correct the discrepancies between the expected output labels and the input text. Consistency is maintained by grounding the generated abstracts on key elements from an original seed abstract: its title, keyphrases extracted from the abstract (and title), and the verbalized main findings. The extracted keyphrases are also intended to mirror the annotated MeSH descriptors, which are attached to the title and abstract in a PubMed entry. The main findings represent the set of relations {r_1, r_2, …, r_n} between organisms and NP reported in the seed article according to the LOTUS database (the expected output labels). Both the keyphrase extraction and the subsequent generation step can be framed as instruction-guided tasks: “Extract a list of keywords ...”, “Create a scientific abstract ...”. As the extracted keywords and keyphrases provide an essential context to constrain the generation of the synthetic abstracts, it is also arguably advantageous that both tasks are carried out by the same model.
The extraction of keywords (illustrated in Box A of Figure 3) consists of prompting the model to extract keywords and keyphrases from the original abstract, establishing a coherent context for the subsequent generation. However, there is a risk that certain chemicals or organisms mentioned in the original abstract may also be extracted as keywords. They could then be erroneously mixed by the model with the main findings (the expected output labels) in the generation step, resulting in abstracts with unintended relationships that were not specified in the original main findings. To alleviate this potential issue, an exclusion list is created for each input seed abstract, including organisms and chemicals annotated by LOTUS, their synonyms from PubChem, and annotations from PubTator (Wei et al. 2019). All extracted keywords matching items from this list are then excluded, as sketched below.
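A minimal sketch of this filtering step is given here, assuming exact case-insensitive matching between the extracted keywords and the exclusion list; the actual implementation may apply a more permissive matching.

```python
def filter_keywords(keywords, lotus_entities, pubchem_synonyms, pubtator_annotations):
    """Drop extracted keywords that match known organism or chemical mentions,
    so that they cannot leak into the generation step as unintended findings."""
    exclusion = {e.lower() for e in (*lotus_entities, *pubchem_synonyms, *pubtator_annotations)}
    return [kw for kw in keywords if kw.lower() not in exclusion]
```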
By explicitly formalizing the expected patterns in upstream instructions, the expression of NP relationships during the generation step can be controlled more efficiently. The findings-verbalizer module operates as a sampler to emulate and combine various patterns of expression observed in the literature. It incorporates 5 possible transformations: (1) members of the same chemical class3 can be replaced by the simple mention of the class (e.g., a list of chemicals c_{1:5} is replaced by the more concise mention “Five Meroterpenoids”); (2) lists of chemical derivatives can be contracted (e.g., “Cystodione A–D”); (3) the order of relationships is systematically shuffled; (4) chemicals can be numbered (e.g., “Cystodione A–D (1–4)”); (5) the direction of the relationship can change from “O produces C” to “C was isolated from O”. See Box B of Figure 3 and more details in Appendix A.6. These different transformations are reminiscent of the strategies commonly used in data augmentation (Feng et al. 2021).
For each input seed abstract, m instructions are sampled and assembled following this procedure and forwarded to the model for generation (Box C). See illustrative examples of abstract generation in Appendix A.7. Finally, the selector module selects the top k of the m generated abstracts, ensuring that at least a proportion q of the expected relations have the labels of the involved organisms and chemicals explicitly mentioned in the generated abstract (Box D); a sketch of this module is given after this paragraph. Regarding the expected output labels, the replacement operated by transformation (1) also applies: the initial relations r_{1:5} involving the 5 meroterpenoids are replaced by a single relation r_6 involving “Meroterpenoids” as the chemical entity. In contrast, transformation (2) does not affect the output labels, requiring the model to expand the list of relations involving each derivative (see Box D - Output labels). Also, the loss in Equation 3 (as in Seq2rel) is permutation-sensitive, but the order created by transformation (3), which is also applied to the output labels, is almost systematically respected by the model in the generated abstract, alleviating this issue. Transformations (4) and (5) have no influence on the output labels.
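The selector module can be sketched as follows. The coverage check (both entity labels literally present in the generated text) and the default value of q are simplifying assumptions for illustration.

```python
def select_top_k(candidates, relations, k=3, q=0.8):
    """Keep the k candidate abstracts with the highest coverage, discarding any
    candidate in which less than a proportion q of the expected
    (organism, chemical) pairs have both labels explicitly mentioned."""
    def coverage(text):
        text = text.lower()
        hits = sum(1 for o, c in relations if o.lower() in text and c.lower() in text)
        return hits / len(relations)
    scored = sorted(((coverage(text), text) for text in candidates), reverse=True)
    return [text for cov, text in scored if cov >= q][:k]
```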
3 Empirical Experiments
3.1 Imbalanced Repartition of Reported Relations and Coverage on Biological Kingdoms
As reported in the original release of the LOTUS dataset (Rutz et al. 2022), the imbalance in the data distribution manifests at two main levels: the repartition of the number of reported relationships per organism (respectively, chemical) and the coverage of biological kingdoms. These observations were reproduced from the latest available snapshot of the LOTUS dataset (v10-01-2023),4 containing more than 533,000 distinct relations between organisms and NP, reported from more than 88,000 articles. As expected, a small fraction of the organisms (respectively, chemicals) attracts a large proportion of the relations: more than 72% of relations involve only 20% of the organisms (Figure 4.A). Besides these Pareto distributions (Newman 2005), the imbalance in the repartition of the relations across biological kingdoms is also important: 80% are related to Archaeplastida (Figure 4.B top-left). Considering these two biases is essential to extract a valuable sample. This motivated the use of the GME-sampler in a stratified way, to maximize diversity and reduce the Pareto effect, while ensuring a more balanced coverage across biological kingdoms.
3.2 Dataset Pre-processing
The original dataset was first preprocessed and filtered prior to sampling to eliminate various sources of perturbations and unusable data in subsequent steps. Specifically, only documents with publicly available abstracts on PubMed were selected, and these were further filtered based on the number of reported relations. Indeed, a manual inspection of a subset of articles revealed that documents reporting large numbers of relations (Swainston et al. 2016; Thiele et al. 2013; Stefanini et al. 2017; Thompson et al. 2006) often correspond to genome-scale metabolic reconstructions, large screening analyses, or database releases. Although these documents may report hundreds of relationships, these are typically not expressed in the abstracts, making them useless examples for building a RE model. Only articles reporting fewer than 20 relations (corresponding to the 93rd percentile) were therefore selected. Compared to organism names, the length of chemical names can exhibit extreme variability and exceed hundreds of characters depending on the nomenclature. To mitigate the issues posed by these lengthy labels, which are unwieldy to decode and could consume an excessive portion of the context window during training and testing, only relations involving chemicals with a label length l ≤ 60 characters were retained. See more details in Appendix C.1 and the global pre-processing statistics in Table C.1. The kingdom coverage is also presented in Figure 4.B top-right.
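For illustration, the two filters can be applied as sketched below with pandas; the column names (pmid, chemical_label) are hypothetical, and the actual pipeline includes additional steps (see Appendix C.1).

```python
import pandas as pd

def preprocess(lotus: pd.DataFrame, max_relations: int = 20, max_label_len: int = 60) -> pd.DataFrame:
    """Keep relations whose chemical label is short enough, then drop documents
    reporting too many relations (screenings, metabolic reconstructions, etc.)."""
    kept = lotus[lotus["chemical_label"].str.len() <= max_label_len]
    relations_per_doc = kept.groupby("pmid")["chemical_label"].transform("size")
    return kept[relations_per_doc < max_relations]
```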
3.3 Building a Diversity-augmented Dataset
3.3.1 Diversity-sampling on Organisms and Chemicals
The preprocessed dataset was first stratified according to the taxonomic classification (kingdoms) of the organisms associated with the relations reported in each document. Subsequently, the GME-sampler was applied to each subset (Figure 5 Top) to monitor the evolution of the diversity metrics (H_S(O) and H_S(C)) and determine an optimal sample size. Indeed, the GME-sampler operates as a ranking method, where the article selected at step n is the one that contributes the most to the diversity of the set of n−1 articles selected upstream. For both organisms and chemicals, diversity increases rapidly over the first hundred ranked items, followed by a plateau. Specifically for organisms (regardless of the kingdom), diversity showed a decline in the second half of the sampled items (Appendix Table C.2). This signals that newly added articles provide relations for already well-covered organisms and disrupt the existing balance in the organism distribution. In contrast, the impact of newly added articles on chemicals is negligible, likely because they represent a larger set of distinct entities. To keep a reasonable balance between diversity and sample size, we decided to retain only the top n = 500 ranked articles per kingdom, ensuring at least 80% of the maximal observed entropy on both organisms and chemicals (Figure 5 Bottom). The proportions of maximal observed entropy at alternative sample sizes are presented in Appendix Table C.3.
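The corresponding cutoff selection can be sketched as follows, assuming the running entropies are available as lists indexed by rank (e.g., the per-step scores returned by the sampler); the 80% threshold matches the criterion described above.

```python
def smallest_sufficient_sample(entropy_org, entropy_chem, fraction=0.8):
    """Smallest rank n at which the running entropies over organisms and chemicals
    both reach at least `fraction` of their maximum observed values."""
    max_o, max_c = max(entropy_org), max(entropy_chem)
    for n, (h_o, h_c) in enumerate(zip(entropy_org, entropy_chem), start=1):
        if h_o >= fraction * max_o and h_c >= fraction * max_c:
            return n
    return len(entropy_org)
```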
The impact of the diversity-sampling strategy is evaluated by comparing the composition of the sample against 5 random samples of equivalent sizes.5 The original diversity sample and the extracted random samples are respectively denoted as Diversity and Random samples. While showing a similar kingdom coverage because of the common stratification procedure (Figure 4.B Bottom), the diversity sample is, as expected, significantly richer in terms of distinct numbers of chemicals, organisms, and relations (Figure 4.C). This improved diversity is also reflected in a reduced Pareto effect in the distribution of organisms (negligible for chemicals) and a reduced overlap between the entities reported in each article (Figures 4.D and E).
The diversity-sampling strategy was also evaluated against three alternative baselines. In Top-organisms, the top 500 articles with the most distinct organisms (individually) were extracted per biological kingdom. The same was done for relations and chemicals with Top-relations and Top-chemicals. As expected, the Top-relations strategy led to the largest set of distinct relations (Figure 5.B), followed by Top-chemicals and the proposed diversity-sampling. However, this improvement comes at the expense of a poorer diversity of organisms and a poorer balance in their distribution (Figures 5.C and D). Interestingly, the Top-organisms strategy led to a smaller set of entities compared with the diversity-sampling. Indeed, in the case of an imbalanced distribution of entities over the sampled items (i.e., some model organisms attract more articles than non-model organisms), the simple Top-organisms strategy does not account for this potential redundancy, whereas its prevention is an explicit objective of the proposed approach. Overall, the evaluated metrics suggest that the diversity-sampling with the GME-sampler offers a valuable compromise between these alternative strategies.
3.3.2 Distance Between Standardized Annotations and Original Text
Several studies emphasize the importance of data quality over quantity for fine-tuning language models (Zhou et al. 2023; Dettmers et al. 2023; Li et al. 2023). LOTUS data are recognized as being of high quality, particularly because of the harmonization, cleaning, and validation steps of the workflow, aligning original records from several open NP databases into standardized structures and organisms in Wikidata. Although this is essential to ensure data FAIRness, these processes logically distance the standardized entries from their original literal mentions in the referenced articles. To get a rough estimate of this distance, the Diversity and the 5 Random samples were merged into a single Extended dataset. Then, we estimated the proportion of the labels of the standardized entities that could be found in the original abstracts of articles reporting the relationships. Details of this estimation are in Appendix C.2. More than 2/3 of the organism labels are effectively retrieved in the original abstract, while less than half of the chemical names can be retrieved, even considering their synonyms (Figure 4.F). Assuming that these two types of mismatches are independent, only 1/3 of the reported pairs would be completely found in an abstract. Finally, some reported NP relationships are simply not expressed in the abstract of the cited reference, but have been reported from the body of the article or supplementary materials.6 Whether they are derived from the Diversity or a Random sample, these noisy examples make the training of a model challenging because some labels to be predicted are missing from the input text (Northcutt, Jiang, and Chuang 2021; Jain et al. 2020). In this context, alternative strategies like zero-shot or few-shot learning (also called in-context learning) based on open LLMs (Liu et al. 2022; Chen et al. 2022b) also need to be considered.
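Returning to the label-coverage estimate above, a simplified sketch is given below, assuming a case-insensitive literal match of each standardized label (or any of its synonyms) against the abstract text; the actual procedure is detailed in Appendix C.2.

```python
def label_coverage(records):
    """Fraction of standardized entity labels literally found in the cited abstract.
    Each record is a tuple (abstract_text, label, synonyms)."""
    found = 0
    for abstract, label, synonyms in records:
        text = abstract.lower()
        if any(name.lower() in text for name in (label, *synonyms)):
            found += 1
    return found / len(records)
```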
3.3.3 Creating a Manually Curated Evaluation Dataset
While these discrepancies certainly affect the training of a model, they are an even more sensitive issue in an evaluation set (Northcutt, Athalye, and Mueller 2021). Similarly, while diversity is an important feature of a training set (Yu, Khadivi, and Xu 2022), it is arguably also important for an evaluation set (Liang et al. 2022). While smaller by design, the evaluation set needs to be representative. Finally, because the manual curation of an evaluation set is an expensive and time-consuming task, the selected set of entries needs to be chosen carefully (Sambasivan et al. 2021). Considering these points, the knee-points of the entropy curves obtained with the GME-sampler (where the entropy gain from newly added articles weakens) suggest relevant trade-offs between sample size and diversity early in the sampling (Figure 5 Bottom-panel). Nonetheless, as these knee-points vary across biological kingdoms and between organisms and chemicals, and could be too restrictive, we instead extracted the top 50 items from each kingdom, resulting in an evaluation dataset of 200 abstracts. The abstracts were manually curated by an expert, annotating all instances of mentioned organism–NP relationships in their order of appearance in the original text and using established identifiers such as Wikidata IDs and PubChem IDs. As isolated chemicals are sometimes grouped into chemical families for the sake of brevity in abstracts, all mentions of a more general chemical family were also annotated. The curated evaluation set is publicly available at https://zenodo.org/records/8422007. Details about the curation protocol are available in Appendix B.1, together with a comprehensive overview of the content of the dataset in Appendix B.2. Additionally, we computed the inter-annotator agreement for the organism–NP relationships, based on a separate set of annotations provided by a second annotator using the same guidelines, and reached an agreement of 88.5%. Details are given in Appendix B.3.
3.4 Few-shot Learning Approaches the Performance of Standard Fine-tuning on Raw Data
The mismatches between the standardized labels and the original abstracts have therefore been corrected for the evaluation set. However, due to the considerable investment of time and resources required for this task, the same corrections were not applied on the remaining data available for training. In this particular context of noisy data for end-to-end RE, two strategies were evaluated: standard fine-tuning and few-shot learning, the latter being able to rely only on a few manually selected examples. The performance of the fine-tuning strategy was evaluated using train/valid datasets derived from the initial Diversity, Random, and Extended samples, which will be referred to as Diversity-raw, Random-raw, and Extended-raw, respectively. Specifically, Extended-raw is an extension of Diversity-raw that also includes all examples from the 5 Random-raw datasets. To further evaluate the impact of dataset size on training performance, models were also trained on the Full dataset. The Full dataset is larger than Extended-raw and contains all available examples from the pre-processed LOTUS dataset (excluding the 200 used in the test set). Their respective sizes and splits are detailed in Appendix Table C.4. All datasets were used to train 3 models for end-to-end RE: Seq2rel, BioGPT, and GPT-2. Six open LLMs were also evaluated in few-shot learning settings: LLaMA 7B, 13B, 33B, and 65B, along with two models, respectively fine-tuned on instructions and conversations and derived from LLaMA 7B and 13B: Alpaca-7B and Vicuna-13B.
Best performance in fine-tuning settings was achieved by BioGPT (Table 1). Regardless of the training dataset,7 it consistently outperformed Seq2rel and GPT-2, and reached an F1-score of 32.5 when trained on Extended-raw. We also evaluated the influence of the different training datasets on model performance. The results indicate that models trained on Diversity-raw outperformed8 those trained on Random-raw, with a notable improvement in recall at the expense of precision. Merging the datasets into a larger one (Extended-raw) also improved performance for all models. However, expanding the dataset to all available examples only barely improved on this performance and surprisingly underperformed with BioGPT. In few-shot learning scenarios, the best performance was obtained with LLaMA-65B and declined with smaller models. Although this performance was inferior to that of fine-tuned alternatives, the models achieved reasonable scores considering the limited number of archetypal examples provided. These results also emphasize the potential of few-shot learning or prompt-tuning approaches in practical low-resource contexts.
| Model | Training | Precision | Recall | F1 |
| --- | --- | --- | --- | --- |
| LLaMA-7B | Few-shot learning (5-shot) | 27.0 | 9.0 | 13.6 |
| LLaMA-13B | | 35.6 | 23.6 | 28.5 |
| LLaMA-33B | | 38.5 | 23.2 | 29.0 |
| LLaMA-65B | | 40.2 | 23.0 | 29.2 |
| Alpaca-7B | | 15.1 | 2.2 | 5.9 |
| Vicuna-13B | | 38.4 | 20.4 | 26.5 |
| Seq2rel | Random-raw | 43.2 (±6.7) | 4.8 (±1.2) | 8.6 (±2.0) |
| | Diversity-raw | 39.6 | 5.4 | 9.5 |
| | Extended-raw | 47.3 | 5.8 | 10.4 |
| | Full | 45.6 | 7.1 | 12.2 |
| GPT-2 | Random-raw | 32.5 (±4.8) | 11.8 (±5.3) | 15.0 (±2.5) |
| | Diversity-raw | 22.3 | 19.2 | 20.6 |
| | Extended-raw | 44.8 | 21.7 | 29.3 |
| | Full | 47.5 | 22.5 | 30.5 |
| BioGPT | Random-raw | 47.2 (±4.0) | 19.8 (±2.7) | 27.6 (±2.5) |
| | Diversity-raw | 37.1 | 28.4 | 32.2 |
| | Extended-raw | 42.2 | 26.5 | 32.5 |
| | Full | 46.7 | 21.3 | 29.3 |
3.5 Reversing the Task: Generation of Synthetic Data with Open LLMs
While LLMs cannot compete with fine-tuned approaches in terms of performance in the evaluated settings, their generative abilities could be used alternatively to address the main bottleneck: the discrepancies between the input text and the labels in the training data. Doing so requires going beyond distant supervision or data augmentation (Feng et al. 2021; Shang et al. 2018; Smirnova and Cudré-Mauroux 2018). The former involves mapping relationships from a knowledge base to a large corpus of text to generate pseudo-labels, whereas the latter entails applying a range of transformations, permutations, or morphings to a core set of high-quality examples. The semantic discrepancy between the input text and the output labels would not be resolved by introducing syntactic or lexical variations in the original abstracts. Moreover, the results presented in Table 1 indicate that the inclusion of more (noisy) training instances (Full dataset) does not result in systematic improvements. In contrast, the proposed approach generates a set of new synthetic input abstracts from a pre-defined context and a set of expected output labels (i.e., organism–NP relationships).
To maintain consistency, each synthetic abstract is based on the context and results reported in an original seed abstract. The first step is to generate the instructions to prompt the selected LLM for generation. The instructions are composed of a title, a list of keywords, and the verbalized main findings (Method 2.3). We decided to use the open source Vicuna-13B (Chiang et al. 2023),9 a LLaMA-13B model fine-tuned on user-shared conversations collected from ShareGPT,10 which outperforms alternatives of equivalent sizes on several benchmarks (Dettmers et al. 2023). For each input seed abstract, the top-10 extracted keywords were used in the built instruction. As this is a crucial step, the ability of Vicuna-13B to extract keywords has been evaluated on the SemEval2017-Task10 dataset (Augenstein et al. 2017) in Appendix C.4. To diversify the generated abstracts, m = 10 instruction prompts with different verbalization patterns were then sampled per initial seed article. Finally, only the top k = 3 most relevant synthesized abstracts per seed were selected with the simple, yet effective, selector module.
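Taken together, the per-seed generation loop can be sketched as follows. The prompt wording and field names are illustrative (the exact instructions are described in Section 2.3 and Appendix A.6), and `llm`, `verbalize`, and `selector` stand for the generation model, the findings-verbalizer, and the selector module, respectively.

```python
def generate_synthetic_abstracts(seed, llm, verbalize, selector, m=10, k=3):
    """For one seed article, sample m instruction prompts with different
    verbalization patterns, generate one candidate abstract per prompt, and keep
    the k most relevant candidates according to the selector module."""
    candidates = []
    for _ in range(m):
        findings = verbalize(seed["relations"])  # randomly transformed main findings
        prompt = (
            "Create a scientific abstract.\n"                  # illustrative wording
            f"Title: {seed['title']}\n"
            f"Keywords: {', '.join(seed['keywords'][:10])}\n"  # top-10 extracted keywords
            f"Main findings: {findings}"
        )
        candidates.append(llm(prompt))
    return selector(candidates, seed["relations"], k=k)
```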
To evaluate the impact of diversity-sampling on the seed articles used for synthetic generation, we created two new datasets, Diversity-synt and Random-synt, derived from the original abstracts in the Diversity-raw and Random-raw datasets, respectively. Several illustrative examples of synthetic abstracts from Diversity-synt are discussed in Appendix A.7, highlighting both the variability and the potential caveats (errors, hallucinations) of the process. As with the original data, Diversity-synt and Random-synt were merged into Extended-synt to measure the impact of the dataset size. Statistics of the generated datasets are presented in Appendix Table C.4. In total, more than 25,000 synthetic abstracts were generated from the 7,901 originally contained in the raw datasets. From Diversity-raw, 200 initial seed items were excluded by the selector module, and 162 on average from Random-raw. While the distinct numbers of entities/relations dropped in the synthetic datasets, the selector guarantees that these labels are part of the generated abstracts. Furthermore, the generation process enables the integration of examples with chemical classes in the input text and expected labels, which were not available in the original data.
3.6 Training on Synthetic Data Improved Performance over Noisy Raw Data
The synthetic datasets were used to train new instances of the previously evaluated models: Seq2rel, GPT-2, and BioGPT. Although the synthetic training sets (Diversity-synt and Random-synt) are almost half the size of Extended-raw (respectively, 3,562 and 3,798 compared with 7,111 examples), on which the previous BioGPT baseline (F1-score = 32.5) was established, all the trained models demonstrated improved performance (see Table 2). Indeed, all metrics improved in all 9 configurations (3 models × 3 categories of dataset: Random, Diversity, and Extended), with the best gains observed for Seq2rel. The improvements range from 6.2 to 21.9 points in precision, from 13.2 to 25.3 in recall, and from 12.4 to 30.6 in F1-score. The ranking of the models and the impact of the synthetic training sets on the final performance align with the previous observations on the original data. BioGPT models consistently outperformed Seq2rel and GPT-2, and training on Diversity-synt resulted in an improved recall at the expense of precision compared with Random-synt. However, the GPT-2 models trained on Random-synt on average outperformed those trained on Diversity-synt, a departure from the trend observed with Seq2rel and BioGPT. Again, the best performance is achieved by BioGPT trained on the merged set, with F1-score = 53.8.
| Model | Dataset | Precision | Recall | F1 |
| --- | --- | --- | --- | --- |
| Seq2rel | Random-synt | 62.4 (±1.0) (↑ 19.2) | 26.8 (±2.0) (↑ 22.0) | 37.5 (±1.9) (↑ 28.9) |
| | Diversity-synt | 61.5 (↑ 21.9) | 30.7 (↑ 25.3) | 40.1 (↑ 30.6) |
| | Extended-synt | 65.1 (↑ 17.8) | 29.9 (↑ 24.1) | 41.0 (↑ 30.6) |
| GPT-2 | Random-synt | 42.6 (±2.9) (↑ 10.1) | 32.7 (±2.8) (↑ 20.9) | 37.2 (±2.8) (↑ 22.2) |
| | Diversity-synt | 28.5 (↑ 6.2) | 39.4 (↑ 20.2) | 33.0 (↑ 12.4) |
| | Extended-synt | 52.0 (↑ 7.2) | 44.6 (↑ 22.9) | 48.0 (↑ 18.7) |
| BioGPT | Random-synt | 56.4 (±2.3) (↑ 9.2) | 38.8 (±1.9) (↑ 19.0) | 46.0 (±1.1) (↑ 18.4) |
| | Diversity-synt | 52.5 (↑ 16.0) | 41.2 (↑ 13.2) | 46.2 (↑ 14.4) |
| | Extended-synt | 63.7 (↑ 21.5) | 46.5 (↑ 20.0) | 53.8 (↑ 21.3) |
Finally, two BioGPT-Large models were trained on Diversity-synt and Extended-synt (see Table 3). The model trained on Diversity-synt achieved an F1-score of 57.2, comparable to the new best model trained on the much larger merged set (F1-score = 59.0), and also demonstrated a better recall (56.9 against 51.6).
4 Discussion
The application of deep learning models for the completion of biomedical knowledge bases is largely limited by the availability and quality of domain-specific labeled data (Liang et al. 2022). Therefore, we adopted a data-centric methodology (Mazumder et al. 2023; Zha et al. 2023). In order to address the data imbalance and optimize the manual curation process, we proposed the GME-sampler, inspired by diversity metrics commonly used in ecology. The sampler was applied on the pre-processed LOTUS dataset (separately for each biological kingdom) to extract a subset of documents, ensuring a diverse set of organisms and chemicals in the reported relations. The compositional analysis revealed a higher number of distinct entities in the extracted sample, but also a better balance considering the fixed number of items. Diversity has been recognized as an important factor in a training set for representation learning and for improving the generalization performance of models (Gong, Zhong, and Hu 2019; Yu, Khadivi, and Xu 2022), which is essential for the NER sub-task. By forcing diversity into the relation partners (organisms and chemicals), we also expect it to be improved in their mentioning contexts. Considering the time and domain expertise required to annotate an evaluation dataset, the diversity metric was also used for partitioning: we manually annotated a representative subset consisting of the 200 top-diverse items. We hope that this manually curated evaluation dataset will help the community to build upon this work.
Despite a smaller number of trained parameters, BioGPT and GPT-2 fine-tuned with QLoRA clearly outperformed Seq2rel. This highlights the benefit of the larger pre-training, but also the effectiveness of the QLoRA strategy, where low-rank updates of a large, but quantized, model achieve better performance than the full fine-tuning of a smaller model, for a lower parameter budget (Aghajanyan, Gupta, and Zettlemoyer 2021; Dettmers et al. 2022; Hu et al. 2022). While based on the same architecture, the improvements of BioGPT over GPT-2 can be attributed both to the pre-training on PubMed and to the dedicated tokenizer (see Appendix Figure C.2). Beyond the architectures of the models, the training dataset also had a significant impact on the performance. A comparison between models trained on the largest (Extended-raw) and the diversity-optimized dataset revealed that the latter achieved competitive results despite its smaller size. Additionally, the results suggest that improving the diversity of the provided set of training examples can improve the recall of the model, at the expense of precision. Intuitively, we speculate that the extensive variety of distinct named entities present in the Diversity samples (see Figure 4.C) may benefit the NER sub-task learned by the models. However, an increase in the number of identified named entities and in the complexity of the examples (with more entities come more potential relations) may not be as beneficial for the learning of the second sub-task, RE. This could result in more sensitive models: higher recall but lower precision. Overall performance (measured by F1-score) is improved with the Diversity dataset on raw data and is equivalent or better for Seq2rel and BioGPT on synthetic data. Increasing the number of training examples with noisy data also has limited benefits, as suggested by the comparison with the (Full) training dataset extended to all available data (no sampling, no stratification) (Salhofer, Liu, and Kern 2022; Prusa, Khoshgoftaar, and Seliya 2015; Liang et al. 2022). Finally, few-shot learning techniques leveraging open LLMs exhibit reasonable performance (see LLaMA-65B) and can be particularly valuable when only limited or noisy data are available. However, their larger size may incur higher management costs, necessitating careful consideration of resource allocation.
Instead of using LLMs to directly perform the task, we then propose to use them to generate synthetic examples and alleviate the noise in the dataset. However, evaluating the quality of the generated abstracts is challenging. Although the process is prone to hallucinations, factuality is not the key criterion, as long as the generated texts are credible, meaning that they are coherent and adhere to the established syntax, style, and patterns of expression of the relations in human-written abstracts. Since the training sets of LLMs contain scientific articles and abstracts, they have absorbed their stylistic and syntactic specificities. The generation of synthetic data could then be seen as a form of knowledge distillation. Moreover, while previous studies have suggested that LLMs may not be knowledgeable (Cao et al. 2021; Si et al. 2023; Mallen et al. 2023), other investigations have highlighted the remarkable capabilities of chatbot and instruction-tuned models in following style instructions (Pu and Demberg 2023; Chia et al. 2024). A first relevant evaluation criterion for these synthetic data is therefore the performance improvement they provide. Additionally, we measured the textual similarities between synthetic data and original abstracts from the natural products literature with an n-gram overlap analysis in Appendix A.8. The impact of hallucinations (more precisely, instruction inconsistencies) on the synthetic data and on the performance of the trained models is also evaluated.
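As an illustration of such a textual-similarity measure, a simple option is the Jaccard overlap between word n-gram sets, sketched below; the exact metric used in Appendix A.8 may differ in tokenization and aggregation.

```python
def ngram_overlap(synthetic, original, n=3):
    """Jaccard overlap between the word n-gram sets of two abstracts."""
    def ngrams(text):
        tokens = text.lower().split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    a, b = ngrams(synthetic), ngrams(original)
    return len(a & b) / len(a | b) if a | b else 0.0
```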
All three models, despite different architectures or pre-training data, demonstrated improvements across all metrics on the 3 categories of datasets (Random, Diversity, Extended), highlighting the benefits of synthetic data in contexts with initially sparse labeled data. Also, transitioning from raw noisy data to synthetic data did not alter the previously observed trend: BioGPT outperformed the other models, and the diversity-optimized sampling had a positive effect on the recall of trained models when used to select the seed articles. Most importantly, we noticed that the transition from original to synthetic data had a more decisive impact on the performance improvements than the choice of the model architecture (Seq2rel, GPT-2, BioGPT). For instance, the influence of synthetic data on the performance of BioGPT and GPT-2 is greater than the difference between the two fine-tuned models. The performance of Seq2rel was also enhanced by almost a factor of 4, notably narrowing the gap with GPT models. Similarly, scaling up the architecture with BioGPT-Large (>4.5× larger) did result in improved performance, but the gain is comparable to the previous enhancement obtained with synthetic data. We also noticed a clear impact of the training dataset, with the best observed recall achieved with Diversity-synt. These results further support the data-centric view, even in low-resource scenarios, by demonstrating improved performance over other strategies such as few-shot learning (Xu et al. 2022).
5 Limits and Future Work
Fine-tuned methods exhibit superior performance compared with zero-shot/few-shot approaches. However, the basic prompting approach used in the experiments may not fully demonstrate the capabilities of the models, and alternative strategies have been proposed (Zhao et al. 2021; Wu et al. 2023; Liu et al. 2022). Nevertheless, Jimenez Gutierrez et al. (2022) noted that even with these improvements, the models still lack the accuracy of fine-tuned approaches trained on quality data. The use of LLMs to generate abstracts also has some evident limitations. The generated abstracts exhibit a narrow range of styles to express the relationships between organisms and chemicals compared to human-written abstracts. Although we argued that strict data augmentation could not effectively bridge the initial gap between text and labels, token- or sentence-level augmentations on generated abstracts could, however, improve both the quantity and diversity of synthetic data (Chen et al. 2023). We suppose that the synthetic data mostly improved the recognition of organism and chemical entities, this sub-task being inherently embedded in the ultimate task of decoding the relationships. Following Kim et al. (2022), LLMs could also be used to generate alternative demonstrations for in-context learning. Nonetheless, such approaches need to be further evaluated in the specific context of the biomedical literature.
Firstly, although the proposed framework is effective, it cannot guarantee the true diversity of the generated abstracts, and the finally selected abstracts may be very similar to one another. Secondly, the selector module does not ensure that the relations are semantically expressed in the generated abstracts, as it only checks for the explicit mention of the entities. Finally, all generated examples are designed as “positive” cases, meaning that a relation is always expected, which may not be the case in practical applications. The developed models are intended for use on a large corpus of articles, and the input documents can either be selected by an upstream retriever component, or the predictions can be re-evaluated by a downstream selector. Continuing with this data-centric view, future work will prioritize improving the three key components (instructions builder, generator, and selector) to improve the diversity of the synthetic abstracts, rather than focusing on the architecture of the trained models.
Given the highly dynamic nature of the LLM research area, we anticipate significant advancements in model architecture and accessibility to arise from the research community. At the date of writing, the release of LLaMA (and LLaMA2) has paved the way for the creation of more open-license models, such as the next-generation of Vicuna,11 Mixtral (Jiang et al. 2024), or PMC-LLaMA (Wu et al. 2024), and BioMistral (Labrak et al. 2024) trained on the biomedical literature. The development of multilingual open LLMs (Scao et al. 2022) also offers opportunities for synthetic data generation in promising areas, such as the extraction of plant-disease relationships from Traditional Chinese Medicine prescriptions, where the scarcity of labeled data is limiting (Li et al. 2022).
6 Conclusion
With the aim of assisting the completion of NP databases, we provide the first training and evaluation datasets along with the first trained models for end-to-end RE of relationships between organisms and chemicals. Along with these main results, we explored different strategies and proposed new developments to address the problems raised in this biomedical context. We empirically showed the benefit of the proposed GME-sampler for building a diverse and balanced evaluation dataset, as well as its positive impact on recall via the training data. The results also indicate that the opportunities brought by open LLMs in scenarios with little or weakly labeled data may not lie only in their zero/few-shot learning abilities, but also in their great potential as synthetic data generators. They could open the door to the extraction of previously unexplored relationships between biomedical entities expressed in the literature, a prerequisite to unlock new paths of inference in knowledge discovery.
Appendix A Experimental Setup and Implementation Details
A.1 Few-shot In-context Learning Details
The prompt used for few-shot in-context learning with K = 5 archetypal input–completion examples with LLaMA (7B, 13B, 33B, 65B) is provided in Figure A.1. We used greedy decoding, setting the temperature to 0. Considering their particular fine-tuning, small adjustments were made to the prompt for Alpaca-7B and Vicuna-13B. All models were also quantized for memory-efficient inference, and average inference times are presented in Table A.1. Given our available resources, we were not able to use q8 (8-bit) quantization for LLaMA models larger than 13B; improvements in performance could therefore be expected. In parallel, we noticed significant performance degradations when using q4 (4-bit). We used llama.cpp12 for quantization and inferences.
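For reference, inference with a quantized model can be run through the llama-cpp-python bindings along the following lines; the model path, file name, and prompt file below are illustrative placeholders rather than the exact configuration used.

```python
from llama_cpp import Llama  # Python bindings for llama.cpp

# Illustrative model path and quantized file name; n_ctx matches the 2,048-token window.
llm = Llama(model_path="models/llama-65b-q5_K_M.gguf", n_ctx=2048)

few_shot_prompt = open("prompt_5shot.txt").read()  # the 5-shot prompt from Figure A.1

out = llm(few_shot_prompt, max_tokens=256, temperature=0.0)  # temperature 0: greedy decoding
print(out["choices"][0]["text"])
```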
| Model | Quantization type (size in GB) | Average inference time in ms (± sd) |
| --- | --- | --- |
| LLaMA-7B | q8 (6.8 GB) | 38,672 (± 22,418) |
| LLaMA-13B | q8 (13.2 GB) | 74,102 (± 6,955) |
| LLaMA-33B | q5_K_M (21.9 GB) | 143,418 (± 72,740) |
| LLaMA-65B | q5_K_M (44.1 GB) | 238,103 (± 122,970) |
| Alpaca-7B | q8 (6.8 GB) | 32,293 (± 16,099) |
| Vicuna-13B | q8 (13.2 GB) | 67,504 (± 32,642) |
A.2 Choice of the Models
We selected three models for evaluation: Seq2Rel,13 BioGPT14 (and its variant BioGPT-Large15), and GPT-2.16 Seq2rel was originally designed for end-to-end RE and was later outperformed by BioGPT. With this minimal set of models, we aim to compare two distinct architectures: Seq2Rel (encoder-decoder) and BioGPT or GPT-2 (decoder-only). Note that BioGPT and GPT-2 share the same architecture. Additionally, we evaluate two pre-training settings: BioGPT pre-trained on PubMed articles17 and GPT-2 pre-trained on a non-biomedical corpus. Finally, we explore two training approaches: full fine-tuning for Seq2rel and tuning via adapters with QLoRA for BioGPT and GPT-2. The number of trained parameters for each model is detailed in Table A.2.
| Model | Total parameters | Trainable parameters |
|---|---|---|
| Seq2rel | 118,546,185 | 118,546,185 (100%) |
| BioGPT | 350,649,472 | 3,886,208 (1.11%) |
| BioGPT-Large | 1,582,722,536 | 11,533,736 (0.73%) |
| GPT-2 Medium | 358,381,208 | 3,555,992 (0.99%) |
A.3 Fine-tuning Details
Dettmers et al. (2023) demonstrated the efficacy of the QLoRA approach by showing that the loss in performance due to quantization can be fully recovered through subsequent fine-tuning of the adapters, and that increasing the number of adapters is crucial to match full fine-tuning performance. Exploiting the memory benefits of the NF4 data type, we applied LoRA adapters to all linear blocks (except the initial embedding layer) of the BioGPT and GPT-2 models. Details on the number of trained parameters are presented in Table A.2. During training, the special tokens <BOS> and <EOS> are used to delimit the input X and the expected linearized output Y, such as [X, <EOS><BOS>, Y, <EOS>]. The <BOS> token triggers the RE task at inference time.
For all evaluated datasets, models were then trained for 15 epochs (10 for BioGPT-Large) with 100 warm-up steps, and the best epoch was selected using the validation set. We set learning rate = 1e-4, LoRA-r = 8, LoRA-α = 16, and batch size = 16. We used the available implementation of QLoRA with PEFT (Mangrulkar et al. 2022) and the recommended 8-bit paged AdamW optimizer18 (Dettmers et al. 2022).
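The snippet below is a condensed sketch of this adapter and quantization setup with the transformers, bitsandbytes, and PEFT libraries; the surrounding training loop, tokenization, and <BOS>/<EOS> formatting are omitted, and the output directory is a placeholder.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization of the frozen base model (QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained("microsoft/biogpt", quantization_config=bnb_config)
model = prepare_model_for_kbit_training(model)

# LoRA adapters on all linear blocks (target modules as listed in Table A.3).
peft_config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj", "fc1", "fc2", "output_projection"],
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()  # ~1.1% of parameters for BioGPT (Table A.2)

# Training configuration matching the reported settings.
args = TrainingArguments(
    output_dir="biogpt-qlora",  # placeholder
    learning_rate=1e-4, num_train_epochs=15, warmup_steps=100,
    per_device_train_batch_size=16, gradient_accumulation_steps=5,
    weight_decay=0.01, optim="paged_adamw_8bit",
)
```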
For Seq2rel, we applied standard full fine-tuning, as in the original article. All fine-tuning experiments were conducted on an NVIDIA GeForce RTX 3090. See details on hyperparameter tuning in Section A.4.
A.4 Hyperparameter Tuning
Hyperparameter settings, including the learning rate, batch size, and LoRA configuration, were evaluated on the Diversity-synt dataset with Optuna (Akiba et al. 2019). The F1-score on the validation set was used as the evaluation criterion. In line with Giorgi, Bader, and Wang (2022), greedy decoding was used during the hyperparameter tuning phase, followed by a tuning of the decoding strategy on the configuration that yielded the best results. The experimental setup involved n = 140 trials of 5 epochs each, executed with the TPE (Tree-structured Parzen Estimator) sampler and a median pruner. A summary of the tuned hyperparameters and the selected values for BioGPT is presented in Table A.3.
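A minimal Optuna sketch of this search is shown below; train_and_evaluate is a hypothetical helper that fine-tunes the model for 5 epochs with the sampled configuration and returns the validation F1-score, and the search space only approximates the one actually explored.

```python
import optuna

def objective(trial: optuna.Trial) -> float:
    # Assumed search space; the tuned hyperparameters are listed in Table A.3.
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 5e-4, log=True)
    batch_size = trial.suggest_categorical("batch_size", [4, 8, 16])
    lora_r = trial.suggest_categorical("lora_r", [8, 16])
    lora_alpha = trial.suggest_categorical("lora_alpha", [16, 32])
    # Hypothetical helper: 5 training epochs, greedy decoding, returns validation F1.
    return train_and_evaluate(learning_rate, batch_size, lora_r, lora_alpha, trial=trial)

study = optuna.create_study(
    direction="maximize",
    sampler=optuna.samplers.TPESampler(seed=42),
    pruner=optuna.pruners.MedianPruner(),
)
study.optimize(objective, n_trials=140)
print(study.best_params)
```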
| Hyperparameter | Tuned? | Value |
|---|---|---|
| Training | | |
| Batch size | yes | 16 (12) |
| Number of epochs | no | 15 |
| LoRA r | yes | 8 |
| LoRA alpha | yes | 16 |
| Learning rate | yes | 1.00e-4 |
| Weight decay | no | 0.01 |
| Gradient accumulation steps | no* | 5 |
| LoRA dropout | no | 0.05 |
| LoRA target modules | no | q_proj, k_proj, v_proj, out_proj, fc1, fc2, output_projection |
| Decoding | | |
| Strategy | yes | beam search |
| Beam size | yes | 3 |
| Stopping criteria | yes | never |
| Length penalty | yes | 1.5 |
| Temperature | no | 0 |
The relationships between hyperparameters and performance are depicted in panel A of Figure A.2. While the batch size and the LoRA configuration do not show a strong impact on the final performance, the learning rate was identified as a critical parameter. With the TPE sampler, the learning rate of the trials rapidly converged around 1e-4, resulting in stable performance across different batch sizes and LoRA configurations. The impact of the LoRA rank r is illustrated in more detail in panels B and C. As previously observed by Aghajanyan, Gupta, and Zettlemoyer (2021) and Hu et al. (2022), increasing the rank of the LoRA adapters from r = 8 to r = 16 resulted in only marginal improvements, considering the doubling of the number of trained parameters. The boxplots in panel C also show similar performance with small variability in validation F1-score for trials with LoRA r = 8 and r = 16. After choosing the final training configuration (lr = 1e-4, r = 8, α = 16, batch size = 16), the decoding strategy was tuned with 40 trials, evaluating greedy decoding and beam search with beam sizes of 3 or 5 (see panel D). Ultimately, beam search with beam size = 3 was selected. The best hyperparameter settings obtained for BioGPT were reused for GPT-2, as they share the same architecture, and later for BioGPT-Large.
The same hyperparameter tuning experiments were conducted for Seq2rel (see Figure A.3, analogous to panel A of Figure A.2), consisting of 30 trials of 10 epochs each on the Diversity-synt dataset. A summary of the hyperparameters tuned for Seq2rel is presented in Table A.4.
| Hyperparameter | Tuned? | Value |
|---|---|---|
| Training | | |
| Decoder learning rate | yes | 9.00e-4 |
| Batch size | no | 4 |
| Number of epochs | yes | 20 |
| Gradient accumulation steps | no | 10 |
| Others | no | identical to Seq2rel's CDR config |
| Decoding | | |
| Beam size | yes | 5 |
| Length penalty | yes | 1 |
A.5 Evaluation Details
All models (in fine-tuning and few-shot settings) were evaluated for end-to-end RE, jointly performing NER and RE, framed as a generative task. The performance of the tested models was assessed by measuring the F1-score over the predicted relations extracted from the decoded outputs. An extracted relation is considered correct only if both the head (an organism) and the tail (a chemical) entities exactly match the ground-truth labels.
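The sketch below illustrates this evaluation, assuming that relations are reduced to (organism, chemical) string pairs and that scores are micro-averaged over documents; the exact aggregation used for the reported results may differ.

```python
def relation_f1(predicted: list[set[tuple[str, str]]], gold: list[set[tuple[str, str]]]) -> float:
    """Micro-averaged F1 over (organism, chemical) pairs with exact string matching."""
    tp = fp = fn = 0
    for pred, ref in zip(predicted, gold):
        tp += len(pred & ref)
        fp += len(pred - ref)
        fn += len(ref - pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

# Toy example: one document, one of two gold relations recovered -> F1 ≈ 0.67.
gold = [{("Organism X", "compound A"), ("Organism X", "compound B")}]
pred = [{("Organism X", "compound A")}]
print(round(relation_f1(pred, gold), 2))
```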
A.6 Main Findings: Verbalization Patterns
To emulate different patterns of expression of the NP relationships, 5 transformations are applied: (1) chemical class replacement, (2) derivates contraction, (3) shuffling, (4) numbering, and (5) relation directionality. The findings-verbalizer module operates as a sampler, and each transformation has an assigned probability. In the conducted experiments, we used p1 = 0.2, p2 = 0.9, p3 = 1 (systematic shuffle), p4 = 0.25, and p5 = 0.9 for the corresponding transformations. These values were empirically estimated from behaviors observed in the literature. To enhance the diversity of the generated abstracts, the temperature parameter is also randomly sampled in the subsequent generation step: t ∈ {0.5, 0.6, 0.7, 0.8}, as similarly evaluated by Chung, Kamar, and Amershi (2023). All other decoding parameters were set to their defaults: top-K = 40, top-P = 0.95, and repeat-penalty = 1.1. As for few-shot learning, we used llama.cpp through the Python bindings library llama-cpp-python19 for inference when generating the synthetic abstracts. We monitored the generation time and observed that, on average,20 a synthetic abstract is produced in 35,708 (± 13,945) ms, showing a significant variability depending on the prompt (min ≈ 10 s and max ≈ 2 min). All generation experiments were conducted on an NVIDIA GeForce RTX 3090.
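A minimal sketch of this sampling behavior is shown below; the transformation functions themselves are hypothetical placeholders, and only the assigned probabilities and the temperature grid reflect the values reported above.

```python
import random

# Probabilities assigned to each transformation (p1-p5 above).
TRANSFORM_PROBS = {
    "chemical_class_replacement": 0.20,
    "derivates_contraction": 0.90,
    "shuffling": 1.00,   # systematic shuffle
    "numbering": 0.25,
    "relation_directionality": 0.90,
}

def verbalize_findings(findings, transforms, probs=TRANSFORM_PROBS, rng=random):
    """Apply each transformation to the structured main findings with its assigned probability.

    `transforms` maps transformation names to (hypothetical) callables acting on the findings.
    """
    for name, probability in probs.items():
        if rng.random() < probability:
            findings = transforms[name](findings)
    return findings

# The temperature of the subsequent abstract-generation step is itself sampled at random.
temperature = random.choice([0.5, 0.6, 0.7, 0.8])
```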
A.7 Examples of LLM Prompting for Synthetic Abstract Generation
The following section provides archetypal examples to illustrate the diversity engendered by the synthetic abstract generation process. Recall that each generation is calibrated with an original title, a set of keyphrases derived from the original abstract, and verbalized main findings. In the latter, 5 main transformations can be applied to improve the diversity of the generation (see Method 2.3).
These transformations allow for the generation of multiple alternative synthetic abstracts, which emulate different syntaxes or styles for communicating the isolation of the same set of compounds (see Figure A.4). Example A serves as a reference for a standard instruction/generation. Example B introduces variations by reshuffling the order of the mentioned chemicals and then numbering them. In C, different subsets of compounds were substituted with their associated chemical families. In A and B, the expected output labels align with the verbalized main findings, e.g., “Lachnum papyraceum produces 6-Methoxymellein; Lachnum papyraceum produces 4-Chloro-6-methoxymellein, ...”. In C, they are substituted by the chemical classes: “Lachnum papyraceum produces Coumarins; etc.”
However, for multiple co-joined chemicals (Figure A.5), while the synthetic text mentions “cytosporones J-N, pestalasins A-E”, the outputs are expected to be expanded as: “Cytosporone J, Cytosporone K, Cytosporone L, …, Pestalasin A, Pestalasin B, …, Pestalasin E”. Verbalized relations can also exhibit an N:M pattern, when multiple compounds are isolated from multiple organisms, showcasing the model's creative generation abilities (Figure A.6).
The generation process is subject to certain limitations and can occasionally produce inaccuracies of a similar nature to those that were intended to be mitigated. In Figure A.7, while it is explicitly indicated in the instruction part that “Tagetes erecta produces two Flavonoids”, this information does not appear in the generated abstract. Additionally, the NPs isolated from Tagetes lucida are qualified as Flavonoids, which is a wrong assertion, i.e., a hallucination. The synthetic abstracts frequently exhibit instances of hallucinations, yet, these do not significantly impair their utility for the specific task of RE, as long as they do not pertain to the expression of the relationships (see Figure A.8).
A.8 Synthetic Abstracts: Empirical Analysis of N-gram Overlap and Impact of Hallucinations
N-gram-based metrics (e.g., the BLEU score) have been widely used to assess the quality of text generation in machine translation and for author-style classification (Papineni et al. 2001; Sidorov et al. 2014; Ríos-Toledo et al. 2022). Intuitively, n-grams capture the frequency of words, as well as lexical and syntactic properties of a text. We computed the proportion of overlap between the top-50, top-100, and top-500 most frequent word n-grams in the generated abstracts and three distinct reference sets: the original seed articles used for generation (vs. Originals), random articles sampled from LOTUS (vs. LOTUS), and random articles from PubMed (vs. Randoms). By comparing the n-gram overlaps, we aim to determine whether the generated abstracts are more similar to those from the natural products literature (Originals and LOTUS) than to random biomedical abstracts from PubMed (Randoms). Panel A in Figure A.9 shows a similar proportion of n-gram overlap between the synthetic abstracts and the Originals and LOTUS sets, consistently higher than with random articles. Such frequent and shared n-grams include, for instance: “were isolated from”, “structures were elucidated”, “1D and 2D NMR”, “with IC50 values”, etc.
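The overlap proportions can be computed as sketched below, assuming whitespace tokenization and lowercasing; the exact preprocessing used for the reported numbers may differ.

```python
from collections import Counter
from itertools import islice

def top_ngrams(abstracts, n=3, k=100):
    """Return the set of the k most frequent word n-grams over a collection of abstracts."""
    counts = Counter()
    for text in abstracts:
        tokens = text.lower().split()
        counts.update(zip(*(islice(tokens, i, None) for i in range(n))))
    return {gram for gram, _ in counts.most_common(k)}

def ngram_overlap(generated, reference, n=3, k=100):
    """Proportion of the generated set's top-k n-grams also found in the reference's top-k."""
    gen, ref = top_ngrams(generated, n, k), top_ngrams(reference, n, k)
    return len(gen & ref) / len(gen) if gen else 0.0
```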
The impact of hallucinations on the quality of synthetic data is also important to consider. We suggested that factual hallucinations on contextual elements in the generated abstracts (e.g., Figure A.8) are less harmful to the quality of synthetic data than hallucinations related to the expression of the relations in the main findings (e.g., Figure A.7). The latter are classified as instruction inconsistencies, when the output of the LLM deviates from the user instructions (Huang et al. 2023).
To evaluate their impact on the performance of the trained models, we constructed a new synthetic dataset, based on the same seed abstracts as Diversity-synt, but using a dedicated decoding strategy. We fixed the temperature at t = 2 for all generations and used a top-k = 500 sampling strategy. With these decoding parameters, we intended to lower the quality of the generations by stimulating the “creativity” of the model and increasing the frequency of instruction inconsistencies (Huang et al. 2023; Holtzman et al. 2020).
Panel B in Figure A.9 shows the distribution of the score obtained with the selector module for Diversity-synt and the newly created dataset Diversity-synt-2. Recall that the selector measures the proportion q of the relations from the expected output labels that have both their head and tail entities explicitly mentioned in the generated abstract. The observed shift clearly indicates more frequent inconsistencies between instructions and generated texts in Diversity-synt-2, where at least one member of a relation stated in the instructions is more frequently omitted.
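The selector score can be sketched as follows, assuming a case-insensitive exact substring match of the entity labels against the generated abstract.

```python
def selector_score(abstract: str, relations: list[tuple[str, str]]) -> float:
    """Proportion q of expected relations whose head (organism) and tail (chemical)
    are both explicitly mentioned in the generated abstract."""
    text = abstract.lower()
    if not relations:
        return 0.0
    hits = sum(
        1 for organism, chemical in relations
        if organism.lower() in text and chemical.lower() in text
    )
    return hits / len(relations)
```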
Finally, we re-trained the three models (Seq2rel, GPT-2, and BioGPT) on two new training datasets built from the new generations (with promoted hallucinations) and conducted an ablation study on the selector module. The dataset Diversity-synt-2-selector uses the implemented selector module to select the top-k = 3 generations per seed article, while the selection was random for Diversity-synt-2-NO-selector. Seq2rel (encoder-decoder) performs robustly when trained on Diversity-synt-2-selector, while BioGPT and GPT-2 exhibit a more pronounced decrease in F1-score. However, all models show a decrease in performance when trained on Diversity-synt-2-NO-selector. Notably, the Seq2rel and BioGPT models still perform better when trained on synthetic data with promoted hallucinations than on the raw noisy data.
Together with these new results, our general observations suggest that generated abstracts exhibit typical lexical and syntactic features of the literature on natural products. The n-gram distribution of the synthetic data is more similar to the natural product literature than to random abstracts, and models trained on these data outperform models trained on the raw noisy data. The results on the newly generated datasets with promoted hallucinations show a decrease in performance across all trained models. This decrease, coupled with our analysis of the selector’s score distribution, suggests that hallucinations, especially those related to the expression of the main findings (instruction inconsistencies), negatively impact model performance. The selector module can then alleviate this issue by excluding these undesirable generations.
Appendix B Evaluation Dataset
B.1 Dataset Curation Protocol
Biocurator:
The dataset was curated by a single curator with a PhD in microbiology and prior experience in manual curation. A second annotator with a background in biology re-annotated the dataset to measure the inter-annotator agreement (IAA).
Article selection:
Articles were selected using the proposed GME-sampler, by extracting the top-200 literature references which maximize the diversity of named entities. All selected articles have a PMID, an available abstract, a title, and are available online on PubMed. No filter was applied based on the journal or the publication date.
Objective:
The curator targeted the relations between organisms (head) and their isolated natural products (tail) in the abstracts. Only organisms and chemicals that are involved in NP relationships are extracted. For example, organisms on which the activity of a compound is tested (e.g., a pathogen like Bacillus cereus) are not annotated. The available LOTUS annotations were always used as a starting point.
Annotation of chemical entities:
All chemical entities are categorized as either singular chemical (e.g., hispaglabridin A) or chemical classes (e.g., Isoflavanoids). The nature of these entities was cross-validated with the standard ChEBI Ontology when necessary. For singular chemicals, information about their chemical class is also extracted if it is mentioned in the article. Importantly, the label of the chemical entity is annotated as it is mentioned in the abstract. To align with the original LOTUS data, Wikidata and PubChem identifiers were assigned to chemicals and classes when available. In cases of ambiguity, the curator refers to the full-text (if available) to obtain more detailed information and assign the correct standardized entity. If the entity is not found in Wikidata, a dedicated identifier in the format “{pmid}CHEM{N}” is assigned instead, e.g., “11421752CHEM1”.
Annotation of organism entities:
Similarly to chemicals, the name of the organism is annotated exactly as it appears in the abstract. When only the genus is determined (e.g., Plakinastrella sp.), the genus name serves as the label.
Annotation of relations:
The output labels only include relations explicitly mentioned in the abstract; relations mentioned only in the full text are excluded. The relations are annotated in their order of appearance in the abstract. If there is more than one organism, the relations of the first organism are annotated first, followed by the relations of the other organisms in order of appearance.
Export:
The annotations are exported in JSON format, as illustrated in Figure B.1, along with further statistics on the annotation.
B.2 Evaluation Dataset: Content Overview
An in-depth evaluation of the content of the curated dataset is provided in Figure B.2. The median numbers of relations, chemicals, and organisms per curated abstract are 6, 5, and 1, respectively (panels A, B, C). Most of the studies included in the dataset focus on identifying natural products (up to a maximum of 22) from one specific organism. However, as illustrated in panels D and E, almost all chemicals and organisms appear only once in the dataset, minimizing the overlap between the mentioned entities across documents. This is expected as a result of the diversity sampling. Considering the applied stratification procedure, the distribution of the biological kingdoms (panel F) is also relatively balanced.
The composition of the curated evaluation dataset, in terms of number of distinct entities and relationships, is also compared to 5 random sets of equivalent size in Table B.1. First, 13 abstracts in the curated dataset did not mention any relationships between organisms and chemicals. Second, for the random sets, statistics were directly estimated from the LOTUS annotations; they may therefore overestimate the actual number of distinct entities, since manual curation could eliminate annotations that are not actually mentioned in the abstracts, and should be regarded as an approximate upper bound. Even considering these points, the proposed strategy for selecting the evaluation set has substantially improved its diversity.
| Dataset | # Organisms | # Chemicals | # Relations | # References |
|---|---|---|---|---|
| eval-set (top-200 diversity) | 275 | 1,197 (1,092 / 105) | 1,488 (1,297 / 191) | 200 (187*) |
| Random (200 articles) | 238 | 610 | 699 | 200 |
B.3 Inter-annotator Agreement
To assess the quality of the annotations in the evaluation dataset, we computed the IAA for the extracted relationships, following the same method as Li et al. (2016). We use the Jaccard index to measure the IAA, i.e., the number of relations on which both annotators agree divided by the size of the union of all relations extracted by the two annotators, the second annotator having followed the same guidelines. A disagreement between the annotators occurred when there was a mismatch in the label or type of the chemical (“chemical” or “class”), or in the label of the organism.
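A minimal sketch of this agreement computation, assuming each annotation is reduced to a hashable (organism label, chemical label, chemical type) tuple:

```python
def jaccard_iaa(annotations_a: set, annotations_b: set) -> float:
    """Inter-annotator agreement as the Jaccard index of the two annotation sets."""
    union = annotations_a | annotations_b
    return len(annotations_a & annotations_b) / len(union) if union else 1.0

# With 1,569 annotations in the union and 179 disagreements: (1,569 - 179) / 1,569 ≈ 0.886.
```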
The observed IAA score is 88.5%. Out of the 1,569 annotations provided by the two annotators, 179 were subject to disagreement. An analysis of the disagreements is provided in panel G of Figure B.2. For example, in PMID 16595963, the first annotator annotated compound 4 as “GS-4”, while the second annotator used the later identification “(4R,4aS,9aR)-1,9a-dihydronidulalin A” (Chemical mention disagreement). In another example, in PMID 32193929, the ambiguous links between “oudemansins”, “oudemansinols”, “polyketides”, and “Favolaschia calocera” led to a disagreement between the annotators on the status of the relationships (Relation disagreement). Overall, while the identification of the organisms involved in a relation was almost always in agreement, the main sources of disagreement concern the status of relationships and the identification (the extracted labels) of the chemicals (or classes).
Appendix C Supplementary Materials
C.1 Chemical Length Thresholding
To determine a reasonable threshold for filtering chemical labels of excessive length, we conducted a comparative analysis of the distribution of label lengths in LOTUS (derived from Wikidata) versus their corresponding IUPAC names (see Figure C.1). While the respective median and mean values clearly suggest that most of the available chemicals are identified with common (i.e., shorter) names, the long right tail of labels exhibits lengths comparable to IUPAC names. These longer labels are often too lengthy to be practical for use in training examples for the targeted RE task. By estimating the length above which 90% of the chemical labels in LOTUS are at least as long as their corresponding IUPAC name, we determined that a threshold of 60 characters effectively filters out excessively long labels.
C.2 Mismatches Between Standardized Labels and Original Abstracts
The 7,901 available abstracts from the literature references in the Extended dataset were extracted using the NCBI E-utilities efetch service. All the organism labels available on Wikidata were directly matched against the abstracts. Using the PubChem exchange service, all synonyms (direct synonyms of the molecule and synonyms of its stereoisomers) were extracted when a PubChem ID was available, for a total of 653,749 synonyms. A chemical entity was considered mentioned in the abstract when there was an exact match of its name or one of its synonyms. Some chemicals, however, may also only be implicitly mentioned in an abstract. Indeed, the isolation of multiple derivatives, such as Atroviridin A, B, and C, is typically reported as “Atroviridins A-C”; Atroviridin B is then not explicitly mentioned and has to be inferred. All chemicals that could be part of such expressions were identified using a set of regular expressions and were treated separately so as not to wrongly inflate the proportion of chemicals not mentioned in the abstracts. Nonetheless, it is worth mentioning that a non-negligible part of these multiple chemical entities are simply not mentioned in the abstract, either explicitly or implicitly. For instance, see the original mentions of malyngamide A21 in PMID 10924193, 11076568, and 21341718.
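For illustration, a simplified regular expression for detecting and expanding such co-joined enumerations is sketched below; the set of expressions actually used is more extensive.

```python
import re

# Simplified pattern for enumerations such as "Atroviridins A-C" or "wortmannins C and D".
COJOINED = re.compile(r"\b([A-Za-z][a-z]+)s\s+([A-Z])\s*(?:-|–|and)\s*([A-Z])\b")

def expand_cojoined(mention: str) -> list[str]:
    """Expand a co-joined derivative mention into individual chemical names."""
    match = COJOINED.search(mention)
    if not match:
        return [mention]
    base, first, last = match.groups()
    return [f"{base} {chr(letter)}" for letter in range(ord(first), ord(last) + 1)]

print(expand_cojoined("Atroviridins A-C"))
# ['Atroviridin A', 'Atroviridin B', 'Atroviridin C']
```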
C.3 Raw and Synthetic Datasets Overview
| Dataset | # Organisms | # Chemicals | # Relations | # References |
|---|---|---|---|---|
| Original dataset | 36,803 | 220,783 | 533,347 | 88,810 |
| Pre-processed dataset | 14,890 | 56,310 | 102,528 | 32,616 |
| Kingdom | N | max HS(O) (rank) | max HS(C) (rank) |
|---|---|---|---|
| Archaeplastida | 19,491 | 8.73 (10,512) | 9.81 (10,713) |
| Fungi | 5,023 | 7.18 (2,519) | 9.33 (5,023) |
| Metazoa | 1,920 | 6.72 (1,304) | 8.33 (1,920) |
| Not Attributed (Bacteria or Algae) | 6,666 | 6.96 (2,503) | 8.90 (6,666) |
| Kingdom | Organisms (% of max entropy): n = 250 | n = 500 | n = 1,000 | n = 2,000 | Chemicals (% of max entropy): n = 250 | n = 500 | n = 1,000 | n = 2,000 |
|---|---|---|---|---|---|---|---|---|
| Archaeplastida | 75.5 | 80.5 | 86 | 91.7 | 76.9 | 83.7 | 89.6 | 94.3 |
| Fungi | 80.7 | 89 | 96.6 | 99.8 | 83.7 | 88.1 | 92.3 | 95.1 |
| Metazoa | 84.4 | 93 | 99.5 | 96.1 | 90 | 94.6 | 97.4 | 100 |
| Not Attributed (Bacteria or Algae) | 82.3 | 90.7 | 96.7 | 99.9 | 85.1 | 89.6 | 93.7 | 97.2 |
C.4 Evaluation of Keyword Extraction on the SemEVAL2017 Dataset
The SemEVAL2017 (Augenstein et al. 2017) evaluation dataset consists of 100 paragraphs extracted from scientific publications in various domains, with an average of 17.23 annotated keyphrases. While three sub-tasks are proposed in this challenge (keyphrase identification, classification, and semantic relation extraction), we focused only on mention-level keyphrase identification. To use settings similar to those of the synthetic abstract generation, we evaluated the precision of the top-10 extracted keywords. The comparison is done by exact match and the results are presented in Table C.5. Vicuna-13B largely outperforms the KeyBERT (Grootendorst 2020) baseline and shows more than acceptable performance in zero-shot settings. KeyBERT was used with standard parameters: keyphrase_ngram_range: (1,2), stop_words: None, use_mmr: True, diversity: 0.7, and the BERT model all-MiniLM-L6-v2 for base embeddings.
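For reference, the baseline can be reproduced (up to library versions) with a call of the following form; the input paragraph is a placeholder.

```python
from keybert import KeyBERT

kw_model = KeyBERT(model="all-MiniLM-L6-v2")
paragraph = "Two new diterpenes were isolated from the leaves of ..."  # placeholder text

keywords = kw_model.extract_keywords(
    paragraph,
    keyphrase_ngram_range=(1, 2),
    stop_words=None,
    use_mmr=True,
    diversity=0.7,
    top_n=10,   # top-10 keywords, as in the reported evaluation
)
print([keyword for keyword, score in keywords])
```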
| Dataset | Part. | # Relations (w. chem / w. class) | # Organisms | # Chemical entities (chem. / class.) | # References |
|---|---|---|---|---|---|
| Diversity-raw | train | 12,666 | 2,644 | 10,311 | 1,519* |
| Diversity-raw | valid | 1,425 | 301 | 1,211 | 168* |
| Random-raw | train | 5,102 | 1,434 | 4,286 | 1,531* |
| Random-raw | valid | 657 | 220 | 584 | 189* |
| Extended-raw | train | 27,952 | 5,642 | 21,028 | 7,111* |
| Extended-raw | valid | 3,355 | 932 | 2,741 | 790* |
| Full | train | 90,326 | 13,208 | 51,658 | 28,286 |
| Full | valid | 1,533 | 484 | 1,288 | 430 |
| Diversity-synt | train | 11,547 (10,764 / 783) | 2,154 | 9,108 / 61 | 3,562 |
| Diversity-synt | valid | 1,197 (1,096 / 101) | 220 | 998 / 37 | 389 |
| Random-synt | train | 4,825 (4,474 / 351) | 1,267 | 3,854 / 53 | 3,798 |
| Random-synt | valid | 609 (561 / 47) | 190 | 507 / 22 | 460 |
| Extended-synt | train | 28,614 (26,373 / 2,242) | 5,258 | 20,404 / 69 | 23,985 |
| Extended-synt | valid | 1,444 (1,332 / 112) | 432 | 1,122 / 37 | 1,254 |
C.5 Tokenized Length of Abstracts
Acknowledgments
The authors are thankful to Vincent Mutel, Joël Dumoulin, Joel Rossier, and Colombine Verzat for their help during the project. We are grateful to Olena Hrynenko for proofreading the mathematical formulations. We are also grateful to the authors behind LOTUS, BioGPT, and Seq2rel for sharing their data or code.
Funding
This work was supported by the IDIAP Research Institute and has been done in collaboration with the company Inflamalps SA and is supported by the Ark Foundation. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement no. 965397. The funding bodies played no role in the design of the study, research, writing, and publication of the article.
Notes
The chemical classes of a compound are determined according to NP-classifier (Kim et al. 2021) annotations in LOTUS.
Each random sample is composed of 500 random literature items sampled per kingdom.
However, we chose to focus only on abstracts, and not full texts, because of their much greater availability and their condensed form.
With the exception of the instance trained on the Full training set.
Measured with F1-score, because it penalizes models with unbalanced performance between recall and precision.
version v1.3 from 22/06/2023: https://huggingface.co/lmsys/vicuna-13b-v1.3.
llama.cpp github repo: https://github.com/ggerganov/llama.cpp.
Link to Seq2rel GitHub: https://github.com/JohnGiorgi/seq2rel.
Link to BioGPT model card: https://huggingface.co/microsoft/biogpt.
Link to BioGPT-Large model card: https://huggingface.co/microsoft/BioGPT-Large.
Link to GPT-2 model card: https://huggingface.co/openai-community/gpt2-medium.
The encoder used for Seq2rel is also PubMedBERT: https://huggingface.co/microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext.
On the 15190 generations used for the Diversity-synt dataset.
References