Abstract
This paper examines the comparative effectiveness of a specialized compiled language model and a general-purpose model such as OpenAI’s GPT-3.5 in detecting sustainable development goals (SDGs) within text data. It presents a critical review of large language models (LLMs), addressing challenges related to bias and sensitivity. The necessity of specialized training for precise, unbiased analysis is underlined. A case study using a company descriptions data set offers insight into the differences between the GPT-3.5 model and the specialized SDG detection model. While GPT-3.5 boasts broader coverage, it may identify SDGs with limited relevance to the companies’ activities. In contrast, the specialized model zeroes in on highly pertinent SDGs. The importance of thoughtful model selection is emphasized, taking into account task requirements, cost, complexity, and transparency. Despite the versatility of LLMs, the use of specialized models is suggested for tasks demanding precision and accuracy. The study concludes by encouraging further research to find a balance between the capabilities of LLMs and the need for domain-specific expertise and interpretability.
1. INTRODUCTION
In the realm of artificial intelligence (AI), the rise of large language models (LLMs) such as OpenAI’s generative pretrained transformer (GPT) series has introduced unprecedented capabilities in text summarization and classification (Min, Ross et al., 2021; Yoo, Park et al., 2021). These AI juggernauts can dissect vast quantities of text, distill key points, and even classify documents with a level of speed and accuracy that leaves human ability far behind (Jiang, Hu et al., 2022). While we applaud these advances, it is imperative to keep a clear perspective on their inner workings, particularly their training data and decision-making procedures.
The advent of LLMs has undoubtedly revolutionized text analytics, but it has also introduced novel challenges concerning sensitivity and potential biases (Albrecht, Kitanidis, & Fetterman, 2022; Liang, Wu et al., 2021). Inherent in the training of these models is their susceptibility to embedding the biases present in the training data, a subtle yet pervasive issue that can later be extremely difficult to detect and rectify (Alvi, Zisserman, & Nellåker, 2019; Zhang & Verma, 2021). It is crucial, therefore, to scrutinize not only the LLMs themselves but also the mechanisms that train them. The broad and diverse nature of subjects that these models deal with, ranging from mundane queries to sensitive matters, necessitates a systematic and rigorous training approach.
Specialized language models that are trained meticulously, keeping the aforementioned factors in mind, can significantly reduce the risk of introducing biases and inaccuracies. Such models allow researchers to engage deeply with the training process, collecting appropriate data, performing diligent feature engineering, and fine-tuning the model’s sensitivity to ensure that it is capable of handling the complexities of real-world texts.
To demonstrate the value of this specialized approach, we turn our focus to a case study involving the Sustainable Development Goals (SDGs) initiative. Established through the United Nations’ 2030 Agenda in 2015, the SDGs provide a shared framework that guides stakeholders, from countries to corporations, in addressing pressing social, environmental, and economic challenges (Rosati & Faria, 2019; UN General Assembly, 2015; VNK, 2020).
The paper follows with a case study focused on the SDGs initiative, demonstrating the value of a specialized approach in SDG detection. Section 2 provides context for the study and highlights various interpretations and categorizations of SDGs. Section 3 describes the data used in the analyses. Section 4 outlines the development process of the specialized SDG detection model and the experimental designs for comparing the performance of the GPT-3.5 model and the specialized model. Section 5 presents the findings from the comparative analyses, discussing the overlap and limitations of each model. Section 6 reflects on the implications of the observed differences and provides insights into the trade-offs between general and specialized models. Finally, Section 7 summarizes the main findings and reflects on the learnings from these experiments.
2. BACKGROUND
Despite widespread adoption of the 2030 Agenda, the SDGs lack unanimous interpretation, and some argue that the goals are too vague (Sianes, Vega-Muñoz et al., 2022; Spangenberg, 2017). Scholars have developed competing categorizations and indicators to enhance understanding of SDG application, suggesting a diversity of interpretation even among experts (Diaz-Sarachaga, Jato-Espino, & Castro-Fresno, 2018; Hametner & Kostetckaia, 2020; Lehtonen, Sébastien, & Bauler, 2016; Tremblay, Fortier et al., 2020). Nonetheless, expert consensus has been successfully employed for the purpose of validating natural language processing (NLP) models built to automate SDG classification tasks (Guisiano, Chiky, & De Mello, 2022). This demonstrates the potential of NLP models to achieve classification performance on par with experts, with the added benefit of reduced subjectivity bias.
General and specialized language models exhibit distinct architectural and methodological variances. General language models, such as BERT, are trained on vast amounts of diverse text from various domains, enabling them to have broad language understanding capabilities (Quevedo, Cerny et al., 2023). In contrast, specialized language models are tailored to specific domains, such as legal documents or biology, where utilizing in-domain text for training can enhance their performance (Quevedo et al., 2023). Recent advances in NLP have led to the development of powerful LLMs capable of performing a wide range of tasks. However, despite their impressive capabilities, general-purpose LLMs often struggle with domain-specific tasks that require specialized knowledge (Brown, Mann et al., 2020; Shen, Tenenholtz et al., 2024; Singhal, Azizi et al., 2023; Touvron, Lavril et al., 2023; Vinod, Chen, & Das, 2023).
The challenges of adapting LLMs to specialized domains such as healthcare, as discussed in Shen et al. (2024) and Zhang, Zheng et al. (2023), suggest that simpler domain-specific models may be more practical and easier to develop in some cases, especially when large in-domain data sets are not available for LLM fine-tuning. It is important to note that not all applications require the adaptation of an LLM, whose performance can be difficult to diagnose. In critical cases where adequate training data is available, training a classifier using machine learning (ML) methods is preferable. While these models are robust in representing the behavior of the training data, they are also lighter and easier to diagnose than LLMs. Concerns around the interpretability and potential biases of large black-box models such as LLMs could make simpler, more transparent ML models preferable in high-stakes domains such as healthcare diagnosis, where clinician oversight and autonomy are important (Abu-Jeyyab, Alrosan, & Alkhawaldeh, 2023; Fisch, Kliem et al., 2024; Huang, Zheng et al., 2023; Kavakiotis, Tsave et al., 2017; Kourou, Exarchos et al., 2015; Malek, Wang et al., 2023; Reese, Danis et al., 2023; Salvatore, Cerasa et al., 2014; Takahashi, Hana et al., 2023; Zack, Lehman et al., 2023).
Lehman, Hernandez et al. (2023) explore whether LLMs, trained primarily with general web text, are suitable for specialized domains such as clinical text. Their empirical analysis across various clinical tasks suggests that smaller specialized clinical models can substantially outperform larger LLMs in contexts that demand specific domain knowledge. Yang and Piantadosi (2017) demonstrate that a general learning setup, without specific constraints, can acquire key structures of natural language from limited data, suggesting that both general and specialized language models can benefit from incorporating learning mechanisms that mimic general cognitive processes in humans. Lin, Tan et al. (2023) investigate the trade-off between the specialty of fine-tuned foundation models and their generality, showing that pursuing specialty during fine-tuning can lead to a loss of generality in the model, related to catastrophic forgetting. This highlights the importance of balancing between specialized and general capabilities in language models. Tsipras, Santurkar et al. (2018) discuss the inherent tension between adversarial robustness and standard generalization in models, indicating that training robust models may lead to a reduction in standard accuracy, which has implications for both general and specialized language models. To summarize the key differences between general and specialized language models, Table 1 provides an overview of their characteristics, strengths, and limitations, highlighting the importance of considering the specific requirements of the target domain when choosing between adapting an LLM or developing a specialized model.
Table 1. Overview of the characteristics, strengths, and limitations of general and specialized language models.

| Aspect | General language models (LLMs) | Specialized language models | References |
| --- | --- | --- | --- |
| Training data | Trained on vast amounts of diverse text from various domains | Trained on domain-specific text | Quevedo et al. (2023) |
| Language understanding | Broad language understanding capabilities | Enhanced performance in specific domains | Quevedo et al. (2023) |
| Domain-specific tasks | Often struggle with tasks that require specialized knowledge | Tailored to handle domain-specific tasks effectively | Brown et al. (2020), Shen et al. (2024), Singhal et al. (2023), Touvron et al. (2023), Vinod et al. (2023) |
| Adaptation to specialized domains | Challenging to adapt, especially when large in-domain data sets are not available for fine-tuning | Simpler and more practical to develop when adequate domain-specific training data is available | Shen et al. (2024), Zhang et al. (2023) |
| Interpretability and bias | Concerns around interpretability and potential biases due to their black-box nature | Simpler and more transparent, making them preferable in high-stakes domains such as healthcare | Abu-Jeyyab et al. (2023), Fisch et al. (2024), Huang et al. (2023), Kavakiotis et al. (2017), Malek et al. (2023), Reese et al. (2023), Salvatore et al. (2014), Takahashi et al. (2023), Zack et al. (2023) |
| Performance in specialized domains | Smaller specialized models can outperform larger LLMs in contexts that demand specific domain knowledge | Designed to excel in their target domains | Lehman et al. (2023) |
| Learning mechanisms | Can benefit from incorporating learning mechanisms that mimic general cognitive processes in humans | | Yang and Piantadosi (2017) |
| Specialty–generality trade-off | Pursuing specialty during fine-tuning can lead to a loss of generality (catastrophic forgetting) | Balancing specialization and generalization is important | Lin et al. (2023) |
| Robustness–accuracy trade-off | Training robust models may lead to a reduction in standard accuracy | | Tsipras et al. (2018) |
In our endeavor to build a language model capable of identifying and understanding the nuances of the SDGs in text, we meticulously compiled a data set and trained our model, keeping sensitivity and bias reduction as our primary targets. A detailed explanation of this process is given by Hajikhani and Suominen (2022). The culmination of this project was a rigorous experiment, deploying both our specialized SDG model and OpenAI’s latest offering, GPT-3.5, on a selection of company text data. Our aim was to assess the models’ sensitivity, comparing their detection and categorization of references to the SDGs in these texts. This formulates our primary comparative analysis. Given the divergent constructions and objectives of the specialized language model and GPT-3.5, we would expect a fuzzy overlap in the classification results, where detections exhibit similarity for a core set of textual inputs but diverge when dealing with less evident detections in which the models’ approach to nuance and context plays a larger role.
SDG detection presents a well-suited case for conducting this comparative study, as it reflects the complexities encountered in numerous real-world scenarios in which classifications are not mutually exclusive. Determining the relevant SDG(s) within a given piece of text can be intricate, requiring the consideration of nuance and contextual factors. Further, detections may differ between experts (or models) as these factors become more complex. In addition to the primary analysis, we conduct two supplementary analyses to further our understanding of GPT-3.5's SDG detection performance. The first supplementary analysis investigates the variation in GPT-3.5's categorization performance when given a prescribed description in comparison with a description produced from its own capabilities. The second is a short exercise in which GPT-3.5's performance is evaluated using few-shot learning and a sample of observations taken from our specialized model's labeled training data set. This exercise offers a view into the feasibility and suitability of few-shot learning as a mechanism for guiding the GPT-3.5 model towards more targeted results.
In the following sections of this article, we delve into the details of this case study, offering insights into the challenges and successes encountered during the process. Through this analysis, we highlight the importance of an active role in training and developing specialized language models, showcasing how a thoughtful approach to AI development can lead to a more sensitive, unbiased, and accurate understanding of text, particularly in critical domains where the use of machine learning classifiers trained on adequate domain-specific data may be preferable to adapting large language models.
3. DATA
The data sample analyzed in the following analyses is used in conjunction with the ongoing INNOSDG project: Mapping Sustainable Development Activity; Its Evolution and Impact in Science, Technology, Innovation and Businesses. The project aims to operationalize big data approaches and create empirical tools to capture sustainable development activities resulting from Research and Development (R&D), public funding, or ecosystem collaboration.
The data sample is sourced from the public data bank (tietopankki) of Business Finland, Finland's innovation policy agency. The sample is derived from a set of companies identified by Business Finland as young and ambitious. It totals 3,299 Finnish firms founded between 2009 and 2022. Prescribed company descriptions are sourced from three data providers: Vainu, CB Insights, and Pitchbook. Prescribed descriptions are available for 2,576 of the companies. Both the OpenAI GPT-3.5 model and the specialized SDG model were deployed on this set of company descriptions. From the GPT-3.5 deployment, a classification result was returned for all prescribed descriptions. From the specialized SDG model deployment, 187 prescribed descriptions did not pass the model's text segment eligibility requirement. The final sample of companies eligible for the first comparison between the specialized SDG model and the GPT-3.5 model totals 2,389.
Note that an even distribution of SDG detection is not expected. Certain SDGs are likely to be more relevant to industry than others, and certain SDGs are more widely applicable across sectors while the relevance of others may be more sector specific. For example, SDG9 (Industry, Innovation, and Infrastructure) is expected to be detected at high rates, because this SDG is inherently relevant to young and ambitious firms likely engaging in innovation activity across sectors. On the other hand, one would expect relatively low detection of, for example, SDG3 (Good Health and Well-being) as it is unlikely to be heavily detected outside of health sector firms. Meanwhile, little to no detection would be expected for SDGs such as SDG16 (Peace, Justice, and Strong Institutions) and SDG17 (Partnership for the Goals) because these SDGs are more relevant to the activity of large, public institutions, and unlikely to be heavily represented in the business descriptions of young and ambitious firms.
For the second analysis, focusing on GPT-3.5's performance on prescribed descriptions versus GPT-3.5 descriptions, firms founded after September 2021 were removed from the sample. OpenAI has noted that the GPT-3.5 model was trained on data through September 2021, so it has no basis for a GPT-based description of companies founded after this date. This results in a total sample of 2,550 for the second comparative analysis. The final analysis, testing the use of few-shot learning in GPT-3.5's categorization performance, utilizes a small sample derived from the specialized model's labeled training data set. This data is further described alongside the analysis, and a more detailed description can be found in Hajikhani and Suominen (2022).
4. METHODS
In this section, we delineate our methodological approach for evaluating the performance of GPT-3.5 against a specialized, compiled machine learning (ML) model, focusing specifically on the detection of SDGs. We first elucidate the development process of the specialized SDG detection model, then elaborate on the strategies and experimental designs conceived to create benchmarking scenarios. One avenue of experimentation involves contrasting the specialized SDG detection model against GPT-3.5's understanding of SDGs, which is invoked through multistage prompting. This is supplemented with an additional experiment that capitalizes on GPT-3.5's knowledge of companies. In practical terms, this implies querying GPT-3.5 for a company's description by providing the company's name as an input.
The next experimental design leverages our initial training data, originally used for constructing the specialized ML model, to exploit GPT-3.5's few-shot learning capability. The objective of this test was to feed labeled data into the GPT-3.5 model and evaluate its performance in assigning labels to unseen text. Figure 1 provides a visual representation of our methodological pipeline, further illustrating our approach.
4.1. Specialized SDG Detecting Model
The custom model was designed to discern SDG annotations within science, technology, and innovation literature. The first stage entailed creating a lexical query utilizing an SDG terminology database. This was achieved by first utilizing the taxonomy curated by Scopus SciVal and its effort to compile SDG queries in "Identifying research supporting the United Nations Sustainable Development Goals." The process of curating the lexical keywords was then complemented by analysis of the UN Sustainable Development Goals documents. From a semantics perspective, each word or concept was expanded to lexically similar words. The extracted list of keywords was then matched with existing taxonomies from sources such as Elsevier (2015), Jia, Wei, and Li (2019), UNSDG (2019), and Vatananan-Thesenvitz, Schaller, and Shannon (2019). Using the curated keywords, queries were compiled for each SDG and searched in the SCOPUS publication database to obtain relevant scientific publications from 2015–2020, a period chosen for its correspondence with the initiation of the 2030 SDG Agenda. This process produced a corpus of publications pertinent to SDGs.
An advanced taxonomy, incorporating extant taxonomies and UN SDG document analyses, was constructed to categorize SDG-relevant publications. Lexically similar words were identified for each term, followed by compilation and search of individual SDG queries within the SCOPUS database. Bibliometric data were subsequently extracted per SDG category. The selected publications’ titles and abstracts were utilized to train a model for automated detection of unseen SDG documents.
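As a rough illustration of this lexical-query stage, the sketch below matches per-SDG keyword lists against publication text. The keywords and the record shown are illustrative placeholders rather than the actual curated taxonomy or SCOPUS data.

```python
import re

# Illustrative keyword fragments per SDG; the curated taxonomy drawn from Scopus SciVal
# queries and UN SDG documents is far more extensive than this placeholder.
SDG_KEYWORDS = {
    "SDG2": ["food security", "malnutrition", "sustainable agriculture"],
    "SDG7": ["renewable energy", "energy efficiency", "clean energy"],
}

def match_sdgs(text, keyword_map=SDG_KEYWORDS):
    """Return the set of SDGs whose keywords appear in a title/abstract string."""
    lowered = text.lower()
    return {
        sdg
        for sdg, terms in keyword_map.items()
        if any(re.search(r"\b" + re.escape(term) + r"\b", lowered) for term in terms)
    }

# Hypothetical publication record (title and abstract concatenated)
record = "Drought-tolerant crops for food security and sustainable agriculture."
print(match_sdgs(record))  # -> {'SDG2'}
```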
Machine learning (ML) classification algorithms were utilized to construct an SDG-detection model that is adept at identifying publications related to Sustainable Development Goals (SDGs). In preparing the training data set for the ML model, an even distribution of observations across all SDGs was meticulously ensured, thus preventing any particular SDG from being overrepresented. Python was used for data structuring and ML model creation. Several classification methods were compared, using 70% of the data for training and the remaining 30% for testing.
Preclassification text underwent preprocessing for consistency. Various text modeling strategies were evaluated, including Term Frequency Inverse Document Frequency (TF-IDF), Word2vec, and Doc2vec.
TF-IDF transformed text into a numerical representation, recognizing a term's significance within a document relative to other documents. Word2vec employed a neural network to generate a vector space model in which contextually similar words lie in proximity. The skip-gram variant of Word2vec was used for improved accuracy with rare words. Doc2vec, an extension of Word2vec, provided vector representations at the document level. This was accomplished using Python's Gensim Word2vec feature, with the Google News corpus as a pretrained model. The process can be seen in Figure 2.
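A minimal sketch of such a training pipeline is given below, assuming TF-IDF features, a logistic regression classifier, and toy data in place of the labeled SCOPUS abstracts; the actual model compared several algorithms and text representations on a 70/30 train/test split.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Placeholder abstracts and SDG labels; the real training set is evenly
# distributed across all SDG categories.
texts = [
    "Improving crop yields to reduce hunger and malnutrition.",
    "Grid integration of solar and wind power for clean energy access.",
    "School feeding programmes and food security in rural areas.",
    "Energy efficiency retrofits for residential heating systems.",
]
labels = ["SDG2", "SDG7", "SDG2", "SDG7"]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=42, stratify=labels  # 70% train, 30% test
)

model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test), zero_division=0))
```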
4.2. GPT-3.5's Logic and Knowledge for SDG Detection
4.2.1. Experiment I
The first experiment compares SDG detection by the GPT-3.5 model with the specialized SDG model using prescribed company descriptions sourced from Vainu, CB Insights, and Pitchbook as described. This is followed by Experiment II, which focuses on the GPT-3.5 model and explores its SDG detection when performed on a prescribed description versus a GPT-3.5 description.
The GPT-3.5 model was executed on the sample in April 2023. To facilitate this process, the model was deployed within the Google Sheets environment, utilizing OpenAI’s integration with the platform. A connection was established with the OpenAI API using the required authentication key, and a given prompt was iterated over the provided set of descriptions. Each description, along with the prompt, was sent to the API for analysis, and the resulting responses were efficiently retrieved and stored in the same workbook as the original input. The limited length of the descriptions (median 73 words) meant that OpenAI’s token limits were not a barrier for this particular task.
For analysis of the prescribed company descriptions, the model was implemented in two rounds using the prompts and specifications shown in Table 2.
Table 2. Prompts and specifications for GPT-3.5 deployment on prescribed company descriptions (Experiment I).

| Prompt step | Prompt | Value | Temperature | Token limit | Model |
| --- | --- | --- | --- | --- | --- |
| 1.1 | Does this text indicate direct contribution to any SDGs? If no SDG is directly relevant, just say NA. | Prescribed company descriptions | 0 | No max | GPT-3.5 Turbo |
| 1.2 | List the SDGs mentioned in this text before the word "however." | Response to prompt 1.1 | 0 | No max | GPT-3.5 Turbo |
Both prompts were deployed using a temperature of 0 to minimize creativity, and no maximum token limit to allow for comprehensive response “reasoning.” GPT-3.5 Turbo, the default OpenAI model, was utilized for all analyses. The initial prompt (1.1) requires direct SDG contribution in the text and provides an exit if no relevant SDG is detected to mitigate irrelevant classification. With no token limit, the model was given the freedom to elaborate on its SDG detection, enabling us to manually validate and sanity check a sample of responses. It was observed that the model occasionally indicated positive SDG contributions followed by negative ones, consistently separated with the word “however.” To address this, a second prompt (1.2) was introduced as a simple cleaning mechanism, extracting positive SDGs mentioned only.
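The prompts were deployed through OpenAI's Google Sheets integration; the following is a rough Python equivalent of the same two-step loop, assuming the openai v1 client and an API key available in the environment. Prompt wording, temperature, and model follow Table 2; the example description is hypothetical.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_1 = ("Does this text indicate direct contribution to any SDGs? "
            "If no SDG is directly relevant, just say NA.")
PROMPT_2 = 'List the SDGs mentioned in this text before the word "however."'

def ask(prompt, text):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,  # minimize creativity, as in the study
        messages=[{"role": "user", "content": f"{prompt}\n\n{text}"}],
    )
    return response.choices[0].message.content.strip()

def detect_sdgs(description):
    step1 = ask(PROMPT_1, description)   # prompt 1.1: free-form detection
    if step1.upper().startswith("NA"):
        return "NA"
    return ask(PROMPT_2, step1)          # prompt 1.2: keep only positive detections

# Hypothetical prescribed company description
print(detect_sdgs("The company builds battery storage for wind and solar farms."))
```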
GPT-3.5 and the specialized model can both detect multiple SDGs from a single input. As a result, the models may simultaneously identify the same SDGs and different SDGs for a single company description. We use a nonrestrictive intersection to identify overlaps in SDG detection, considering any common SDGs between the model results as an overlap. For example, from the same text input one model may detect SDGs 7 and 9 while the other model detects only SDG7. Based on our chosen method, this is considered an overlap rather than a divergent result. This approach captures all overlaps, accounting for variations in sensitivity and lexical detection of each SDG by the models. The results can thus be considered an upper bound of common SDG detection by the two models.
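In code, this nonrestrictive intersection reduces to an ordinary set intersection per company; the short sketch below uses the example from the text.

```python
def is_overlap(gpt_sdgs, specialized_sdgs):
    """Nonrestrictive intersection: any shared SDG counts as an overlap."""
    return bool(set(gpt_sdgs) & set(specialized_sdgs))

# One model detects SDGs 7 and 9, the other only SDG 7: counted as an overlap.
print(is_overlap({"SDG7", "SDG9"}, {"SDG7"}))  # True
# No common SDG: counted as a divergent result.
print(is_overlap({"SDG3"}, {"SDG12"}))         # False
```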
4.2.2. Experiment II
In this second experiment, our methodology involved harnessing the language model’s capabilities to generate descriptions for a broad range of companies. This strategy provided us with an opportunity to evaluate the language model’s competency in portraying corporate activities, a process informed by the model’s comprehensive knowledge base derived from its training data. Subsequently, we deployed the same model to spotlight the SDG orientation within these company descriptions. This aspect of the methodology is based on the model’s inherent comprehension and logical categorization of SDGs, which can potentially illuminate the presence and extent of SDG-aligned activities within each company’s operational framework.
For the implementation using GPT-3.5's own description capabilities, only one prompt, specified in Table 3, was required. The response to this prompt relies on the model's capacity to recall and analyze information it has been exposed to during its training. It is assumed that the information regarding the companies in the sample would have been accessible on the open internet (via company websites, articles, or other online media) and incorporated into the model's training process. OpenAI has acknowledged that the model was trained on the open internet, though specific details regarding the training corpus have not been disclosed.
Table 3. Prompt and specifications for GPT-3.5 deployment on company names (Experiment II).

| Prompt step | Prompt | Value | Temperature | Token limit | Model |
| --- | --- | --- | --- | --- | --- |
| 1 | Give a comma-delimited list of any SDG(s) this company's work contributes to. If no SDG is relevant just say NA. | Company names | 0 | No max | GPT-3.5 Turbo |
Similar to Experiment I, a nonrestrictive intersection is used to identify cases in which GPT-3.5 detects a common SDG between its deployment on prescribed company descriptions and its own GPT-3.5 derived information.
4.3. GPT-3.5 for Few-Shot Learning
The methodology of this experiment consisted of assessing the performance of the GPT-3.5 model under a few-shot learning scenario. The model utilized a labeled training data set derived from SCOPUS journal article abstracts, each labeled according to the corresponding SDG using Scopus SciVal’s taxonomy.
We derived a random sample of 200 observations from the original data set of 31,998 entries. For this exercise, we selected two specific SDGs—SDG2 and SDG7—to test and utilized the “GPT_Tag” function of the Google Sheets GPT-3.5 extension, providing 10 examples (five each of SDG2 and SDG7). The examples were selected randomly from the SDG2 and SDG7 stratifications of the original data set, excluding the sample of 200 already selected. The GPT-3.5 Tag function provided ideal deployment because it allows for user-defined classification tags and examples, guiding the model’s outputs. The specification of the model parameters and prompts is in Table 4.
Table 4. Prompt and specifications for the few-shot learning exercise.

| Prompt step | Prompt | Value | Temperature | Token limit | Model |
| --- | --- | --- | --- | --- | --- |
| 1 | 10 abstracts and their SDG labels were provided as examples (five SDG2 and five SDG7). "SDG2, SDG7" were specified as the list of tags. | Journal abstract (label unseen) | 0 | No max | GPT-3.5 Turbo |
We set the temperature to 0, minimizing creativity, and used the GPT-3.5 Turbo model for this analysis. We enforced neither a token limit nor a maximum number of tags. The model was then deployed against the sample of 200 abstracts.
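The exercise itself relied on the Google Sheets extension's GPT_Tag function; the sketch below is an approximate Python equivalent of the same few-shot setup. The prompt format and the placeholder example abstracts are assumptions for illustration, not a reproduction of the extension's internals.

```python
from openai import OpenAI

client = OpenAI()

TAGS = ["SDG2", "SDG7"]
# Ten labeled abstracts (five per SDG) served as examples; two placeholders are shown here.
EXAMPLES = [
    ("Biofortified staple crops to reduce micronutrient deficiency in children.", "SDG2"),
    ("Levelized cost of offshore wind power under transmission constraints.", "SDG7"),
]

def tag_abstract(abstract):
    shots = "\n\n".join(f"Abstract: {text}\nTags: {label}" for text, label in EXAMPLES)
    prompt = (
        f"Tag each abstract with any of the following tags: {', '.join(TAGS)}.\n\n"
        f"{shots}\n\nAbstract: {abstract}\nTags:"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,  # minimize creativity; no token or tag limit enforced
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

print(tag_abstract("Irrigation scheduling to stabilize smallholder food production."))
```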
5. RESULTS
This section first presents an analysis and comparison of the OpenAI GPT-3.5 model and the specialized model for SDG detection. A sample of company descriptions was used to examine the models' effectiveness in identifying SDGs. The results were also analyzed in terms of the quantity and relevance of the SDGs detected by each model. Further, a comparative assessment of SDG detection using the GPT-3.5 model was carried out on prescribed descriptions and generated descriptions. Lastly, we conducted an experiment using the GPT-3.5 model for few-shot learning on a labeled data set to understand the model's capacity to detect SDGs based on limited examples. Each subsection delves into the results and interpretations of these experiments, providing detailed insights into the strengths and limitations of the GPT-3.5 and specialized models for SDG detection.
5.1. Experiment I: Comparison of GPT-3.5 and Specialized Model for SDG Detection
The performance and capabilities of OpenAI GPT-3.5 and the Specialized SDG model in detecting SDGs from prescribed company descriptions were analyzed for a sample of 2,389 companies. Descriptive statistics are provided in Table 5. Of the total sample, 62.45% (1,492) showed an overlap between the two models, regardless of whether they detected any SDGs. Among the cases where both models detected at least one SDG, the overlap was only 10.46% (250 observations), indicating that the majority of overlaps occur where both models detected no SDGs.
Table 5. Descriptive statistics for SDG detection by GPT-3.5 and the specialized model on prescribed company descriptions.

| Statistic | Value | Percent of total companies |
| --- | --- | --- |
| Total companies | 2,389 | – |
| Intersection: GPT-3.5 (prescribed description) vs. specialized model (including companies with no detected SDGs) | 1,492 | 62.45 |
| Companies with detected SDGs | | |
| GPT-3.5 (prescribed description) | 1,019 | 42.65 |
| Specialized model | 421 | 17.62 |
| Intersection: GPT-3.5 (prescribed description) vs. specialized model | 250 | 10.46 |
| Average number of SDGs detected per company | | |
| GPT-3.5 (prescribed description) | 1.74 | – |
| Specialized model | 1.12 | – |
Figure 3 illustrates the SDG detection between the two models for companies where at least one SDG was detected by either model. OpenAI GPT-3.5 identified SDGs for 42.65% (1,019) of the companies (the blue circle on the left), while the specialized model detected SDGs for 17.62% (421) of the companies (the red circle on the right). Although the GPT-3.5 model had a higher detection rate, this does not necessarily imply better performance in identifying SDGs. The brevity and generality of company descriptions (median 73 words) limit the opportunity for SDG-related text to be present. Unless an SDG is highly relevant to a company’s activity, it is unlikely to be detected in such a concise statement. SDG detection by the specialized model indicates a high level of reliability, although with a conservative detection rate. In contrast, the GPT-3.5 model casts a wider net, detecting SDGs in more than twice the number of company descriptions. This suggests a more liberal interpretation of SDG classification by the GPT-3.5 model, which may also dilute the meaningfulness of detections.
Returning to Table 5, both models can detect multiple relevant SDGs for a given input text. On average, OpenAI GPT-3.5 identified 1.74 SDGs per description, while the specialized model detected 1.12 SDGs. This aligns with the GPT-3.5 model’s liberal identification approach—along with detecting SDGs in a broader range of descriptions, the GPT-3.5 model also identifies more SDGs per description.
Figure 4 shows each SDG’s detection rate, representing the percentage of descriptions in which the SDG is detected, between the two models. The rates differ significantly in magnitude, reflecting the higher detection rates by the GPT-3.5 model. However, the distributions are relatively similar and share two out of three top detected SDGs. The top three SDGs detected by the GPT-3.5 model in this sample are SDGs 9, 4, and 3, while the top three from the specialized SDG model are SDGs 9, 3, and 12.
Overall, the results suggest that the specialized model exhibits a more conservative and robust approach to detecting SDGs among the analyzed companies. With a lower percentage of companies (17.62%) and a lower average number of SDGs (1.12) detected per company, the specialized model focuses on more relevant and specific indicators of SDGs. On the other hand, OpenAI GPT-3.5 demonstrates a more liberal approach, identifying SDGs for a higher percentage of companies (42.65%) and detecting a higher average number of SDGs (1.74) per company. While this may indicate a broader coverage, it also suggests the possibility of including less relevant or tangential information related to SDGs.
When choosing between these models, it is important to consider the specific context and purpose of SDG detection. The specialized model prioritizes precision, focusing on highly relevant SDGs, resulting in a more reliable and focused assessment. On the other hand, OpenAI GPT-3.5 provides a broader analysis which could be useful for exploratory purposes as it captures a wider range of information. However, a caveat is that this may include both highly relevant and minimally relevant SDGs.
5.2. Experiment II: Comparison of GPT-3.5 SDG Detection on Prescribed Description vs. GPT-3.5 Description
Table 6 presents the comparison of OpenAI GPT-3.5's SDG detection on prescribed descriptions and GPT-3.5-generated descriptions for a sample of 2,550 companies. Of the total sample, 81.10% (2,086) show an overlap between the prescribed description and the GPT-3.5 description, regardless of whether an SDG was detected. SDGs were identified for 40.71% (1,038) of companies through the prescribed description approach and for 48.27% (1,231) through the GPT-3.5 description.
Table 6. Comparison of GPT-3.5 SDG detection on prescribed descriptions vs. GPT-3.5-generated descriptions.

| Statistic | Value | Percent of total companies |
| --- | --- | --- |
| Total companies | 2,550 | – |
| Intersection: prescribed description vs. GPT-3.5 description (including companies with no detected SDGs) | 2,086 | 81.10 |
| Companies with detected SDGs | | |
| Prescribed description | 1,038 | 40.71 |
| GPT-3.5 description | 1,231 | 48.27 |
| Intersection: prescribed description vs. GPT-3.5 description | 890 | 34.90 |
| Average number of SDGs detected per company | | |
| Prescribed description | 1.73 | – |
| GPT-3.5 description | 2.89 | – |
A large number of overlaps occur where both approaches detected SDGs, with 890 companies having at least one SDG in common between the two approaches. This overlap, focusing on the companies for which SDGs were detected, is depicted in Figure 5.
The average number of SDGs detected per company is 1.73 for the prescribed description and 2.89 for the GPT-3.5 description. There are two possible explanations for this substantial difference. On one hand, the approach that uses GPT-3.5 description, which likely relies on a broader range of company information beyond the prescribed description, may result in a higher number of SDGs detected per company because it has access to a greater quantity of information about the company. This would indicate a potentially more comprehensive analysis. Alternatively, given a broader base of information to draw from for the company, the GPT-3.5 description approach may be detecting a greater number of SDGs with low or tangential relevance to the companies’ activity, due to the liberal tendency of the GPT-3.5 model’s SDG detection.
Figure 6 shows that the distribution of SDGs detected between the two approaches is fairly similar, though overall magnitudes differ.
Overall, the prescribed description and GPT-3.5 description approaches have a considerable intersection and perform similarly in terms of coverage. However, there is a substantial difference in the average number of SDGs detected per company: the GPT-3.5 description exhibits a much higher average number of SDGs per company than detection from the prescribed description.
5.3. GPT-3.5 for Few-Shot Learning Exercise
In addition to the GPT-3.5 deployments for this overlap analysis, a separate exercise was performed to investigate GPT-3.5's performance under few-shot learning using data from the specialized model's labeled training data set. This data set comprises SCOPUS journal article abstracts that have been labeled with the appropriate SDG to which the article relates via the taxonomy curated by Scopus SciVal. The data is described in greater detail by Hajikhani and Suominen (2022). For this exercise, a random sample of 200 observations was selected from the labeled training data set of 31,998 observations. The distribution of SDG labels across this sample of 200 is shown in Table 7 (N).
For this analysis, the Google Sheets GPT-3.5 extension's integrated function "GPT_Tag" was utilized due to its suitable construction. The Tag function includes in its list of parameters a user-defined set of classification tags and a table of examples. This allows the user to pass a list of specified outputs, as well as a set of expected input-output pairs, to the model for consideration in its evaluation of the input value. Due to the limited number of tokens GPT-3.5 allows per API call and the average length of abstracts in the data sample (avg. 218 words), a maximum of 10 examples was permissible per call. This precluded the ability to provide the model with one example per SDG, much less multiple examples per SDG. As a result, only two SDGs were chosen for this test: SDG2 and SDG7. The list of tags specified was limited to "SDG2, SDG7" with the expectation that this would confine the model's output to only those SDGs for which it was provided examples. Ten observations (five for SDG2 and five for SDG7) were randomly selected from the full training data (excluding the 200 observations already selected for the validation sample) to populate the table of examples. In the model's deployment, a temperature of 0 was used to minimize creativity, no token limit was enforced, no maximum number of tags was enforced in the output (to allow for both SDG2 and SDG7 to be tagged simultaneously), and the GPT-3.5 Turbo model was used. The analysis was run in June 2023. Table 7 displays the analysis results.
As described, a random sample of 200 observations selected from the Specialized model’s training set forms the validation data of this analysis. The distribution of SDG labels across this sample is given in column 2 and the expected model output in column 3. Because we limited the tags and examples to SDGs 2 and 7, we expect the model to give 0 tags for the abstracts associated with the remaining SDGs. However, the model does not behave in this manner. Despite the tag list not including the other 15 SDGs, the model identified these SDGs (except for SDG17) in its output anyway. This suggests that the model disregards the limitation and employs its knowledge of the SDGs not just on the tags specified, but further extrapolates to tag unlisted SDGs as well.
Total identification (correct or incorrect) shows extreme overidentification in the two SDGs for which examples were provided. However, these are not the only SDGs that experience overidentification, as SDGs 3, 13, and 14 are also overidentified. Meanwhile, SDGs 9 and 16 show the most underidentification. While 195 SDGs are identified overall, the model identifies SDGs for only 67.5% of the abstracts. In some cases, the model identified multiple SDGs per abstract, with an overall average of 1.44 SDGs per abstract among those with at least one identified SDG. When accounting for abstracts in which we would expect the model to return no SDGs (i.e., abstracts labeled with SDGs other than 2 and 7), we observe that 45% of the model's output is in line with expectation. Importantly, the model detects SDGs 2 and 7 in 100% of the abstracts with those SDG labels. This suggests a very high capture rate with few-shot learning. Note, however, that overidentification was also extremely high in these two SDGs.
Finally, despite the limitation to SDG2 and SDG7 in the tag list and table of examples provided to the model, it does identify the remaining SDGs, as well. However, these are largely misidentifications. Notably, by comparing the total identifications (columns 4 and 5) with the correct identifications (columns 8 and 9) of SDGs 9, 10, 11, and 16, we can see that the model did not identify the correct SDG in any of these cases. Other SDGs have a better rate of correct detection: for example, SDG6 (60%) and SDG14 (50%). Overall, 34% of the abstracts are identified with the correct label.
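The figures reported here come down to simple counts over true labels and model tags; a minimal sketch of that arithmetic, with hypothetical predictions, is shown below.

```python
# Each entry: (true SDG label, set of SDG tags returned by the model). Hypothetical data.
results = [
    ("SDG2", {"SDG2"}),           # correct capture
    ("SDG7", {"SDG7", "SDG13"}),  # captured, with an extra tag
    ("SDG9", set()),              # expected empty output (label outside the tag list)
    ("SDG6", {"SDG14"}),          # misidentification
]

total_tags = sum(len(pred) for _, pred in results)
tagged_share = sum(1 for _, pred in results if pred) / len(results)
correct_share = sum(1 for true, pred in results if true in pred) / len(results)

few_shot_sdgs = {"SDG2", "SDG7"}
n_few_shot = sum(1 for true, _ in results if true in few_shot_sdgs)
captured = sum(1 for true, pred in results if true in few_shot_sdgs and true in pred)

print(f"SDGs identified in total: {total_tags}")
print(f"Share of abstracts with any tag: {tagged_share:.0%}")
print(f"Share identified with the correct label: {correct_share:.0%}")
print(f"Capture rate for SDG2/SDG7 abstracts: {captured / n_few_shot:.0%}")
```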
6. DISCUSSION
The comparison between a specialized SDG classification model and OpenAI GPT-3.5 elucidates notable differences in SDG detection performance. The specialized SDG model is specifically trained and tuned to detect relevant SDGs from the input text with precision and reliability, thus revealing a more robust but conservative approach. On the other hand, OpenAI GPT-3.5, as a general language model, is more liberal, identifying SDGs in a broader spectrum of company descriptions.
A significant point of deviation between these models is the apparent difference in specificity of SDG detection, as observed in the analysis. The OpenAI GPT-3.5 model, with its general training, leans towards detecting a higher number of SDGs per description and in a higher percentage of companies overall. This reflects the model’s ability to capture a range of information, but at the same time might lead to less meaningful detection due to the more liberal application of SDG relevance.
Meanwhile, the specialized SDG model, by nature of its focused training, detected fewer SDGs in company descriptions. This is an indication of the model’s conservative approach, limiting detections to SDGs that are highly relevant to the company’s activities. The specificity in the model’s detection suggests that the detected SDGs are likely to be more meaningful and significant in relation to the company’s activity. This aligns with the notion that expert consensus can be successfully employed for validating NLP models built to automate SDG classification tasks, as demonstrated by Guisiano et al. (2022). The specialized model’s performance, on par with experts but with reduced subjectivity bias, highlights the potential of such models in achieving reliable SDG classification.
When working with the OpenAI GPT-3.5 model, it is important to consider not only the model's analysis capabilities but also the input on which it is exercised. GPT-3.5 has been trained on a vast amount of data, granting it an impressive ability to retrieve, summarize, and analyze information. However, specific details about the training data have not been publicly disclosed. This lack of transparency makes it challenging to determine the sources on which a GPT-3.5 description is based. This becomes even more of a black box when the information requested from the GPT-3.5 model does not have abundant or clear sources for spot-check verification (such as details of a young startup). The GPT-3.5 description approach yielded substantially different results in terms of detection rate than the prescribed description approach. This may be seen as wider detection ability, but it should also highlight the need for caution when relying on GPT-3.5 generated content for analyses requiring focused and robust detection.
The role of human expertise in interacting with automated SDG classification models cannot be overstated. Despite the advances in AI and NLP, the SDGs lack unanimous interpretation, and some argue that the goals are too vague (Sianes et al., 2022; Spangenberg, 2017). Scholars have developed competing categorizations and indicators to enhance understanding of SDG application, suggesting a diversity of interpretation even among experts (Diaz-Sarachaga et al., 2018; Hametner & Kostetckaia, 2020; Lehtonen et al., 2016; Tremblay et al., 2020). In this context, the involvement of SDG experts in validating and fine-tuning these models becomes crucial. Their domain knowledge and understanding of the nuances in SDG interpretation can guide the development of more accurate and reliable automated classification systems.
Our final experiment presents thought-provoking considerations for the use of GPT-3.5 with few-shot learning. A key takeaway of this experiment lies in the limitations it reveals for few-shot learning with GPT. There are considerable limits to the use of few-shot learning with GPT-3.5 due to the model’s constraints. The API used to access the GPT-3.5 model has a restricting token limit. Considering the average length of abstracts in our data set, this confined us to a maximum of 10 examples per interaction. This imposed a significant constraint on our ability to test the model with few-shot learning, as it prohibited the presentation of one example per SDG, let alone multiple instances per goal. As a result, it was not possible to evaluate the GPT-3.5 model’s capability to generalize from a handful of examples to the full array of SDGs in a controlled manner. Despite our adjustment limiting the experiment to two SDGs, the model heavily extrapolated and categorized 16 SDGs in the test, with varying degrees of accuracy. This suggests that restrictions placed on the model through available channels may not be reliably binding.
The use of few-shot learning with GPT-3.5 as an avenue for task adaptation or more specialized performance should be carefully considered. Our experiment demonstrates the limitations of this method for use cases involving long input or a high number of classification categories. Further, unexpected results may be difficult to interpret given the relative obscurity of the model’s procedures. While this method may be useful under certain conditions, it does not replace the abilities of a specialized language model with task-specific training. In critical domains such as healthcare, where clinician oversight and autonomy are important, simpler, more transparent ML models may be preferable due to concerns around the interpretability and potential biases of large black-box models such as LLMs (Fisch et al., 2024; Huang et al., 2023; Reese et al., 2023; Takahashi et al., 2023).
Our methodological contribution involves a comparative analysis between GPT-3.5 and the specialized SDG model, providing valuable insights into the strengths and weaknesses of both general and task-specific AI models. We not only assessed performance metrics but also scrutinized their usability, interpretability, and limitations. The involvement of human expertise in the development and validation of these models is paramount, as it can help address the challenges posed by the lack of unanimous interpretation of SDGs and the potential biases inherent in large language models.
Furthermore, the method of juxtaposing a general model’s performance with a specialized model, considering differences in training data and potential biases, adds another dimension to our understanding of AI capabilities. This study allows us to contemplate the potential of large language models beyond their initial training objectives, providing insights into how we can more effectively leverage their capabilities while being mindful of their limitations and the importance of human expertise in guiding their development and application.
Last, our analysis evaluating both false positives and false negatives presents a holistic picture of the models’ capabilities and highlights potential areas for future research. This could prove instrumental in the fine-tuning of these models or even in the development of hybrid models that blend the strengths of both general and specialized models, with the guidance of domain experts to ensure their reliability and relevance in real-world applications.
7. CONCLUSION
The observed deviation between the Specialized SDG model and OpenAI’s GPT-3.5 underscores the need for careful consideration in the application of these models. It accentuates the trade-off between the vast coverage of general models such as GPT-3.5 and the precision of specialized models. This disparity, resulting from differences in training data and model parameter tuning capabilities, warrants thoughtful contemplation. Importantly, despite some observed overlap in the results produced by LLMs such as GPT-3.5 and specialized models, their performance should not be considered interchangeable. The choice between a general or specialized model must be dictated by the specific requirements of a given task. Broad, catchall classification tasks may benefit from LLMs, whereas precision-focused tasks necessitate the use of specialized models.
While the progression of LLMs suggests the potential for more nuanced SDG detection in the future, it is contingent upon researchers engaging in more specific data training and further model parameter tuning. This approach could allow LLMs to match the precision of specialized models without compromising their broad coverage. However, this observation comes with a vital cautionary note. It is clear that LLMs, such as GPT-3.5, operate as black-box models, leaving us without a clear view of how they arrive at their conclusions. Consequently, their expansive, liberal application could lead to unpredicted and, in some cases, undesired outcomes. As such, a more reliable approach, especially when accuracy and transparency are of utmost importance, would be to utilize a specialized model tailored to the task at hand.
Our experiment extends beyond the specific comparison of GPT-3.5 and the specialized model. It offers a unique vantage point on contrasting a highly specialized machine learning model with an autonomous LLM. It invites scholars to consider carefully the trade-offs of using LLMs, including their cost, complexity, and opacity. It also underlines that for many applications, developing a specialized model tailored to the task at hand might be more straightforward, cost-efficient, and transparent. While LLMs are undoubtedly powerful and versatile, their use should not be considered a one-size-fits-all solution. Researchers and practitioners are encouraged to explore other alternatives, such as compiling and training a machine learning model on their own data. In this light, our study underlines that there is no universal answer, and the choice of the model should be dictated by the task, the data, and the specific requirements of each use case.
Moreover, our study underscores the importance of an active role for human expertise in training and developing specialized language models for SDG classification. By leveraging the knowledge and insights of SDG experts, we can work towards developing more sensitive, unbiased, and accurate AI models that can effectively support the understanding and implementation of the SDGs across various domains. The comparative analysis between the general GPT-3.5 model and the specialized SDG model highlights the trade-offs between broad applicability and domain-specific accuracy, emphasizing the need for a thoughtful approach to AI development that considers the specific requirements of the target domain and the role of human expertise in guiding the process.
AUTHOR CONTRIBUTIONS
Arash Hajikhani: Conceptualization, Data curation, Formal analysis, Funding acquisition, Methodology, Resources, Supervision, Visualization, Writing—original draft, Writing—review & editing. Carolyn Cole: Conceptualization, Data curation, Formal analysis, Methodology, Visualization, Writing—original draft, Writing—review & editing.
COMPETING INTERESTS
The authors have no competing interests.
FUNDING INFORMATION
This work is supported by Business Finland under the project “Mapping Sustainable Development Activity; Its Evolution and Impact in Science, Technology, Innovation and Businesses (INNOSDG)” and VTT Technical Research Centre of Finland project 132376. An earlier version of this paper was presented at the AII2021 workshop and was listed in arXiv as a preprint: https://doi.org/10.48550/arXiv.2307.15425.
DATA AVAILABILITY
The company descriptions data set used for the comparative analyses between the GPT-3.5 model and the specialized SDG detection model was sourced from Business Finland’s data bank. Additionally, we compiled other descriptions which include data from paid providers such as Vainu, CB Insights, and Pitchbook; due to the proprietary nature of these sources, the full data set cannot be made publicly available, but access can be requested from the corresponding author, subject to obtaining necessary permissions. The composition of the specialized machine learning model and details on the data used for its development can be accessed from Hajikhani and Suominen (2022).
REFERENCES
Author notes
Handling Editor: Vincent Larivière