Evaluating approaches to identifying research supporting the United Nations Sustainable Development Goals

The United Nations (UN) Sustainable Development Goals (SDGs) challenge the global community to build a world where no one is left behind. Recognizing that research plays a fundamental part in supporting these goals, attempts have been made to classify research publications according to their relevance in supporting each of the UN's SDGs. In this paper, we outline the methodology that we followed when mapping research articles to SDGs and which is adopted by Times Higher Education in their Social Impact rankings. We compare our solution with other existing queries and models mapping research papers to SDGs. We also discuss various aspects in which the methodology can be improved and generalized to other types of content apart from research articles. The results presented in this paper are the outcome of the SDG Research Mapping Initiative that was established as a partnership between the University of Southern Denmark, the Aurora European Universities Alliance (represented by Vrije Universiteit Amsterdam), the University of Auckland, and Elsevier to bring together broad expertise and share best practices on identifying research contributions to UN's Sustainable Development Goals.


Introduction
Numerous approaches to mapping research to the United Nations (UN) Sustainable Development Goals (SDGs) have been documented (Armitage et al., 2020; Bordignon, 2021b; Confraria et al., 2021; Jayabalasingham et al., 2019; LaFleur, 2019). These approaches vary with regard to the framework used to define inclusion and exclusion criteria, the methodology employed to retrieve publications, and the publication database used. For example, the inclusion and exclusion criteria may be set conservatively to limit publications to those documenting actions taken to achieve the SDG targets, or, conversely, may be set more liberally, thereby including any papers that increase knowledge on the overall topic. With regard to the methodology employed to retrieve publications, publication sets for a specific SDG can be built with a Boolean approach only or complemented by machine learning algorithms. The source of publications that the methodology is applied to can also introduce variability, given the availability of many data sources, ranging from open access to subscription-based, or a mixture of both.
To date, there is no broadly agreed-upon methodology for mapping research to the SDGs, and existing methods produce quite different results (Armitage et al., 2020). A common approach to identifying research related to a topic is to use Boolean search expressions. The Boolean method involves the use of keywords, either alone or in combination, using conditional functions applied to specified text sections (title, abstract, keywords, etc.) of scientific publications, and results in the exclusive retrieval of articles within which the defined search expressions were found. The authors of (Armitage et al., 2020) applied the Boolean method, limiting their SDG publication sets to publications with a direct contribution to targets and/or indicators, with efforts made to reduce the impact of known issues with the Boolean technique, resulting in a more restrictive publication set. Bordignon's strategy (Bordignon, 2021b) aimed at reducing the polysemy of terms by limiting keywords from the Elsevier 2020 queries (Jayabalasingham et al., 2019) to relevant subject areas using the All Science Journal Classification (ASJC); a text-mining tool (CorTexT) was then used to enrich the selected publications. The Aurora European Universities Alliance (Schmidt & Vanderfeesten, 2021) developed and released their 169 target-level SDG queries (Vanderfeesten, Otten, & Spielberg, 2020), also using keyword combinations, Boolean operators, and proximity operators. The University of Auckland (Jingwen & Weiwei, 2022) developed queries informed by the researchers within their network, resulting in a localized version that takes into account more papers specific to Australian and New Zealand research topics. In (Confraria et al., 2021), the authors employ a two-step approach that involves building SDG-specific terms obtained from many sources (policy reports, publications, forums, etc.), applying a selection process to the terms, and then using the terms to identify citation-based communities of publications. However, as described in (Armitage et al., 2020), such a keyword-based approach involves challenges related to the interpretation of the themes and concepts of the SDGs, decisions around which publications to designate as a "contribution" to the chosen interpretation of the SDG, and the translation of concepts into a search query that will accurately identify publications.
An alternative or complementary approach to query-based methods involves using machine learning to map research articles to SDGs: either in a supervised manner, i.e., performing classification, or in an unsupervised manner, i.e., performing clustering. Supervised methods typically resort to the same SDG queries to obtain a labeled dataset to train the model (South African SDG Hub, 2020; Zhang et al., 2020). Clustering is typically done with paper text representations or citation graphs, where the resulting clusters are later mapped to SDGs either directly or via intermediate clusters, e.g., "topics" (Nakamura et al., 2019; Wastl et al., 2020). Refer to (Pukelis et al., 2020) for an overview of further methods of classifying documents into SDGs. However, they all face the same challenges noted above, and machine learning further introduces the problem of interpretability of the model predictions or the clusters obtained.
Since 2018, Elsevier has endeavored to map research to the SDGs, releasing publicly available queries to facilitate transparency and reproducibility (Jayabalasingham et al., 2019). Herein, we describe the approach taken to improve former attempts to map research to the SDGs, taking feedback into account, resulting in the creation of a more comprehensive query set with sub-queries addressing targets and indicators, and the application of a machine learning model to increase recall. This methodology ("Elsevier 2021 SDG mapping" (Rivest et al., 2021)) captures, on average, twice as many articles as the 2020 version while keeping precision above 80%. Times Higher Education (THE) uses the Elsevier SDG mapping as part of their Social Impact rankings (Ross, 2022). "Elsevier 2023 SDG mapping" (Bedard-Vallee et al., 2023) is the most up-to-date, simplified version of the queries and ML model, differing from the 2021 version in COVID-related enhancements to the SDG 3 queries and in new queries designed for SDG 17, "Partnerships for the goals".
To evaluate the approach, the output generated using the developed methodology was compared to the results generated by the Aurora European Universities Alliance (Vanderfeesten & Jaworek, 2022), the University of Auckland (Jingwen & Weiwei, 2022), the University of Bergen (Armitage et al., 2020), SIRIS Academic (Duran-Silva et al., 2019), the Bordignon queries (Bordignon, 2021a), and the ML classifier by the South African SDG Hub (South African SDG Hub, 2020).
We have seen little research aimed at similar benchmarking of different SDG mapping approaches against hand-labeled datasets. (Wulff et al., 2023) is the closest investigation to ours; apart from benchmarking, the authors explore the extent to which SDG queries produce false positives by marking non-SDG-related content with SDG labels. They also investigate the bias in SDG labeling systems, defined as the normalized difference between the number of predicted and observed (i.e., assigned by human experts) SDG labels.
The novel contributions of this paper can be summarized as follows:
• We solve the problem of recall assessment for keyword queries mapping research articles to Sustainable Development Goals, while other approaches typically focus on precision;
• We are among the first to quantitatively evaluate existing sets of such keyword queries against several validation datasets.

Developing SDG queries
The SDGs are goals to achieve rather than research topics, each SDG encompassing many targets. Using Boolean search expressions to build SDG-specific publication sets presents many challenges. Elsevier implemented a bottom-up approach to the construction of each SDG-relevant publication set, whereby several sub-queries were first constructed for each SDG target and then aggregated at the SDG level.
1.1.1 Building a query for each target within an SDG

Criteria for delineating the publication sets relevant to each SDG were designed by a team consisting of a minimum of four analysts and were based on an extensive literature review done by the team to gain an understanding of the SDG. As a first step, the SDG was further subdivided into themes to facilitate the creation of specific criteria linked to specific SDG targets. The criteria defined for each theme aimed to specify topics of focus as well as any requirements for "action terms" in association with the topics (Armitage et al., 2020). For example, for the topic of "poverty", the action term "alleviate", or other action terms with similar meaning, might be deemed a requirement. To ensure homogeneity in the approach, criteria developed by the team of analysts were submitted to a review committee consisting of both members of the SDG team and people external to the team. The review committee was responsible for reviewing the criteria, recommending changes, and giving final approval of the criteria. Table 1 presents the criteria for SDG 1 overall (SDG1-Main) and subcategories related to SDG 1. Such criteria were defined for each SDG and each theme related to the SDG.
Following the establishment of criteria defining the research areas of focus relevant for each SDG (overall and per SDG theme), these criteria were used to guide the development of queries to retrieve publication sets. Where possible, the analyst responsible for query development was selected for subject matter expertise in the field; otherwise, the process was informed by a literature review. An iterative approach was taken to assess the precision with which individual keywords and sets of keywords identified publications that met the criteria. Keywords from the Elsevier 2020 (Jayabalasingham et al., 2019) and Aurora European Universities Alliance (Schmidt & Vanderfeesten, 2021) queries were assessed first. Additional keywords were identified using term frequency-inverse document frequency (TF-IDF) analyses of text from titles, abstracts, and author keywords of publications meeting the criteria. Additional efforts were taken to identify publications that may have been excluded by the developed query. Specifically, (1) the query results were analyzed to identify specialized journals that would be expected to include a high percentage of publications fitting the criteria, and (2) the citation network of the publications retrieved by the query was assessed to identify publications within it (i.e., publications citing or cited by the publications retrieved by the query) that were not retrieved by the query. Publications from these specialized journals or the citation network that were not retrieved by the query were assessed to identify additional keywords to include in the query to increase recall. Relevant exclusions were built into the queries to increase precision; these could exclude specific terms using Boolean operators or exclude fields of science deemed outside the scope of the criteria.
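The TF-IDF keyword-suggestion step described above can be sketched as follows. This is a minimal illustration, not the actual Elsevier tooling: the toy documents and the simple TF-IDF-style scoring (term frequency over criteria-fitting documents, inverse document frequency over a background corpus) are assumptions for the sake of the example.

```python
import math
from collections import Counter

def suggest_keywords(matched_docs, background_docs, top_n=5):
    """Rank terms that are frequent in documents matching the criteria
    but rare in a background corpus, i.e. a simple TF-IDF-style score."""
    # Term frequency: in how many criteria-fitting documents each term occurs
    tf = Counter(term for doc in matched_docs for term in set(doc.lower().split()))
    # Document frequency over the background corpus, with smoothing
    n_bg = len(background_docs)
    df = Counter(term for doc in background_docs for term in set(doc.lower().split()))
    scores = {
        term: count * math.log((1 + n_bg) / (1 + df[term]))
        for term, count in tf.items()
    }
    return [t for t, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_n]]

# Hypothetical titles/abstracts of publications that fit the SDG 1 criteria,
# contrasted with an unrelated background corpus:
matched = [
    "alleviating extreme poverty with cash transfer programs",
    "cash transfer programs and poverty reduction outcomes",
]
background = [
    "deep learning for image recognition",
    "poverty statistics in national surveys",
    "protein folding simulation methods",
]
print(suggest_keywords(matched, background, top_n=3))
```

Terms such as "cash transfer programs" surface because they are common in the matched set but absent from the background, mirroring how new candidate keywords were suggested to analysts.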
To facilitate the continuous evaluation of the query, publications were manually reviewed to assess their fit against the criteria shown above (see Table 1 for SDG 1). An evaluation of a minimum of 100 random publications by two independent analysts was done to support the calculation of precision metrics for each query, and a minimum precision threshold of 90% was required for a query to be considered acceptable. Recall was assessed against independent publication sets developed by an analyst, consisting of publications from specialized journals identified to fit the criteria. As most specialized journals do not exclusively focus their content on a single SDG, a minimum recall of 60% was required for a query to be considered acceptable. In cases where no single journal was specific enough for all publications within that journal to fit the criteria set for an SDG or SDG theme, a publication set was constructed by manually selecting publications from a journal with high relevance to the SDG (or SDG theme), and recall was assessed against this set.

Table 1: An example of SDG 1 subtopics and associated SDG targets.

Subset code
Criteria Associated target

SDG1-Main
Research focused on poverty, and research as defined for any SDG1 subset below. "Action term" specified: the action term "alleviate" was applied to make the topic term "poverty" more specific.
Target 1.1: eradicate extreme poverty; Target 1.2: reduce poverty by half; all targets associated with the SDG1 subsets

SDG1-Theme1
Research focused on social programs, including all articles discussing social security systems related to health, finance, and work. No "action terms" were required for the inclusion of the topics above.
Target 1.3: Implement nationally appropriate social protection systems

SDG1-Theme2
Research focused on microfinance, access to property, inheritance, natural resources, and new technologies as they relate to facilitating access, equality, and human rights. "Action term" specified: the action term "access to" was applied.
Target 1.4: equal rights to economic resources and basic services

SDG1-Theme3
Research focused on resilience, exposure, and vulnerability to disasters (financial, climate-related, social, etc.), particularly on understanding poor and vulnerable people and communities.
No "action terms" were required for the inclusion of the topics above.
Target 1.5: build the resilience of the poor

SDG1-Theme4
Research focused on financial aid, policies, government support (such as food banks and support distribution strategies), and strategies to eradicate poverty. No "action terms" were required for the inclusion of the topics above.
Target 1A: ensure significant mobilization of resources from a variety of sources; Target 1B: create sound policy frameworks

Precision assessment
As described earlier, queries were composed gradually, starting from seed queries first developed by analysts. The queries were extended by concatenating sub-queries with Boolean "OR" expressions after evaluating the keywords suggested by the TF-IDF analysis on the seed dataset. Before adding a new search expression to the global SDG dataset, analysts were encouraged to sample at least 10 documents to ensure that high precision was maintained throughout most sub-queries and not simply for the global SDG dataset. This is important: otherwise, keywords bringing a small number of new publications but covering mostly content not relevant to the SDG could be included in the dataset, and while their impact on global precision would be relatively small, the analyst would still be introducing bad content with such terms. The sampling was performed directly in the exploration window, which could be used to quickly draw random samples of publications containing the selected keywords. This enabled analysts to vet new keywords quickly, which was necessary given the complexity of the queries needed to delineate the SDGs properly. As a target, a 90% precision level was required to commit a tentative search expression, as a lower level would lead to diminished precision for the global dataset at the end of the iterative process. This was especially critical for keywords adding many new documents to the global dataset, as lower precision for these would more greatly influence the global precision.
Although precision was assessed throughout the whole process, a more formal precision estimate was performed at the end to provide a final assessment, which would guide analysts as to whether they could stop their work or whether additional effort was needed to remove content that was deemed too broad and lowered precision. A sample of 100 publications was pulled from the global dataset, and analysts performed a manual inspection of these, the tool enabling them to tag publications as good, bad, or in-between for cases where the analyst was unsure whether a document should be included. This feature presented the advantage that final precision assessments were stored in the tool and could be consulted at any time in the future. This was especially helpful when additional validation steps were performed by the QA analyst, who was able to validate the precision assessment by assessing the same sample. If final precision was in the 90%-95% range or above, precision was deemed sufficient.
As a final step, a final QA was performed by an expert bibliometrician with more than a decade of experience in the field and in building datasets. Each query was analyzed by this expert and tested again for precision, reusing the samples pulled by each analyst but often pulling new samples as well to further solidify confidence. This additional layer of validation helped cement the process, ensuring a unified view of all SDGs, similar to what was accomplished when the criteria were defined as a group at the beginning of the process. The QA round led to multiple modifications, removals, and additions across most SDGs, often resulting in relatively minor changes in publication counts but further increasing the robustness of the alignment between the definitions and the final content retrieved by the queries.

Recall assessment
To determine the recall of the queries developed by analysts, a selection of specialized journals was identified for each SDG to serve as a stand-in for a gold standard representing the subjects at hand. This pragmatic 'proxy' for recall measurement was developed in the absence of a true gold standard for testing the recall of the queries. The absence of a gold standard is unsurprising; should such a gold standard exist, it would imply that perfectly delineated document sets for SDGs already existed, thus rendering the current exercise irrelevant. For each SDG, sets of highly relevant journals were identified using a combination of keyword searches in journal names and the percentage of journal content covered by the keyword queries. This dual approach ensured that no relevant journals would be missed simply because their names were not declarative enough to be captured. After these journals were identified, analysts aimed to maximize recall across each of these journals while maintaining high precision. Recall levels of 60%-70% were set as the original minimal level for the current exercise, based on two decades of expertise in building such datasets. Increasing recall for some categories without compromising precision is sometimes easy in subjects relying on highly declarative vocabulary, while it can become quite tricky in others, especially those mixing multiple dimensions as their core concepts. In the case of the targets of the SDGs, this notion is especially relevant, as SDGs often mix basic research with economic and social concepts.
During the process, recall against the selected gold standard of journals was tested frequently to determine if more investigation was needed to add new keywords to the queries. Analysts performed recurring analyses of the content of these journals not captured by the queries to detect any research subject not covered. TF-IDF analyses on the documents not retrieved were performed to obtain lists of suggested terms for inclusion to further increase recall. At the end of the process, if recall remained low, corrected recalls were computed by sampling among the publications not retrieved by the keyword queries and estimating which part was truly relevant to the subject at hand. Indeed, specialized journals, while usually having targeted scopes, are not always fully relevant to the topic at stake. By sampling about 50 publications, analysts were able to compute corrected recall scores by estimating the fraction of the content not covered that was indeed relevant to the subject.
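One way to read this correction can be sketched as follows. The formula below is an interpretation of the described procedure under a stated assumption (that publications retrieved from the specialized journals are relevant); the counts are hypothetical.

```python
def corrected_recall(n_retrieved, n_not_retrieved, sample_relevant, sample_size):
    """Corrected recall: treat only the estimated relevant fraction of the
    non-retrieved journal content as true misses.

    Assumes retrieved publications from the specialized journals are relevant;
    the relevance of non-retrieved ones is estimated from a manual sample
    (about 50 publications in the workflow described above).
    """
    relevant_fraction = sample_relevant / sample_size
    estimated_missed = n_not_retrieved * relevant_fraction
    return n_retrieved / (n_retrieved + estimated_missed)

# A hypothetical journal with 1000 papers: the query retrieves 550; of 50
# sampled non-retrieved papers, 30 are judged relevant to the SDG.
raw = 550 / 1000                                # 0.55 raw recall
corrected = corrected_recall(550, 450, 30, 50)  # ~0.67 corrected recall
print(round(raw, 2), round(corrected, 2))
```

The corrected score rises because part of the journal content never belonged to the SDG in the first place and should not count against the query.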
As a final step, a final QA was again performed by the expert bibliometrician. Each query was analyzed by this expert and tested for recall, investigating whether areas of each target might have been missed or left out by the analyst. The QA round led to multiple modifications, removals, and additions across most SDGs, often resulting in relatively minor changes in publication counts but further increasing the robustness of the alignment between the definitions and the final content retrieved by the queries.
Below, we refer to this recall evaluation dataset as the Elsevier recall dataset.

Machine learning applied to SDG classification
On top of the mapping produced by the queries described above, additional articles are mapped to the SDGs by a machine learning model.
In a nutshell, the model is a logistic regression trained on TF-IDF representations of titles, keywords, abstracts, and two optional text fields: main terms extracted from the full text and the subject areas of the journal that published the paper. Thus, the model learns similar keyphrases for each SDG and helps to improve the recall of the queries. To keep precision high, we keep only those papers that are classified by the model with a predicted probability of 95% or higher for some SDG.
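A minimal sketch of this kind of classifier, a TF-IDF vectorizer feeding a logistic regression, is shown below. The toy texts, labels, and pipeline configuration are illustrative assumptions, not Elsevier's actual training data or code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training examples: concatenated title/keywords/abstract text
train_texts = [
    "eradicating extreme poverty through social protection systems",
    "cash transfers and poverty alleviation in rural areas",
    "marine biodiversity conservation in coastal ecosystems",
    "sustainable fisheries and ocean acidification impacts",
]
train_labels = ["SDG1", "SDG1", "SDG14", "SDG14"]

# TF-IDF features into a logistic regression, as in the description above
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_texts, train_labels)

# Per-class probabilities for a new paper; in production only predictions
# above a high threshold (0.95 in the paper) would be kept
probs = model.predict_proba(["poverty reduction via social protection"])[0]
print(dict(zip(model.classes_, probs)))
```

With realistic training volumes, the learned coefficients per TF-IDF term play the role of the "SDG-specific keyphrases" that ease interpretation of the model's outputs.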
In the "Elsevier 2021 SDG mapping" release (Rivest et al., 2021), the Elsevier team specifies the input data for the model, the targets that it is trained with, the technical details of the model itself, and the model's performance. Also, to ease the interpretation of the model's classification outcomes, we share the SDG-specific key phrases learned by the model, as well as sample articles classified by the model. Please refer to the mentioned documentation for more details on the machine learning component of our approach.

Combining the queries and the model
The end-to-end approach to mapping scholarly records to SDGs is two-staged:
• first, the keyword SDG queries are run (orange in Fig. 1);
• then, the ML model adds about 3.5% of papers (blue in Fig. 1) on top of what is classified by the keyword queries. We only keep the most confident model predictions by thresholding predicted scores at 0.95.
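The two-stage combination can be sketched as follows. The data shapes (dictionaries keyed by paper ID) and the detail of whether the model may also add labels to query-matched papers are assumptions for illustration.

```python
def combine_mappings(query_hits, model_scores, threshold=0.95):
    """Two-stage SDG mapping: keyword-query matches first, then
    high-confidence ML predictions on top."""
    mapping = {}
    # Stage 1: keyword SDG queries
    for paper_id, sdgs in query_hits.items():
        mapping[paper_id] = set(sdgs)
    # Stage 2: only model predictions at or above the threshold are kept
    for paper_id, scores in model_scores.items():
        confident = {sdg for sdg, p in scores.items() if p >= threshold}
        if confident:
            mapping.setdefault(paper_id, set()).update(confident)
    return mapping

# Hypothetical inputs
query_hits = {"p1": ["SDG1"], "p2": ["SDG14"]}
model_scores = {
    "p2": {"SDG13": 0.97},   # confident: added on top of the query label
    "p3": {"SDG3": 0.99},    # confident: a new paper mapped by the model
    "p4": {"SDG7": 0.60},    # below 0.95: discarded
}
print(combine_mappings(query_hits, model_scores))
```

The high threshold is what keeps the ML stage a precision-preserving recall boost rather than a second, noisier classifier.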
It's worth noting that the approach is limited to the Scopus database, as the queries are written in Scopus search syntax.

2 Results

Comparison between the SDG queries
Below, we describe the SDG queries and validation datasets that we used for the comparison in terms of precision, recall, and F1 scores.

Validation sets: collection method, sizes, and quality
Table 3 provides details on the validation datasets used in the comparison; it also mentions the associated limitations and biases. It is important to mention that there is no single best validation dataset to evaluate the output of SDG classification.

Performance: queries and models measured against validation sets
Table 4 provides the evaluation results for the SDG classification methods outlined in Table 2 and the evaluation datasets described in Table 3. Each cell shows two values, the micro-average and macro-average F1-scores, in percent (%); the micro-average F1-score aggregates performance across all classes by treating each instance equally, while the macro-average F1-score computes the F1-score for each class independently and then takes the average, giving equal weight to all classes regardless of their sizes. Both precision and recall were calculated with respect to the validation sets, i.e., all predictions beyond the validation sets were ignored:
• precision is calculated as the number of correctly predicted SDG IDs divided by the number of Scopus IDs tagged with the same SDG ID in the given validation set;
• recall is calculated as the proportion of correctly predicted SDG IDs within the given validation set.

Table 2 (excerpt): descriptions of several of the compared SDG classification methods.
• University of Auckland SDG Keywords Dictionary Project: seeks to build on the processes developed by the United Nations and THE in order to create an expanded list of keywords that can be used to identify SDG-relevant research (Wang et al., 2023).
• Aurora ML v0.2 (Aurora_ml): "AI for mapping multi-lingual academic papers to the United Nations' Sustainable Development Goals (SDGs)" (Vanderfeesten & Jaworek, 2022).
• Elsevier 2023: for 2023, the SDGs use the exact same search queries and ML algorithm as the Elsevier 2022 SDG mappings, with only minor modifications to five SDGs, namely SDGs 1, 4, 5, 7, and 14. In these cases, the queries were shortened by removing exclusion lists based on journal identifiers; these exclusion lists often contained thousands of items to filter out content in journals that were not core to the SDGs.
• SIRIS: the SIRIS queries were developed by extracting key terms from the UN official list of goals, targets, and indicators as well as from relevant literature around the SDGs. The query system was subsequently expanded with a pre-trained word2vec model and an algorithm that selects related words from Wikipedia. There are multiple queries per SDG (Duran-Silva et al., 2019).
• Bordignon SDG queries (Bordignon): these queries aimed at reducing the polysemy of terms by limiting keywords from the Elsevier 2020 queries (Jayabalasingham et al., 2019) to relevant subject areas using the All Science Journal Classification (ASJC) (Bordignon, 2021b).
The same comparisons for precision and recall are found in the Appendix; see Tables 9-12.
Note that micro-averaging favors well-represented, frequent classes (like SDG 3 in our case), while high macro-averaged scores mean that the method works fairly well across all SDGs, because bad results for a single SDG affect macro-averaged metrics much more than micro-averaged ones. By attending to both micro- and macro-averaged F1 scores, we assess both aspects: how well the method classifies papers into frequent and rare classes.

Table 3 (excerpt): descriptions of the validation datasets.
• Survey data of "Mapping Research output to the SDGs" by the Aurora European Universities Alliance (AUR): 244 senior researchers from different universities in Europe and the US filled in a survey. They were only allowed to enter the survey if they were familiar with the SDG they had selected to evaluate. The first question was to provide a list of research papers they believe are relevant to that selected SDG. The second question was to handpick, from a given set of 100 papers randomly drawn from the Aurora query result set, the papers they believe (based on reading the title, abstract, journal name, and authors) belong to the selected SDG. The suggested papers and the selected papers are included in the validation set (Vanderfeesten, Spielberg, & Gunes, 2020). Size: 6741. Bias: the researchers are located at Western European universities.
• Aurora Suggested Papers (Aurora2): the papers suggested by researchers; see "Survey data of "Mapping Research output to the SDGs"". Size: 3964. The researchers involved in the survey identified themselves as having expertise in a specific SDG; they might also have had an incentive to cite their own research.
• Elsevier multilabel SDG dataset (Els_multilabel): the dataset consists of 6000 papers annotated by 3 experts each. These papers come from 5 data sources to span as diverse a set of SDG-related papers as possible; 30% of the papers are not mapped to any of the SDGs.
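The difference between micro- and macro-averaging can be made concrete with a small stdlib sketch; the label sets below are toy data, not our validation sets.

```python
from collections import Counter

def f1_scores(true_labels, pred_labels, classes):
    """Micro- and macro-averaged F1 for multilabel SDG assignments,
    given per-paper sets of labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(true_labels, pred_labels):
        for c in classes:
            tp[c] += int(c in t and c in p)
            fp[c] += int(c not in t and c in p)
            fn[c] += int(c in t and c not in p)
    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0
    # micro: pool all counts; macro: average per-class F1 with equal weight
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in classes) / len(classes)
    return micro, macro

# A frequent class (SDG3) classified well and a rare class (SDG14) missed:
true = [{"SDG3"}, {"SDG3"}, {"SDG3"}, {"SDG14"}]
pred = [{"SDG3"}, {"SDG3"}, {"SDG3"}, set()]
micro, macro = f1_scores(true, pred, ["SDG3", "SDG14"])
print(round(micro, 2), round(macro, 2))  # micro stays high, macro drops
```

Missing the single rare-class paper barely dents the micro score but halves the macro score, which is exactly why we report both.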
We conclude that there is no single best approach performing well across all validation datasets: some approaches are, on average, better at precision (e.g., Elsevier 2020 and the South African SDG ML model, see Tables 9 and 10), while others shine at recall (e.g., the Auckland queries and the Aurora ML model, see Tables 11 and 12). This finding supports the general criticism that SDG classification faces: different mapping methods typically kick off with the same keywords but then result in poorly overlapping mappings (Armitage et al., 2020; Purnell, 2022). Apart from these "qualitative" problems with SDG mappings, we now establish the "quantitative" problem: when evaluated against several hand-labeled SDG datasets, the different approaches fail to yield a clear winner.
We notice a clear "overfitting" phenomenon: Elsevier queries+ML 2022 are best when validated against Elsevier's multi-labeled dataset, while Aurora queries v.5/Aurora ML model achieve the highest F1 scores against the Aurora survey dataset.A probable explanation is that the datasets were crafted for a specific definition/operationalization of SDGs, and these definitions are undoubtedly different from one project to another.
It is important to conclude that there is no single "golden" SDG validation dataset; each one considered in our experiments comes with its own shortcomings (see Table 3, remarks on quality), and each dataset used in query development reflects a certain interpretation of the SDGs by the query developers. Similarly to how (Armitage et al., 2020) concluded that there is poor overlap in the publications found by different sets of queries, we conclude that there is no single mapping approach that clearly outperforms the others across validation datasets.

Tracking the progress of Elsevier queries
The progress with SDG query development at Elsevier was tracked both in terms of recall, as described in Section 1.1.3, and in terms of precision/recall/F1 when validated with the independently labeled Elsevier multi-label dataset.
Table 6 shows recall scores for different Elsevier queries as measured against the Elsevier recall dataset described in detail in Section 1. Table 7 shows precision, recall, and F1 scores for different Elsevier queries as measured against the Elsevier multi-label SDG dataset described in Table 3. Note that, due to the specifics of the SDG query creation methodology, it makes sense to report only recall for the first dataset. The reason is that it is labeled in a noisy way (the assumption that all papers from an SDG-specific journal contribute to the same Goal is far from perfect); thus, looking at precision (and hence F1) is not meaningful. However, reporting recall makes perfect sense: it shows how many SDG-related papers from this large dataset the queries can detect.
From Tables 6 and 7, we see that all of the Elsevier 2021-2023 queries perform about the same in terms of these metrics and provide a considerable improvement in recall (and hence F1) over the earlier 2020 version of the queries.
The metrics are quite close for the 2021-2023 versions of the queries because the 2022 and 2023 updates were not as considerable as the 2021 one. Namely, the 2022 version (Roberge et al., 2022) introduced only COVID-related changes to SDG 3. The 2023 version of the queries (Bedard-Vallee et al., 2023) introduced changes to SDGs 1, 4, 5, and 14, removing long lists of journal identifiers and replacing them with keywords.

Discussion and Perspectives
In the previous sections, we described the methodology and the evaluation results. Below, we outline possible improvements to the SDG mapping approach, including localization of the SDG queries, query generalization to non-English languages, and extension of the approach to non-article content.

Localization
Research activities do not stand alone; they are an integral part of the geographical place where they were initiated and the communities they serve. An attempt to measure SDG-related research activities can be improved by infusing the local context within which the research activities take place. A localization approach can further foster understanding of, for example, the degree to which the prevailing SDG mapping approaches capture SDG research in a geographical region that may not be well described by keywords and keyphrases semantically close to existing keyword-based queries, e.g., the Elsevier 2020 queries. Keywords with a high rank were then evaluated in more detail and manually reviewed and improved for SDG alignment.
Table 8 shows the number of University of Auckland publications between 2009 and 2020 captured by the University's queries compared with those captured by Elsevier 2020 queries.
For 13 out of the 16 SDGs documented in Table 8, the Auckland queries capture more SDG-related publications. In some cases, the number of publications captured by the Auckland queries is double that captured by the Elsevier 2020 approach. A significant proportion of the additional publications are captured through localized keywords and search terms. For example, "Te Whāriki" — the New Zealand national curriculum document for early childhood education — was used as an SDG4 keyphrase under the Auckland approach, as it pinpoints what makes a quality early childhood education curriculum with an indigenous Māori lens. It retrieved 19 SDG4 papers published by the University of Auckland, of which only 6 were counted by the Elsevier 2020 approach. A manual inspection of these 19 "Te Whāriki" papers unsurprisingly suggests the high relevance of all 19 papers to SDG4 Target 4.2 on ensuring quality early childhood development, care, and pre-primary education. In other cases, the Auckland queries also gave rise to additional keywords potentially fitting the global setting. For example, "marine biodiversity" as an Auckland keyphrase retrieved 24 SDG14 papers published by the University of Auckland, of which 19 were counted by the Elsevier 2020 approach.
Figure 2: F1 scores for the Auckland approach applied to the Aurora, Elsevier, Chilean, and OSDG datasets.
This suggests that, while the localized approach adds useful keywords and themes in some contexts, further work is required to examine each keyword and keyphrase independently to understand their impact on precision and recall and to refine the search conditions under which they should be applied. In future work, it would also be interesting to develop a contextualized SDG-label set that aligns with the contextualized SDG mapping approach (e.g., an Auckland SDG validation set) to better test the performance of the contextualized approach against more generic, global approaches.

Multilingual queries
In CRIS systems3 and repositories, there are many more publications that are not included in Scopus and are written in the local language of the country to serve a different audience. We found that we could not simply replace the keywords in the queries and expect the search to work the same in other languages, because of differing syntax and morphology rules. For this reason, Aurora chose to train mBERT models to classify SDGs. Due to the lack of non-English SDG-labeled data, we used only English training data, specifically paper abstracts.
During the evaluation, the models for SDGs 1 to 5 and 11 were applied to classify 888 German paper titles. To have a qualitative benchmark, we also performed a manual SDG classification on titles only. In doing so, we took a strict approach and tried to stay very close to the respective SDG indicators (e.g., not assigning SDG4 to publications on teacher training in Germany, as the SDG indicators only refer to teacher training in the Global South).
The manual classification resulted in 43 SDG-related publications, whilst the ML models identified 58. The total overlap between these two methods was only 8 publications. The overlap occurred mainly for SDG3 — Good Health and Well-being (5 of 8) — and can most likely be explained by the great similarity in terminology between English and German for issues such as multiple sclerosis, psychotherapy, suicide, alcohol, and illegal drugs (in German: Multiple Sklerose, Psychotherapie, Suizid, Alkohol, illegale Drogen).
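The agreement between the two methods can be quantified with a small helper (illustrative only; the publication ids below are synthetic, chosen to reproduce the reported counts of 43, 58, and 8):

```python
# Compare two sets of classified publications (e.g., manual vs. ML labels)
# and report overlap statistics. Ids are arbitrary hashable identifiers.
def overlap_stats(manual_ids, model_ids):
    manual, model = set(manual_ids), set(model_ids)
    shared = manual & model
    return {
        "manual": len(manual),
        "model": len(model),
        "overlap": len(shared),
        # Jaccard index: overlap relative to the union of both selections.
        "jaccard": len(shared) / len(manual | model),
    }

# Synthetic ids sized to match the German-title evaluation: 43 manual,
# 58 model-flagged, 8 in common.
stats = overlap_stats(range(43), range(35, 93))
```

With these counts the Jaccard index is 8/93 (about 0.09), underscoring how little the two selections agree despite similar overall volumes.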
At the current phase of evaluation, the multilingual ability of the ML models for research output in German cannot be positively assessed. However, further analysis that includes publication abstracts as input to the ML models may improve classification quality.

Generalization to other types of content
In addition to SDG-related research outputs, higher education institutions have a strong interest in understanding SDG-related educational activities, as done in the Aurora SDG Course Catalogue.4 These SDG labels have been added manually by the course coordinators, but such a process is labor-intensive and not sustainable, since it needs to be repeated year after year.
Like publication metadata (e.g., title, abstract, keywords), many course catalogs and curriculum management systems capture similar metadata (e.g., title, short course description, long course description). Whether SDG research mapping techniques can be translated and applied to SDG course mapping is therefore an interesting question for many institutions.
A study was conducted by the University of Auckland to apply the Auckland queries to classify courses taught by the university. The mapping results identified 792 SDG courses out of the 2441 courses in total offered to students in the academic year 2020. Compared with the frequency and distribution of keywords in research mapping, course mapping demonstrated a higher concentration of keywords used to convey the SDG topics. For example, the 24 University of Auckland courses related to SDG14 are fully captured by the top ten keywords in the Auckland queries by frequency (i.e., marine; fisheries; coastal management; pollut*; aquaculture; marine environment; fisheries management; eutrophical*; aquatic ecosystem; alga*).
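A minimal sketch of such wildcard keyphrase matching (an illustration, not the Auckland implementation) might look like the following, using the SDG14 terms listed above; treating `*` as a word-prefix wildcard is an assumption about the query syntax:

```python
# Toy wildcard keyphrase matcher for course titles/descriptions.
# Assumes a trailing '*' means "match any word with this prefix".
import re

SDG14_TERMS = ["marine", "fisheries", "coastal management", "pollut*",
               "aquaculture", "marine environment", "fisheries management",
               "eutrophical*", "aquatic ecosystem", "alga*"]

def term_to_pattern(term):
    # Trailing * becomes a word-prefix match; everything else is literal.
    if term.endswith("*"):
        return re.compile(r"\b" + re.escape(term[:-1]) + r"\w*", re.IGNORECASE)
    return re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)

PATTERNS = [term_to_pattern(t) for t in SDG14_TERMS]

def matches_sdg14(text):
    """Return the terms that fire on a course title or description."""
    return [t for t, p in zip(SDG14_TERMS, PATTERNS) if p.search(text)]
```

For instance, `matches_sdg14("Algal blooms and coastal pollution in estuaries")` fires on `pollut*` and `alga*`, while "Introduction to linear algebra" fires on nothing, since "algebra" does not begin with the prefix "alga".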

Conclusion
In this paper, we outlined the methodology behind mapping research to the United Nations (UN) Sustainable Development Goals (SDGs), how it compares to other existing methods, and how well it performs on existing SDG validation datasets. We conclude that there is no single best approach that performs well across all validation datasets, although the Elsevier queries are slightly more stable. We also conclude that there is no single "golden" SDG validation dataset; each one considered in our experiments comes with its own shortcomings, and each dataset used in query development bears the intrinsic bias of the query developers' SDG interpretations. We observed that Elsevier's queries show a measurable improvement from the original 2020 version to the 2021/2022/2023 versions. Finally, we discussed possible improvements to the existing approach: localization of the queries and generalization to other languages and data types.

Appendix
Table 9: Precision scores with micro/macro averaging (percentages, %) for 10 classification methods and 5 validation datasets. Bolded is the best result in the column; an asterisk marks multiple "winners" depending on micro- or macro-averaging.

Figure 1 :
Figure 1: Distribution of the number of papers mapped by the queries (SM, orange) and by the model (ML, blue), by SDG (ignoring SDG 17).
Bergen queries (Bergen TAA, Bergen TBA): created queries for Web of Science to retrieve SDG-related publications for a limited number of SDGs. The queries have been translated for Scopus, and a sample of the results has been taken as positive examples. These have been supplemented by other publications, which did not appear in the queries, as negative examples. Two datasets were created, one based on the Action Approach queries and one based on the Topic Approach queries, referred to as Bergen TAA and Bergen TBA, respectively (Armitage et al., 2020).
Elsevier queries 2020 (Els_2020): "Identifying research supporting the United Nations Sustainable Development Goals" (Jayabalasingham et al., 2019).
Elsevier queries+ML 2021 (Els_2021): "Improving the Scopus and Aurora queries to identify research that supports the United Nations Sustainable Development Goals (SDGs) 2021" (Rivest et al., 2021).

Table 2 :
SDG classification methods (both keyword queries and ML models) used in the evaluation.

Table 3 :
SDG validation datasets used in the evaluation.

Table 4 :
F1 scores with micro/macro averaging (percentages, %) for 10 classification methods and 5 validation datasets. Bolded is the best result in the column; an asterisk marks multiple "winners" depending on micro- or macro-averaging.

Table 6 :
Elsevier queries validated against the Elsevier recall dataset (see Section 1). Micro- and macro-averaged values for recall are reported.

Table 7 :
Elsevier queries validated against the Elsevier multi-label SDG dataset (see Table 3). P stands for precision, R for recall, and F1 for F1-score. Micro- and macro-averaged values are reported.

Table 8 :
University of Auckland publications captured by the Auckland v2 queries and the Elsevier 2020 queries (Wang et al., 2023).
The Auckland approach (Wang et al., 2023) is one such localization attempt, based on Elsevier's earlier 2020 queries, a mixture of the UN official targets and indicators, and the search terms suggested by the Sustainable Development Solutions Network (SDSN). The n-gram model was applied to two samples of Scopus publication metadata, i.e., a global publication sample and a University of Auckland publication sample. The n-gram tokens were scored by a range of factors, including counts and measures of frequency, and were then ranked by those scores.

Table 11 :
Recall scores with micro/macro averaging (percentages, %) for 10 classification methods and 5 validation datasets. Bolded is the best result in the column; an asterisk marks multiple "winners" depending on micro- or macro-averaging.