OpenCitations Meta

OpenCitations Meta is a new database that contains bibliographic metadata of scholarly publications involved in citations indexed by the OpenCitations infrastructure. It adheres to Open Science principles and provides data under a CC0 license for maximum reuse. The data can be accessed through a SPARQL endpoint, REST APIs, and dumps. OpenCitations Meta serves three important purposes. Firstly, it enables disambiguation of citations between publications described using different identifiers from various sources. For example, it can link publications identified by DOIs in Crossref and PMIDs in PubMed. Secondly, it assigns new globally persistent identifiers (PIDs), known as OpenCitations Meta Identifiers (OMIDs), to bibliographic resources without existing external persistent identifiers like DOIs. Lastly, by hosting the bibliographic metadata internally, OpenCitations Meta improves the speed of metadata retrieval for citing and cited documents. The database is populated through automated data curation, including deduplication, error correction, and metadata enrichment. The data is stored in RDF format following the OpenCitations Data Model, and changes and provenance information are tracked. This article describes OpenCitations Meta and its production. OpenCitations Meta currently incorporates data from Crossref, DataCite, and the NIH Open Citation Collection. Among semantic publishing datasets, it is currently the largest in data volume.

While the coverage of the OpenCitations Indexes has approached parity with that of commercial proprietary citation indexes (see https://opencitations.hypotheses.org/1420), there have been outstanding issues not formerly addressed by OpenCitations.
First is citation disambiguation. Sometimes, bibliographic resources will have been assigned multiple identifiers, such as a DOI and a PMID. In such cases, the same citation may be represented in several ways depending on the data source. For example, OpenCitations will describe in COCI a citation between two publications using metadata derived from Crossref as a DOI-to-DOI citation, and in POCI the same citation using metadata derived from PubMed as a PMID-to-PMID citation. This duplication poses problems when counting the number of ingoing and outgoing citations of each document, a crucial statistic for libraries, journals, and Scientometrics studies. Use of OpenCitations Meta permits us to deduplicate such citations and solve the problems that such duplication would otherwise cause.
Second, the assignment of globally persistent identifiers to documents is not universal practice across all scholarly fields. Gorraiz et al. (2016) demonstrated that the Natural and Social Sciences communities adopt DOIs to a much greater extent than the Arts and Humanities community. From that research, carried out on Scopus and the Web of Science Core Collection, it emerged that almost 90% of the publications in the Sciences and Social Sciences are associated with a DOI, while in the Arts and Humanities that figure is only 50%. In addition, concerning the Humanities, citations of ancient primary sources lacking DOIs (e.g. Aristotle) are required in many fields (e.g. in History). If a document has no identifier, its metadata does not respect the FAIR principles (Wilkinson et al., 2016) that scholarly digital research objects must be findable, accessible, interoperable and reusable. A globally unique and persistent identifier is critical to make metadata findable and accessible. Moreover, a bibliographic resource without an identifier prevents citations involving it from being described in adherence to the FAIR principles. This is the reason why, according to the Open Citation Definition (Peroni & Shotton, 2018) governing the population of OpenCitations Indexes, any two entities linked by an indexed citation must both be identified by a persistent identifier coming from the same identifier scheme, for example both with DOIs, or both with PubMed IDs. For example, COCI (Heibi et al., 2019b) only stores citation information where the citing and cited entities are described in Crossref and both have DOIs. Citations involving publications lacking DOIs or other recognised PIDs have hitherto been excluded from the OpenCitations citation indexes.
But now, OpenCitations Meta solves the problems posed by bibliographic resources identified by multiple identifiers and also by bibliographic resources that lack persistent identifiers, by associating a new globally persistent identifier, an OpenCitations Meta Identifier (OMID), with each document described in OpenCitations Meta. In this way, all citations can be represented as OMID-to-OMID citations (Fig. 1). By providing a unique identifier for every entity stored in OpenCitations Meta, the entity's OMID acts as a proxy between the different external identifiers used for each entity, enabling disambiguation. Moreover, OpenCitations Meta can contain metadata for all scholarly publications, each identified by an OMID, without the mandatory need for an external persistent identifier to be provided by the source of the metadata.
Thus, thanks to OpenCitations Meta, metadata for all scholarly publications can now be stored by OpenCitations, and citations linking all such publications can be included within a new inclusive OpenCitations Index, of which the other indexes (COCI, DOCI, POCI, etc.) will be sub-indexes, according to the various input sources of the citation information.
Third is the previously poor temporal performance of OpenCitations' services, in particular API operations returning basic bibliographic metadata of citing and cited resources. This is because the OpenCitations Indexes themselves have hitherto contained only citation-related metadata (citations being treated as First Class data entities with their own metadata), but have not held bibliographic metadata relating to the citing and cited entities (title, authors, page numbers, etc.). Rather, those metadata have hitherto been retrieved on-the-fly by means of explicit API requests to external services such as Crossref, ORCID and DataCite.
Over the past three years, to address the issues mentioned above, we have developed and tested the software we are now using to create a new bibliographic metadata collection, namely OpenCitations Meta, which we launched in December 2022. The software supporting this database is open source, and available at https://github.com/opencitations/oc_meta. The metadata exposed by OpenCitations Meta includes the basic bibliographic metadata describing a scholarly bibliographic resource. In particular, it stores all known bibliographic resource identifiers for the bibliographic resource (e.g. DOI, PMID, ISSN, and ISBN), the title, type, publication date, pages, the venue of the resource, and the volume and issue numbers where the venue is a journal. In addition, OpenCitations Meta contains metadata regarding the main actors involved in the publication of each bibliographic resource, i.e. the names of the authors, editors, and publishers, each including their own persistent identifiers (e.g. ORCIDs) where available. It is our intention to add additional metadata fields (e.g. authors' institutions and funding information) at a later date.
The process of generating OpenCitations Meta can be divided into two steps. The first step involves the curation of the input data. The curatorial procedure concerns the automatic correction of errors, the standardisation of the data format, and the deduplication of separate metadata entries for the same item. The deduplication process is based only on identifiers. This approach favours precision over recall: for instance, people are deduplicated only if they have an assigned ORCID, and never by other heuristics. After the normalisation and deduplication stages, each entity is assigned an OpenCitations Meta Identifier (OMID), whether or not it already has an external persistent identifier (e.g. DOI, PubMed ID, ISBN).
The second step in populating OpenCitations Meta involves converting the raw input data into RDF (Linked Open Data format) compliant with the OpenCitations Data Model (OCDM) (Daquino et al., 2020), to enable querying such data via SPARQL. During this process, great attention is given to provenance and change-tracking: every time an entity is created, modified, deleted or merged, such changes are recorded in RDF, and are characterised by their dates of creation, primary sources, and responsible agents.
The rest of the paper is organised as follows. Section 2 reviews other semantic publishing datasets. Subsequently, in Section 3, the methodological approach adopted to produce OpenCitations Meta is presented in detail, starting with the curatorial phase (3.1), then describing error correction (3.2), moving to an explanation of the data translation to RDF according to the OCDM (3.3), and concluding with a description of the production of the RDF provenance and change-tracking data (3.4). Section 4 provides some descriptive statistics regarding the current OpenCitations Meta dataset. Finally, Section 5 discusses some present limitations of OpenCitations Meta and considers where OpenCitations Meta stands among similar scholarly datasets.

Related works
In this section, we will review the most important scholarly publishing datasets to which access does not require subscription, i.e. publicly available datasets holding scholarly bibliographic metadata. Since OpenCitations Meta uses Semantic Web technologies to represent data, special attention will be given to RDF datasets, namely Wikidata, Springer Nature SciGraph, BioTea, the Open Research Knowledge Graph and ScholarlyData. In addition, the OpenAIRE Research Graph, OpenAlex and Semantic Scholar will be described, as they are the most extensive datasets in terms of the number of works, although they do not represent data semantically.
OpenAlex (Priem et al., 2022) rose from the ashes of the Microsoft Academic Graph on January 1st 2022, and inherited all its metadata. It includes data from Crossref (Hendricks et al., 2020), PubMed (Maloney et al., 2013), ORCID (Haak et al., 2012), ROR (Lammey, 2020), DOAJ (Morrison, 2017), Unpaywall (Dhakal, 2019), arXiv (Sigurdsson, 2020), Zenodo (Research & OpenAIRE, 2013), the ISSN International Centre, and the Internet Archive's General Index. In addition, web crawls are used to add missing metadata. With over 240 million works, OpenAlex is the most extensive bibliographic metadata dataset to date. OpenAlex assigns persistent identifiers to each resource. In addition, authors are disambiguated through heuristics based on co-authors, citations, and other features of the bibliographic resources. The data are distributed under a CC0 licence and can be accessed via API, web interface or by downloading a full snapshot copy of the OpenAlex database.
The OpenAIRE project started in 2008 to support the adoption of the European Commission Open Access mandates (Manghi et al., 2010), and it is now the flagship organisation within the Horizon 2020 research and innovation programme to realise the European Open Science Cloud (European Commission. Directorate General for Research and Innovation, 2016). One of its primary outcomes is the OpenAIRE Research Graph, which includes metadata about scholarly outputs (e.g. literature, datasets and software), organisations, research funders, funding streams, projects, and communities, together with provenance information. Data are harvested from a variety of sources (Atzori et al., 2017): archives, e.g. arXiv (Sigurdsson, 2020), Europe PMC (The Europe PMC Consortium, 2015), Software Heritage (Abramatic et al., 2018) and Zenodo (Research & OpenAIRE, 2013); aggregator services, e.g. DOAJ (Morrison, 2017) and OpenCitations (Peroni & Shotton, 2020); and other research graphs, e.g. Crossref (Hendricks et al., 2020) and DataCite (Brase, 2009). As of June 2023, this OpenAIRE dataset consisted of 232,174,001 research products. The deduplication process implemented by OpenAIRE takes into account not only PIDs but also other heuristics, such as the number of authors and the Levenshtein distance of titles. However, the internal identifiers OpenAIRE associates with entities are not persistent and may change when the data are updated. Data of the OpenAIRE Research Graph can be accessed via an API and the Explore interface. Dumps are also available under a Creative Commons Attribution 4.0 International Licence.
Semantic Scholar was introduced by the Allen Institute for Artificial Intelligence in 2015 (Fricke, 2018). It is a search engine that uses artificial intelligence to select only the papers most relevant to the user's search and to simplify exploration, e.g. by producing automatic summaries. Semantic Scholar sources its content via web indexing and partnerships with scientific journals, indexes, and content providers. Among those are the Association for Computational Linguistics, Cambridge University Press, IEEE, PubMed, Springer Nature, The MIT Press, Wiley, arXiv, and HAL. As of June 2023, it indexes 212,605,886 scholarly works. Authors are disambiguated via an artificial intelligence model (Subramanian et al., 2021) and associated with a Semantic Scholar ID, and a page is automatically generated for each author, which the real person can claim. Semantic Scholar provides a web interface and APIs, and the complete dataset is downloadable under the Open Data Commons Attribution Licence (ODC-By) v1.0.
Wikidata was introduced in 2012 by Wikimedia Deutschland as an open knowledge base to store in RDF data from other Wikimedia projects, such as Wikipedia, Wikivoyage, Wiktionary, and Wikisource (Mora-Cantallops et al., 2019). Due to Wikidata's success, in 2014 Google closed Freebase, which had been intended to become "Wikipedia for structured data", and migrated it to Wikidata (Tanon et al., 2016). Since 2016, the WikiCite project has contributed significantly to the evolution of Wikidata as a bibliographic database, such that, by June 2023, Wikidata contained descriptions of 39,864,447 academic articles. The internal Wikidata identifier referring to any entity (including bibliographic resources) is associated with numerous external identifiers, e.g. DOI, PMID, PMCID, arXiv, ORCID, Google Scholar, VIAF, Crossref funder ID, ZooBank and Twitter. The data are released under a CC0 licence as RDF dumps in Turtle and NTriples. Users can browse them via SPARQL, a web interface and, as of 2017, via Scholia, a web service which performs real-time SPARQL queries to generate profiles on researchers, organisations, journals, publishers, academic works and research topics, while also generating valuable infographics (Nielsen et al., 2017).
While OpenAIRE Research Graph and Wikidata aggregate many heterogeneous sources, Springer Nature SciGraph (Hammond et al., 2017) aggregates only data from Springer Nature and its partners. It contains entities concerning publications, affiliations, research projects, funders and conferences, totalling more than 14 million research products. There is no current plan to offer a public SPARQL endpoint, but there is the possibility to explore the data via a browser interface, and a dump is released monthly in JSON-LD format under a CC-BY licence.
BioTea is also a domain-oriented dataset, and represents the annotated full-text open-access subset of PubMed Central (PMC-OA) (Garcia et al., 2018) using RDF technologies. At the time of that 2018 paper, the dataset contained 1.5 million bibliographic resources. Unlike other datasets, BioTea describes metadata and citations and defines the annotated full-texts semantically. Named-entity recognition analysis is adopted to identify expressions and terminology related to biomedical ontologies that are then recorded as annotations (e.g. about biomolecules, drugs, and diseases). BioTea data are released as dumps in RDF/XML and JSON-LD formats under the Creative Commons Attribution Non-Commercial 4.0 International licence, while the SPARQL endpoint is currently offline.
A noteworthy approach is that adopted by the Open Research Knowledge Graph (ORKG) (Auer et al., 2020). Metadata are mainly collected either by trusted agents via crowdsourcing or automatically from Crossref. However, ORKG's primary purpose is not to organise metadata but to provide services. The main scope of these services is to perform a literature comparison analysis using word embeddings to enable a similarity analysis and foster the exploration and linking of related works. To enable such sophisticated analyses, metadata from Crossref is insufficient; therefore, structured annotations on the topic, result, method, educational context and evaluator must be manually specified for each resource. The dataset contains (as of June 2023) 25,680 papers, 5153 datasets, 1364 software and 71 reviews. Given the importance of human contribution to the creation of the ORKG dataset, the platform keeps track of changes and provenance, although not in RDF format. The data can be explored through a web interface, SPARQL, and an API, and can also be downloaded under a CC BY-SA licence.
ScholarlyData collects information only about conferences and workshops on the topic of the Semantic Web (Nuzzolese et al., 2016). Data are modelled following the Conference Ontology, which describes typical entities in an academic conference, such as accepted papers, authors, their affiliations, and the organising committee, but not bibliographic references. Up to June 2023, the dataset stored information about 5678 conference papers. Such a dataset is updated by employing the Conference Linked Open Data generator software, which outputs RDF starting from CSV files (Gentile & Nuzzolese, 2015). The deduplication of the agents is based only on their URIs using a supervised classification method (Zhang et al., 2017), while ORCIDs are added in a further step. This methodology does not address the existence of homonyms. However, this is a minor issue for ScholarlyData, since only a few thousand people are involved in the conferences being indexed. ScholarlyData can be explored via a SPARQL endpoint, and dumps are available in RDF/XML format under a Creative Commons Attribution 3.0 Unported licence.
To conclude, we would like to point out that none of these other datasets mentioned above exposes change-tracking data and the related provenance information in RDF.
Table 1 summarises all the considerations made on each dataset.

Methodology
OpenCitations Meta is populated from input data in CSV format (i.e. tabular form). This choice is not accidental. We have found that data exposed by OpenCitations in CSV format (e.g. from COCI (OpenCitations, 2022)) are downloaded more frequently, in comparison to the same data in more structured formats (i.e. JSON Scholix and RDF N-Quads). This is due to the smaller file size (compared to N-Quads and Scholix) and, above all, to the higher readability of the tabular format for a human. The latter is the main reason why the input format adopted by OpenCitations Meta is CSV, to facilitate the future crowdsourcing of bibliographic metadata from human curatorial activities (Heibi et al., 2019a).
The input table of OpenCitations Meta has eleven columns, corresponding to a linearisation of the OCDM (Daquino et al., 2020): id, title, author, editor, publication date, venue, volume, issue, page, type, and publisher. For an in-depth description of how each field is structured, please see (Massari & Heibi, 2022). Once the CSV tabular data have been acquired, the data are first automatically curated (Curator step) and then converted to RDF based on the OCDM (Creator step). Finally, the curated CSV and RDF are stored as files, while a corresponding triplestore is incrementally populated. Fig. 2 summarises the workflow.
Figure 2: OpenCitations Meta workflow. First, the input data in CSV format is automatically corrected (1), deduplicated, and enriched with pre-existing information from within a triplestore (2). The corrected CSV is returned as output (3a). Second, the data are transformed into RDF (3b), saved to file (4a) and finally entered into the triplestore (4b).

Curator: deduplication, enrichment and correction
The curation process performs three main actions to improve the quality of the received data: deduplication, enrichment, and correction.
The approach chosen for data deduplication is based strictly on identifiers. In other words, two different entities are considered the same if, and only if, both have the same identifier, e.g. a DOI for articles, an ORCID for people, an ISBN for books, and an ISSN for publication venues (e.g. journals).
Different resources with the same identifier are merged following a precise rule: (1) if the resources are part of the same CSV file, the information of the first occurrence is favoured. However, (2) if the resource is already described in the triplestore, the information in the triplestore will be favoured. In other words, we consider the information stored in the triplestore as trusted, and it can only be incremented with additional data coming from a CSV source.
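The precedence rule above can be illustrated with a minimal sketch. The function name `merge_metadata`, the dictionary representation of a record and the sample values are all assumptions made for illustration; they are not taken from the OpenCitations Meta software.

```python
def merge_metadata(trusted: dict, incoming: dict) -> dict:
    """Keep every non-empty trusted value; use incoming data only to fill gaps."""
    merged = dict(trusted)
    for field, value in incoming.items():
        # A trusted value is never overwritten, only complemented.
        if not merged.get(field) and value:
            merged[field] = value
    return merged


# Hypothetical records: the triplestore copy is the trusted one.
triplestore_record = {"title": "Example Title", "volume": ""}
csv_row = {"title": "A Different Title", "volume": "3", "issue": "2"}
merged = merge_metadata(triplestore_record, csv_row)
```

The same function covers rule (1) if the first occurrence within a CSV file is passed as `trusted` and the later occurrence as `incoming`.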
Once an entity is deduplicated, it is assigned a new, permanent internal identifier called an OpenCitations Meta Identifier (OMID). The OMID has the structure [entity_type_abbreviation]/[supplier_prefix][sequential_number]. For example, the first journal article ever processed has OMID br/0601, where br is the abbreviation of "bibliographic resource", and 060 corresponds to the supplier prefix, which indicates the database to which the bibliographic resource belongs (in this case, OpenCitations Meta). Finally, 1 indicates that this OMID identifies the index's first bibliographic resource ever recorded for that prefix.
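As an illustration of this structure, the sketch below splits an OMID into its three parts. It rests on an assumption not stated in the text: that a supplier prefix consists of a leading 0, one or more non-zero digits and a trailing 0 (as in 060), which is what makes the boundary between prefix and sequential number unambiguous. The names `Omid` and `parse_omid` are hypothetical.

```python
import re
from typing import NamedTuple


class Omid(NamedTuple):
    entity_type: str
    supplier_prefix: str
    sequential_number: int


# Assumption: prefix = "0" + non-zero digits + "0"; the sequential number has no leading zero.
OMID_RE = re.compile(r"^(br|id|ar|ra|re)/(0[1-9]+0)([1-9][0-9]*)$")


def parse_omid(omid: str) -> Omid:
    """Split an OMID such as 'br/0601' into entity type, supplier prefix and number."""
    match = OMID_RE.match(omid)
    if match is None:
        raise ValueError(f"not a recognisable OMID: {omid!r}")
    entity_type, prefix, number = match.groups()
    return Omid(entity_type, prefix, int(number))
```

Under these assumptions, `parse_omid("br/0601")` yields the entity type `br`, the prefix `060` and the sequential number 1, matching the example in the text.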
The entities that are subject to deduplication and subsequently identified with an OMID are external identifiers (abbr. id), agent roles (i.e. authors, editors, publishers, abbr. ar), responsible agents (i.e. people and organisations, abbr. ra), resource embodiments (i.e. pages, abbr. re), and venues, volumes and issues (which are all bibliographic resources, abbr. br). Volumes and issues have OMIDs because they are treated as first-class citizens, not attributes of articles. This has the advantage of permitting one, for instance, to search for the papers within a specific issue, the volumes of a named journal, or journal issues published within a certain time period. In contrast, titles and dates are treated as literal values, not as entities.
Fig. 3 illustrates the deduplication decision tree. Given an input entity and its identifiers, there are six possible outcomes:
1. If the entity has no identifiers, or they do not exist in the triplestore, then a new OMID is created for the entity;
2. If the entity does not have an OMID, and if one of its external identifiers has already been associated with one and only one other entity, then the two entities are merged and treated as the same;
3. If the entity's external identifiers in the CSV connect two or more entities within the triplestore that had hitherto been distinct, and no OMID is specified in the CSV, then a conflict arises that cannot be resolved automatically and will require manual intervention. A new OMID is minted for this conflictual entity. For example, in the CSV, the same journal name is associated with two identifiers, issn:1588-2861 and issn:0138-9130; however, in the triplestore, there are entries for two separate entities, one with identifier issn:1588-2861 and the other with identifier issn:0138-9130, which in reality refer to the same entity;
4. If an entity in the CSV has an OMID that exists in the triplestore and no other IDs are present, then the information in the triplestore overwrites that in the CSV. The triplestore is then updated only by the addition of missing details. In other words, specifying its OMID for an entity in the CSV is a way to update an existing entity within OpenCitations Meta;
5. If an entity has an existing OMID and additional identifiers are associated with other entities without an OMID (in the CSV) or with the same OMID (in the CSV or triplestore), then the entities are merged. Moreover, the information in the CSV is overwritten with that already available in the triplestore, and missing details present in the CSV are then added to the triplestore;
6. Finally, if external identifiers connect several entities in the triplestore with different OMIDs, then a conflict arises. In this case, the OMID specified in the CSV takes precedence, and only entities with that OMID are merged.
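The six outcomes can be condensed into a small classification function. This is a simplified sketch, not the actual Curator code: the triplestore is modelled as a plain dictionary from external identifier to OMID (`store_index`, a hypothetical name), the function does not check that an OMID given in the CSV really exists in the triplestore, and the OMIDs in the example are invented.

```python
from typing import Dict, Iterable, Optional


def classify(csv_omid: Optional[str],
             external_ids: Iterable[str],
             store_index: Dict[str, str]) -> str:
    """Return which of the six outcomes applies to one input entity."""
    ids = set(external_ids)
    # OMIDs of the triplestore entities reached through the external identifiers.
    matched = {store_index[i] for i in ids if i in store_index}
    if csv_omid is None:
        if not matched:
            return "1: mint a new OMID"
        if len(matched) == 1:
            return "2: merge with the existing entity"
        return "3: conflict, mint a new OMID and flag for manual review"
    if not ids:
        return "4: update the existing entity"
    if matched <= {csv_omid}:
        return "5: merge"
    return "6: conflict, the OMID given in the CSV takes precedence"


# The ISSN example from outcome 3, with made-up OMIDs for the two stored entities.
store = {"issn:1588-2861": "br/0601", "issn:0138-9130": "br/0602"}
outcome = classify(None, ["issn:1588-2861", "issn:0138-9130"], store)
```

Here `outcome` falls under case 3, because the two ISSNs reach two distinct entities and the CSV supplies no OMID to arbitrate between them.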
Given these general rules, three particular cases deserve special concern. The first notable issue concerns the order of authors and editors, which must be maintained according to the OCDM. In the event of a merge, the order recorded when the entity was first created overwrites subsequent ones, and any new authors or editors are added to the end of the existing list, as shown in Fig. 4. The last significant case involves the containment relationship between articles, issues, volumes and venues. This structure is preserved in the case of a merge, where two volumes or issues are considered the same only if they have the same value, which may be a sequential number (e.g. "Volume 1") or an arbitrary name (e.g. "Clin_Sect").
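The rule on author and editor order can be sketched as follows. The function name `merge_agent_order` and the use of plain identifier strings to stand for agents are assumptions made for illustration.

```python
def merge_agent_order(existing: list, incoming: list) -> list:
    """Keep the order recorded first; append agents not seen before at the end."""
    merged = list(existing)
    seen = set(existing)
    for agent in incoming:
        if agent not in seen:
            merged.append(agent)
            seen.add(agent)
    return merged


# Hypothetical agent identifiers: the incoming order is ignored for known agents.
order = merge_agent_order(["ra/0601", "ra/0602"], ["ra/0602", "ra/0603", "ra/0601"])
```

In this example the two known agents keep their original positions and only the previously unseen `ra/0603` is appended.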

Curator: error proofing
Once all entities have obtained an OMID, data are normalised, and the errors that can be handled automatically are corrected. All identifiers are checked based on their identifier scheme; for instance, the syntactic correctness of ISBNs, ISSNs and ORCIDs is computed using specific formulas provided by the documentation of the identifier scheme. However, the semantic correctness of identifiers is verified only for ORCIDs and DOIs, which is done using open APIs to verify their actual existence, since, for instance, it is possible to produce an ORCID that is valid syntactically but is not in fact assigned to a person.
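To make the syntactic check concrete, the sketch below verifies the check characters of ISSNs (weighted sum modulo 11) and ORCIDs (ISO 7064 MOD 11-2), which are the publicly documented formulas for these two schemes. The function names are hypothetical, and the sketch covers only syntax: it makes no API call, so it cannot tell whether an identifier has actually been assigned.

```python
def issn_is_valid(issn: str) -> bool:
    """Check the ISSN check character (weights 8 down to 2, modulo 11)."""
    chars = issn.replace("-", "").upper()
    if len(chars) != 8 or not chars[:7].isdigit() or chars[7] not in "0123456789X":
        return False
    total = sum(int(d) * w for d, w in zip(chars[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    return chars[7] == ("X" if check == 10 else str(check))


def orcid_is_valid(orcid: str) -> bool:
    """Check the ORCID check character (ISO 7064 MOD 11-2)."""
    chars = orcid.replace("-", "").upper()
    if len(chars) != 16 or not chars[:15].isdigit():
        return False
    total = 0
    for digit in chars[:15]:
        total = (total + int(digit)) * 2
    check = (12 - total % 11) % 11
    return chars[15] == ("X" if check == 10 else str(check))
```

Both ISSNs quoted in the deduplication example, 1588-2861 and 0138-9130, pass `issn_is_valid`, whereas changing the last digit of either makes the check fail.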
All ambiguous and alternative characters used for spaces (e.g. tab, no-break space, em space) are transformed into space (Unicode character U+0020). Similarly, ambiguous characters for hyphens within ids, pages, volumes, issues, authors and editors (e.g. non-breaking hyphens, en dash, minus sign) are changed to hyphen-minus (Unicode character U+002D).
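A minimal sketch of this character normalisation is given below. The two character lists are illustrative assumptions that include the examples named in the text plus a few similar code points; the complete lists used in production may differ.

```python
# Assumed, non-exhaustive lists of characters to normalise.
SPACE_LIKE = "\t\u00a0\u2002\u2003\u2009\u202f\u3000"    # tab, no-break space, en/em space, ...
HYPHEN_LIKE = "\u2010\u2011\u2012\u2013\u2014\u2212"      # hyphen, non-breaking hyphen, dashes, minus

SPACE_TABLE = {ord(ch): " " for ch in SPACE_LIKE}
HYPHEN_TABLE = {ord(ch): "-" for ch in HYPHEN_LIKE}


def normalise_spaces(text: str) -> str:
    """Map every space-like character to U+0020."""
    return text.translate(SPACE_TABLE)


def normalise_hyphens(text: str) -> str:
    """Map every hyphen-like character to U+002D (hyphen-minus)."""
    return text.translate(HYPHEN_TABLE)
```

Keeping the two functions separate reflects the text: space normalisation applies everywhere, while hyphen normalisation is restricted to specific fields such as ids and pages.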
Regarding titles of bibliographic resources ("venue" and "title" columns), every word in the title is capitalised except for those with capitals within them (that are probably acronyms, e.g. "FaBiO" and "CiTO"). This exception, however, does not cover the case of entirely capitalised titles. The same rule is also followed for authors and editors, whether individuals or organisations.
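One possible reading of this capitalisation rule is sketched below. The function name `normalise_title` is hypothetical, and the handling of edge cases (punctuation attached to words, single-letter words) is an assumption rather than documented behaviour.

```python
def normalise_title(title: str) -> str:
    """Capitalise each word, preserving words that contain inner capitals."""
    words = title.split()
    if title.isupper():
        # Entirely capitalised titles are not treated as acronyms.
        return " ".join(word.capitalize() for word in words)
    result = []
    for word in words:
        if any(ch.isupper() for ch in word[1:]):
            result.append(word)          # probable acronym, e.g. "FaBiO"
        else:
            result.append(word.capitalize())
    return " ".join(result)
```

For instance, `normalise_title("the FaBiO and CiTO ontologies")` leaves "FaBiO" and "CiTO" untouched while capitalising the remaining words.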
Dates are parsed considering both the format validity, based on ISO 8601 (YYYY-MM-DD) (Wolf & Wicksteed, 1997), and the value (e.g. 30 February is not a valid date). Where necessary, the date is truncated. For example, the date 2020-02-30 is transformed into 2020-02 because the day of the given date is invalid. Similarly, 2020-27-12 will be truncated to 2020 since the month (and hence the day) is invalid. The date is discarded if the year is invalid (e.g. a year greater than 9999).
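The truncation behaviour can be sketched with the standard library alone. The function name `clean_date` and the choice of the empty string to represent a discarded date are assumptions made for this example.

```python
import datetime


def _is_number(part: str, length: int) -> bool:
    return len(part) == length and part.isascii() and part.isdigit()


def clean_date(raw: str) -> str:
    """Return the longest valid prefix of a YYYY-MM-DD date, or '' if the year is invalid."""
    parts = raw.strip().split("-")
    if not _is_number(parts[0], 4) or int(parts[0]) == 0:
        return ""                        # invalid year: discard the whole date
    year = parts[0]
    if len(parts) < 2 or not _is_number(parts[1], 2) or not 1 <= int(parts[1]) <= 12:
        return year                      # invalid or missing month: keep the year only
    month = parts[1]
    if len(parts) < 3 or not _is_number(parts[2], 2):
        return f"{year}-{month}"
    try:
        datetime.date(int(year), int(month), int(parts[2]))
    except ValueError:
        return f"{year}-{month}"         # e.g. 30 February: drop the day
    return f"{year}-{month}-{parts[2]}"
```

With this sketch, the two examples from the text behave as described: 2020-02-30 becomes 2020-02 and 2020-27-12 becomes 2020.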
The correction of volume and issue numbers is based on numerous rules which deserve special mention. In general, we have identified six classes of errors that may occur, and each different class is addressed accordingly:
1. Volume number and issue number in the same field (e.g. "Vol. 35 N°spécial 1"). The two values are separated and assigned to the corresponding field.
5. Volume classified as issue (e.g. "Volume 1" in the "issue" field). If the volume pattern is found in the "issue" field and the "volume" field is empty, the content is moved to the "volume" field, and the "issue" field is set to null. However, if the "issue" field contains a volume pattern and the "volume" field contains an issue pattern, the two values are swapped.
6. Issue classified as volume (e.g. "Special Issue 2" in the "volume" field). It is handled in the same way as case 5, but in reversed roles.
Finally, in case a value is both invalid in its format and invalid because it is in the wrong field, then such a value is first corrected and then moved to the right field, if appropriate.
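Cases 5 and 6 above can be sketched as a single function. The two regular expressions are deliberately crude assumptions standing in for the real volume and issue patterns, which the text does not spell out, and the empty string is used here to represent a null field.

```python
import re

# Assumed patterns, for illustration only.
VOLUME_RE = re.compile(r"^\s*vol(?:ume|\.)?\s*\S+", re.IGNORECASE)
ISSUE_RE = re.compile(r"^\s*(?:special\s+)?issue\s*\S+", re.IGNORECASE)


def fix_misplaced(volume: str, issue: str) -> tuple:
    """Move or swap volume and issue values that sit in the wrong field."""
    volume_in_issue = bool(VOLUME_RE.match(issue))
    issue_in_volume = bool(ISSUE_RE.match(volume))
    if volume_in_issue and issue_in_volume:
        return issue, volume             # both misplaced: swap them
    if volume_in_issue and not volume:
        return issue, ""                 # case 5: move to "volume", empty "issue"
    if issue_in_volume and not issue:
        return "", volume                # case 6: move to "issue", empty "volume"
    return volume, issue
```

A value is moved only when the destination field is empty, so that a correctly filled field is never overwritten by the correction.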
Once the input data has been disambiguated, enriched and corrected, a new CSV file is produced and stored. This file represents the first output of the process (3a in Fig. 2).

Creator: semantic mapping
In this phase, data are modelled in RDF following the OCDM (Daquino et al., 2020). This ontology reuses entities defined in the SPAR Ontologies to represent bibliographic entities (fabio:Expression), identifiers (datacite:Identifier), agent roles (pro:RoleInTime), responsible agents (foaf:Agent) and publication format details (fabio:Manifestation). The agent role (i.e. author, editor or publisher) is used as a proxy between the bibliographic resource and the responsible agent, i.e. the person or organisation. This approach helps us define time-dependent and context-dependent roles and statuses, such as the order of the authors (Peroni et al., 2012). Fig. 5 depicts the relationships between the various entities through the Graffoo graphical framework (Falco et al., 2014). Furthermore, the person (foaf:Agent) Glenn Hunt (foaf:givenName, foaf:familyName) is the first author (pro:RoleInTime) in the context of this article (pro:isDocumentContextFor). Similarly, the second author is Michelle Cleary (pro:hasNext).
Once the mapping is complete, the RDF data produced can be stored (4a in Fig. 2) and uploaded to a triplestore (4b in Fig. 2).

Creator: provenance and change tracking
In addition to handling their metadata, great importance is given to provenance and change tracking for entities in OpenCitations Meta. Provenance is a record of who processed a specific entity by creating, deleting, modifying or merging it, when this action was performed, and what the primary source was (Gil et al., 2010). Keeping track of this information is crucial to ensure the reliability of the metadata within OpenCitations Meta. Indeed, the truth of a statement on the Web and the Semantic Web is never absolute, and integrity must be assessed by every application that processes information by evaluating its context (Koivunen & Miller, 2001).
However, besides storing provenance information, mechanisms to understand the evolution of entities are critical when dealing with activities such as research assessment exercises, where modifications, due to either corrections or misspecification, may affect the overall evaluation of a scholar, a research group, or an entire institution. For instance, the name of an institution might change over time, and the reflection of these changes in a database can "make it difficult to identify all institution's names and units without any knowledge of institution's history" (Pranckutė, 2021). This scenario can be prevented by keeping track of how data evolved in the database, thus enabling users to understand such dynamics without accessing external background knowledge. To our knowledge, no other semantic database of scholarly metadata keeps track of changes and provenance in standard RDF 1.1.
The provenance mechanism employed by OpenCitations describes an initial creation snapshot for each stored entity, potentially followed by other snapshots detailing modification, merge or deletion of data, each marked with its snapshot number, as summarised in Fig. 6.
Figure 6: A high-level description of the provenance layer of the OCDM to keep track of the changes to an entity. To keep track of the full history of an entity, we need to store all the triples of its most recent snapshot plus all the deltas built by modifying the previous snapshots.
Regarding the semantic representation, the problem of provenance modelling (Sikos & Philp, 2020) and change-tracking in RDF (Pelgrin et al., 2021) has been discussed in the scholarly literature. To date, no shared standard achieves both purposes. For this reason, OpenCitations employs the most widely shared approaches, i.e. named graphs (Carroll et al., 2005), the Provenance Ontology (Lebo et al., 2013), and Dublin Core (Board, 2020).
In particular, each snapshot is connected to the previous one via the prov:wasDerivedFrom predicate and is linked to the entity it describes via prov:specializationOf. In addition, each snapshot corresponds to a named graph in which the provenance metadata are described, namely the responsible agent (prov:wasAttributedTo), the primary source (prov:hadPrimarySource), the generation time (prov:generatedAtTime), and, after the generation of an additional snapshot, the invalidation time (prov:invalidatedAtTime). Each snapshot may also optionally be represented by a natural language description of what happened (dcterms:description).
In addition, the OCDM provenance model adds a new predicate, oco:hasUpdateQuery, described within the OpenCitations Ontology (Daquino & Peroni, 2019), which expresses the delta between two versions of an entity via a SPARQL UPDATE query. The deduplication process described in Section 3.1 takes place not only on the current state of the dataset but on its entire history by enforcing the change-tracking mechanism. In other words, if an identifier can be traced back to an entity deleted from the triplestore, that identifier will be associated with the OMID of the deleted entity. If the deletion is due to a merge chain, the OMID of the resulting entity takes precedence. For more on the time-traversal queries methodology, see (Massari & Peroni, 2022). For more details on the programming interface for creating data and tracking changes according to the SPAR Ontologies, consult (Persiani et al., 2022).
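The change-tracking principle described above, i.e. keeping the most recent state of an entity plus one delta per modification, can be illustrated with a minimal Python sketch. It is not the OpenCitations implementation and does not use SPARQL: triples are plain tuples, each delta is a hypothetical pair of added and removed triple sets standing in for the SPARQL UPDATE query stored via oco:hasUpdateQuery, and the names Delta, restore_history and the ex: identifiers are assumptions made for illustration only.

```python
from typing import FrozenSet, List, NamedTuple, Set, Tuple

Triple = Tuple[str, str, str]


class Delta(NamedTuple):
    """Changes that turned one snapshot into the next (hypothetical structure)."""
    added: FrozenSet[Triple]
    removed: FrozenSet[Triple]


def restore_history(current: Set[Triple], deltas: List[Delta]) -> List[Set[Triple]]:
    """Rebuild every past state of an entity, oldest first.

    `deltas` is ordered from the oldest modification to the newest one.
    Starting from the current state, each delta is undone in reverse order.
    """
    state = set(current)
    states = [set(state)]
    for delta in reversed(deltas):
        # Undo the change: drop what was added, restore what was removed.
        state = (state - delta.added) | delta.removed
        states.append(set(state))
    states.reverse()
    return states


# Illustrative data only: a made-up entity whose title was corrected once.
entity = "ex:br/1"
old_title = (entity, "dcterms:title", "A titel with a typo")
new_title = (entity, "dcterms:title", "A title without a typo")
entity_type = (entity, "rdf:type", "ex:JournalArticle")

current_state = {new_title, entity_type}
history = restore_history(
    current_state,
    [Delta(added=frozenset({new_title}), removed=frozenset({old_title}))],
)
for number, snapshot in enumerate(history, start=1):
    print(f"snapshot {number}: {sorted(snapshot)}")
```

Storing only the latest state and the deltas avoids duplicating unchanged triples across snapshots, at the cost of replaying the deltas whenever an older state is requested.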
Editors and authors have been counted as roles, without disambiguating the individuals holding these roles. Conversely, bibliographic entities, publishers, and venues were counted by OMID. However, for venues (e.g. journals), we have taken an extra precaution: many are duplicated in OpenCitations Meta because they have no identifiers other than the OMID. Therefore, in the figures shown above, we found it reasonable to disambiguate the venues by title in the absence of other identifiers.
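The counting rule for venues can be sketched as follows. This is only an illustration under stated assumptions: the record layout (the keys omid, title and ids), the sample records, and the title normalisation (lower-casing and collapsing whitespace) are hypothetical and are not taken from the OpenCitations software.

```python
from typing import Dict, Iterable


def count_distinct_venues(venues: Iterable[Dict[str, object]]) -> int:
    """Count venues, merging those without external identifiers by title."""
    seen = set()
    for venue in venues:
        external_ids = venue.get("ids") or []
        if external_ids:
            # Venues with at least one external identifier are counted by OMID.
            key = ("omid", venue["omid"])
        else:
            # No identifier other than the OMID: fall back to a normalised title
            # (the normalisation itself is an assumption of this sketch).
            key = ("title", " ".join(str(venue["title"]).lower().split()))
        seen.add(key)
    return len(seen)


# Made-up records: the last two are the same journal stored twice without IDs.
sample_venues = [
    {"omid": "omid:br/1", "title": "Journal A", "ids": ["issn:0000-0001"]},
    {"omid": "omid:br/2", "title": "Journal B", "ids": []},
    {"omid": "omid:br/3", "title": "journal  b", "ids": []},
]
venue_count = count_distinct_venues(sample_venues)
print(venue_count)
```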
As shown in Table 2, Springer Science is the publishing entity with the highest number of venues (2,097), followed by Elsevier BV (1,961) and IEEE (1,775). By number of publications (Table 3), Elsevier is in the lead (16,933,610), followed by Springer Science (11,507,498) and Wiley (7,262,893).
Considering the venues in Table 4, Wiley's ChemInform has the most publications (421,735), followed by Elsevier's SSRN Electronic Journal (337,223) and Springer's Journal On Data Semantics (330,093).
Table 5 lists all the types of bibliographic resources in OpenCitations Meta. The current dataset contains mostly journal articles (67,904,323), which exceed the number of book chapters in second place (6,476,623) by about ten times, and proceedings articles in third place (5,046,165) by about thirteen times.
Table 6, which lists the number of publications per year, shows an increasing trend, with a greater number of publications from year to year. OpenCitations Meta allows users to explore such data either via SPARQL (https://opencitations.net/meta/sparql) or via an API (https://opencitations.net/meta/api/v1). In particular, the OpenCitations Meta API retrieves a list of bibliographic resources and related metadata starting from one or more publication identifiers, an author's ORCID, or an editor's ORCID. Textual searches are currently under testing and will be released in the future as one further operation of the OpenCitations Meta API. They will cover titles, authors, editors, publishers, IDs, and venues, as well as volume and issue numbers, provided the venue is first specified. Searches on multiple fields can be combined using the Boolean conjunction and disjunction operators. For example, once the operation is released, the user will be able to search for all bibliographic resources whose title contains the word "micro-chaos" published either by Philosophical Studies or the Journal of Nonlinear Science: title=micro-chaos&&venue=philosophical%20studies||title=micro-chaos&&venue=journal%20of%20nonlinear%20science, where "&&" is the conjunction operator and "||" is the disjunction operator.
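As an illustration of the query syntax in the example above, the following Python sketch assembles such a search string from field/value pairs. Since the textual search operation has not been released, this only builds the string and does not call the API; the function name build_text_search and the lower-casing of values (inferred from the example) are assumptions.

```python
from typing import Dict, List
from urllib.parse import quote


def build_text_search(alternatives: List[Dict[str, str]]) -> str:
    """Join field=value terms with "&&" and the resulting groups with "||"."""
    groups = []
    for fields in alternatives:
        # Percent-encode each value, so that a space becomes %20.
        terms = [f"{name}={quote(value.lower())}" for name, value in fields.items()]
        groups.append("&&".join(terms))
    return "||".join(groups)


query = build_text_search([
    {"title": "micro-chaos", "venue": "Philosophical Studies"},
    {"title": "micro-chaos", "venue": "Journal of Nonlinear Science"},
])
print(query)
```

Each dictionary is one conjunction of fields, and the list expresses the disjunction between those conjunctions, mirroring the structure of the example query.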

Discussion
As shown in Section 2, when considering only semantic publishing datasets, OpenCitations Meta, which currently includes data from Crossref, DataCite, and the NIH Open Citation Collection (ICite et al., 2022), is first in data volume. Moreover, work is already underway to ingest data from new sources, such as the Japan Link Center (Hara, 2020), the OpenAIRE Research Graph (Atzori et al., 2017), and the Dryad Digital Repository (Vision, 2010).
When compared to the OpenAIRE Research Graph, OpenCitations Meta has advantages in terms of functionality, namely the use of OMIDs, globally unique persistent identifiers used internally to identify every entity within OpenCitations Meta. This usage makes it possible to represent and index citations between bibliographic resources that lack an external persistent identifier such as a Digital Object Identifier (DOI). This feature adds significant value for the OpenCitations Indexes, as it allows for the first time the ingestion of many citations which until now could not be characterised, particularly citations between publications from the humanities and social sciences (Gorraiz et al., 2016), and citations involving primary sources, e.g. a statue, a painting, or a codex, which typically lack a persistent identifier. Importantly, having an OMID also permits the identified resource to be assigned a unique URL, for example https://w3id.org/oc/meta/br/061401975837 for omid:br/061401975837. Another feature that, to the best of our knowledge, is only present in OpenCitations Meta is the mechanism for change-tracking management within the provenance information stored in RDF. This information can be queried using the Python time-agnostic-library software (Massari & Peroni, 2022), which can perform time-traversal SPARQL queries, i.e. queries across different snapshots together with provenance information.
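The correspondence between an OMID and its URL given in the example above can be sketched in a few lines of Python. The rule is generalised here from that single example, so the base URL constant and the function name omid_to_url should be read as assumptions rather than as a documented resolution scheme.

```python
OMID_BASE_URL = "https://w3id.org/oc/meta/"  # taken from the example in the text


def omid_to_url(omid: str) -> str:
    """Turn an identifier such as omid:br/061401975837 into its URL."""
    prefix = "omid:"
    if not omid.startswith(prefix):
        raise ValueError(f"not an OMID: {omid!r}")
    return OMID_BASE_URL + omid[len(prefix):]


url = omid_to_url("omid:br/061401975837")
print(url)
```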
As far as other bibliographic datasets that do not use Semantic Web technologies go, OpenAlex (Priem et al., 2022) is an important case to consider for comparison with OpenCitations Meta. OpenAlex uses web crawls to add missing metadata, a feature that allows it to automatically correct a higher number of errors appearing in the data of the sources, when compared to OpenCitations Meta.
Indeed, currently, the main limitation of OpenCitations Meta concerns the quality of the data, which is strictly dependent on the quality of the sources. Crossref does not double-check the metadata provided by publishers, and thus many errors are preserved. For instance, it is possible to encounter articles published in the future (the metadata available at https://api.crossref.org/v1/works/10.12960/tsh.2020.0006 say that the article will be published in print in 2029). Some of these errors can be corrected automatically without any background knowledge, while others require either the use of web crawlers or manual intervention. While OpenAlex is pursuing the path of web crawls, OpenCitations is working on a framework that will allow the editing and curation of data by trusted human domain experts (such as academic librarians).
OpenCitations Meta fulfils its primary purpose by holding the bibliographic metadata required to describe the citing and cited publications involved in the citations within the OpenCitations Indexes. In addition to these bibliographic metadata elements, however, we are well aware that there are additional metadata elements of great importance for the academic community: Abstracts, for text mining, domain and subject field determination, and indexing (even if the full texts of the publications are available open access elsewhere), and Funder IDs, Funding information and Institutional identifiers, essential for determining performance metrics and undertaking research assessment. Once we have completed the provision of our textual search operations, expanded our coverage in the ways indicated, and enhanced the computational infrastructure upon which OpenCitations Meta and the OpenCitations Indexes run, we will proceed to integrate and populate these additional metadata fields.
The provision of high-quality bibliographic metadata is a complex and difficult goal to achieve by automated operations, while the scale of the operations precludes manual curation except for a minority of records. No bibliographic dataset is currently able to achieve this goal on its own. For this reason, all the available bibliographic databases should be viewed as complementary. For example, while at the moment OpenAlex provides better quality metadata, OpenCitations Meta has complete provenance data openly available, and enables more complex searches, thanks to the possibilities offered by Semantic Web technologies. An example of such a search is "Search for all authors who co-authored with Silvio Peroni or Fabio Vitali in conference proceedings that were published by Springer after 2009". Furthermore, OpenAlex is only partially free, since a fee must be paid to make more than a hundred thousand requests per day via the API and to access data updated every hour via the API (instead of every month via the dump). In contrast, users can make unlimited requests to the latest version of OpenCitations Meta for free.
Also, although the OpenAIRE Research Graph currently contains more metadata, such data are released under a CC-BY attribution licence, while the data released by OpenCitations Meta are under a CC0 public domain waiver, permitting complete freedom for reuse, including commercial reuse, and for machine processing without any requirement for attribution.

Conclusion
This article detailed the methodology used to develop OpenCitations Meta, a database that stores and delivers bibliographic metadata for all publications involved in the OpenCitations Indexes. This process involves two main phases: (1) an automatic curation analysis aimed at deduplicating entities, correcting errors and enriching information, and (2) a data conversion to RDF, while keeping track of changes and provenance in RDF.
Information about new publications is continuously being added to Crossref, DataCite, and PubMed, and we will develop procedures to ingest these new metadata into OpenCitations Meta in a regular and timely manner. Furthermore, work is already underway to ingest bibliographic metadata from the Japan Link Center and the OpenAIRE Research Graph, and other sources will be included as our human and computational resources permit. OpenCitations Meta will thus continue to grow.
OpenCitations Meta has three major benefits. First, the use of OMIDs (OpenCitations Meta Identifiers) for all stored entities enables OpenCitations Meta to act as a mapping hub for publications that may have more than one external PID (for example, a journal article described in Crossref with a DOI (Digital Object Identifier) and the same publication described in PubMed with a PMID (PubMed Identifier)), while also making it possible to characterise citations involving resources lacking any external PIDs. Consequently, the second benefit is that OpenCitations Meta allows citations in the OpenCitations Indexes to be described as OMID-to-OMID, disambiguating citations between documents with different identifier schemes, e.g. citations represented as DOI-to-DOI in Crossref and as PMID-to-PMID in PubMed. Third, OpenCitations Meta speeds search operations to retrieve metadata on publications involved in the citations stored in the OpenCitations Indexes, since these metadata are now kept in-house, rather than being retrieved by on-the-fly API calls to external resources. Future challenges will be to elaborate a disambiguation system for people lacking an ORCID identifier, to improve the quality of the existing metadata, to enhance the search operations and the storage efficiency, to add additional metadata fields for Abstracts, Funder IDs, Funding information, and Institutional identifiers, and to populate these where these metadata are available from our sources.
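The role of OMIDs as a mapping hub for citation deduplication can be illustrated with a minimal Python sketch. The identifiers, the mapping table and the function name to_omid_citations are all hypothetical; the sketch only shows how a DOI-to-DOI citation and a PMID-to-PMID citation collapse into a single OMID-to-OMID citation once both identifier schemes point to the same OMIDs.

```python
from typing import Dict, Iterable, Set, Tuple

Citation = Tuple[str, str]  # (citing identifier, cited identifier)


def to_omid_citations(
    citations: Iterable[Citation], id_to_omid: Dict[str, str]
) -> Set[Citation]:
    """Rewrite citations as OMID-to-OMID pairs; identical pairs collapse in the set."""
    return {(id_to_omid[citing], id_to_omid[cited]) for citing, cited in citations}


# Made-up mapping: each publication has both a DOI and a PMID.
id_to_omid = {
    "doi:10.1000/a": "omid:br/1",
    "pmid:111": "omid:br/1",
    "doi:10.1000/b": "omid:br/2",
    "pmid:222": "omid:br/2",
}
citations = [
    ("doi:10.1000/a", "doi:10.1000/b"),  # as described in a DOI-based index
    ("pmid:111", "pmid:222"),            # the same citation in a PMID-based index
]
deduplicated = to_omid_citations(citations, id_to_omid)
print(deduplicated)
```

Counting the elements of the resulting set gives one citation instead of two, which is the effect that matters when computing ingoing and outgoing citation counts.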
Finally, an interface will be implemented and made available to trusted domain experts to permit direct real-time manual curation of metadata held by OpenCitations Meta. Such a system will track changes and provenance, will preserve the delta between different versions of each entity, and will retain information such as the agent responsible for the change, the primary source, and the date. In this way, we will strive to make OpenCitations Meta not only comprehensive but also an accurate and fully open and reusable source of bibliographic metadata to which members of the scholarly community can directly contribute.

Figure 1: If a document is described by multiple identifiers, e.g., a DOI from Crossref and a PMID from PubMed, the citations involving it may be described in multiple ways, creating an ambiguity and deduplication problem. Use of the OpenCitations Meta Identifier solves this issue by acting as a proxy between different external identifiers

Figure 5: Part of the OCDM used in OpenCitations Meta. Yellow rectangles represent classes, green polygons represent datatypes, and blue and green arrows represent object properties and data properties, respectively

Fig. 7 displays the model via a Graffoo diagram.

Figure 7: The Graffoo diagram describing snapshots (prov:Entity) of an entity (linked via prov:specializationOf) and the related provenance information

Table 1: Open scholarly datasets ordered by the number of contained research entities, and compared regarding change-tracking, provenance, disambiguation method, presence of an internal ID, accessibility, and data usage licence

Table 2: The top ten publishers by number of venues

Table 3: The top ten publishers by number of publications

Table 4: The top ten venues by number of publications

Table 6: Top ten years of publication by the number of publications in that year