Open access at the national level: A comprehensive analysis of publications by Finnish researchers

Open access (OA) has mostly been studied by relying on publication data from selective international databases, notably Web of Science (WoS) and Scopus. The aim of our study is to show that it is possible to achieve a national estimate of the number and share of OA based on institutional publication data providing a comprehensive coverage of the peer-reviewed outputs across fields, publication types, and languages. Our data consists of 48,177 journal, conference, and book publications from 14 Finnish universities in 2016–2017, including information about OA status, as self-reported by researchers and validated by data-collection personnel through their Current Research Information System (CRIS). We investigate the WoS, Scopus, and DOI coverage, as well as the share of OA outputs between different fields, publication types, languages, OA mechanisms (gold, hybrid, and green), and OA information sources (DOAJ, Bielefeld list, and Sherpa/Romeo). We also estimate the role of the largest international commercial publishers compared to the not-for-profit Finnish national publishers of journals and books. We conclude that institutional data, integrated at national and international level, provides one of the building blocks of a large-scale data infrastructure needed for comprehensive assessment and monitoring of OA across countries, for example at the European level.


INTRODUCTION
While open access (OA), free of cost and other access barriers, has been gradually emerging for over two decades, it has recently gained considerable momentum through science policy. In 2016, the European Union member states agreed to "[…] open access to scientific publications as the default option by 2020 and to the best possible re-use of research data as a way to accelerate the transition towards an open science system" (Council of the European Union, 2016). The European Commission supports the transition with a strong open science agenda (European Commission, 2018). Recently, a group of European research funders formed cOAlition S, where funders from around the world are invited to join and make a shared commitment to make immediate OA and […]
Citation: Pölönen, J., Laakso, M., Guns, R., Kulczycki, E., & Sivertsen, G. (2020).
The aim of our study is to show that it is possible to achieve a national estimate of the number and share of OA based on institutional publication data providing the most complete source of the universities' peer-reviewed output. In this paper, we explore and use the institutional publication data from 2016-2017 stored in the VIRTA Publication Information Service, which integrates data from the different types of commercial and noncommercial CRIS solutions of 14 Finnish universities (Puuska, Guns, et al., 2018; Pölönen, 2018), to describe the landscape of OA publishing in Finland, including all publication types, languages, and fields. To identify OA publications, we also use OA status information from VIRTA, which has been self-reported by researchers and validated by data collection personnel at universities (Ilva, 2017a).
More specifically, we investigate the added value of institutional data for OA study compared with WoS, Scopus, and DOIs in terms of national publication output coverage, the OA share of outputs across different fields, publication types, and languages based on comprehensive data, the coverage and information value of international sources for gold (DOAJ and Bielefeld list) and green OA journals (Sherpa/Romeo), and dominance of the largest international commercial publishers. Although our analyses are based on data concerning Finnish universities, our findings are also relevant for an international audience with regard to the options, challenges, and advantages of using institutional publication data for OA monitoring at the national level in other countries as well as across countries, for example at the European level.
In the introduction, we provide background information on the existing data sources and methods of OA monitoring. Our literature review, presented in section 2, shows that CRIS data remains underexploited in the study of OA uptake. In section 3 we present our research questions, data, and methods. The results of our empirical analysis are presented in section 4, followed by a discussion in section 5 and conclusions in section 6.

Data Sources and Methods for OA Monitoring
Science policy, specifically concerning OA, would benefit from having frequently updated comprehensive metrics collected through a consistent methodology and definitions to support decision-making and monitoring. However, this is an area where there is still considerable room for improvement. A practical example of key problems with regard to publication information availability is the European Open Science Monitor (Waltman, 2019). This is a service funded by the European Commission and intended to provide regularly updated country-level metrics on OA development (European Commission, 2019). The use of Elsevier, the largest scholarly journal publisher with an ongoing influence and financial interest in the development of the OA landscape, as a subcontractor has been appealed to no avail (Tennant, 2018), but the concern is not limited to the potential impact on business and market competition. What is also a concern for scholarship more widely is that the metrics used for the monitor are based on Elsevier's Scopus bibliographic database, an index widely used for various purposes where the scholarly publication landscape is to be represented. While Scopus is more inclusive than its closest competitor, WoS by Clarivate Analytics, both still leave out a substantial part of the scholarly record and have been found to be limited in many respects in what they include (see, e.g., Archambault et al., 2006; Chavarro, Ràfols, & Tang, 2018; Hicks, 1999; Hicks & Wang, 2011; Larivière & Macaluso, 2011; Mongeon & Paul-Hus, 2015; Nederhof, 1989, 2006; Somoza-Fernández, Rodríguez-Gairín, & Urbano, 2018). The literature review section of this article will reveal that almost all studies of national OA uptake have relied on publication data from either Scopus or WoS, which is a fundamental limitation of perspective.
In many countries, universities annually report their complete bibliographic record of peer-reviewed publications to the government as part of performance-based research funding systems (PRFS) (Giménez-Toledo, Mañana-Rodríguez, et al., 2017; Hicks, 2012; Sīle, Guns, et al., 2017; Sīle, Pölönen, et al., 2018). In Norway, Denmark, Finland, Flanders (Belgium), and Poland (countries that have in some form adapted the so-called Norwegian model of PRFS), the national bibliographic database either substitutes for the universities' local Current Research Information Systems (CRIS) or integrates publication data from the local CRIS (Aagaard, 2018; Kulczycki & Korytkowski, 2018; Pölönen, 2018; Sivertsen, 2016a, 2016c, 2017, 2018a). Comparisons with the comprehensive national publication data have shown that especially in the social sciences and humanities (SSH), WoS and Scopus coverage is seriously lacking, mainly due to the importance of national language and book publishing (Aksnes & Sivertsen, 2019; Giménez-Toledo et al., 2016; Kulczycki, Engels, et al., 2018; Kulczycki, Guns, et al., 2020; Ossenblok, Engels, & Sivertsen, 2012; Sivertsen, 2016b; Sivertsen & Larsen, 2012). In many SSH disciplines, the majority of journal articles are published in national or regional outlets not indexed in WoS or Scopus (den Hertog, Jager, et al., 2014; Sivertsen, 2016b). In addition, up to half of peer-reviewed outputs in the humanities, and around one-third in the social sciences, are book publications, including chapters and monographs (Engels, Starčič, et al., 2018). The implication is that only countries in which a national bibliographic database with full coverage of the SSH publications (Sīle et al., 2018) has been developed can provide an accurate picture of publications across all fields and publication types.
In addition to coverage issues of publications in international bibliographic databases, OA monitoring is conditioned by OA definitions and methodologies for identifying what is available OA among these publications. The Directory of Open Access Journals (DOAJ) and Sherpa/Romeo are the most frequently used information sources to identify gold and green OA journals. But not all gold OA journals are included in DOAJ. Bielefeld University, for example, provides an ISSN-matching of gold OA journals based, in addition to DOAJ, also on the Directory of Open Access Scholarly Resources (ROAD), PubMed Central (PMC), and Open APC (OAPC) (Wohlgemuth, Rimmert, & Winterhager, 2016). According to the most recent analysis, Bielefeld's list contained 7,755 gold OA journals, of which DOAJ covered 33% (Bruns, Lenke, et al., 2019). Recently, Björk (2019) identified 437 OA journals published in the Nordic countries, of which DOAJ covered 42%. There were also considerable differences between the Nordic countries, as DOAJ covered 68% of OA journals from Norway but only 23% of those published in Finland. The Federation of Finnish Learned Societies and DOAJ have started a pilot project to encourage Finnish OA journals to apply to DOAJ (DOAJ, 2019). The Sherpa/Romeo register of self-archiving policies has extensive coverage of journals, but the information value of the color codes used for classification of the policies, notably the identification of green OA journals, has been questioned, as publishers have increasingly introduced additional requirements not captured by the color codes (Gadd & Troll Covey, 2016). Sherpa/Romeo recently launched a new version of the service, in which the color codes are no longer used.
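The coverage figures quoted above rest on matching journal identifiers between lists. A minimal sketch of such an ISSN-based comparison follows, assuming two lists are already available as plain ISSN strings. The ISSNs, list contents, and function names are made up for illustration and do not reproduce the procedure of any of the cited studies.

```python
# Illustrative sketch only: the ISSNs and list contents below are invented.

def normalize_issn(issn: str) -> str:
    """Return an ISSN in the form NNNN-NNNN with an uppercase final character."""
    chars = issn.replace("-", "").replace(" ", "").upper()
    if len(chars) != 8:
        raise ValueError(f"not an 8-character ISSN: {issn!r}")
    return f"{chars[:4]}-{chars[4:]}"


def coverage(reference: set, candidate: set) -> float:
    """Share of the reference list that also appears in the candidate list."""
    if not reference:
        return 0.0
    return len(reference & candidate) / len(reference)


# Hypothetical broader gold OA list and a narrower directory of OA journals.
broad_list = {normalize_issn(i) for i in ["1234-5678", "2345-6789", "3456789X", "4567-8901"]}
directory = {normalize_issn(i) for i in ["12345678", "3456-789x"]}

share = coverage(broad_list, directory)
print(f"directory covers {share:.0%} of the broader list")
```

Normalizing before matching matters because the same ISSN can be recorded with or without a hyphen, or with a lowercase check character. The sketch does not validate the ISSN check digit.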
It is in the interest of the European Commission to have a comprehensive open science monitor, based on open and transparent data infrastructure independent of private operators (Tennant, 2018; Waltman, 2019). Therefore, it is important to investigate the institutional CRIS data not only from the national OA perspective but also because it potentially contributes to the large-scale international data infrastructure needed for evaluation, assessment, and monitoring of research activities at the European level (European Commission, 2010; Lauer, 2016; Mahieu, Arnold, & Kolarz, 2014; Sivertsen, 2019). OpenAIRE and Crossref are important building blocks of such an infrastructure for open metadata (Waltman, 2019). Since 2018, data collected and made available by the service Unpaywall can be used to identify different types of OA publications based on DOIs (Piwowar et al., 2018). Recent analyses show, however, that the availability of DOIs is far from complete, and there are considerable differences in DOI availability between publication types, fields, and countries (Boudry & Chartron, 2017; Fasae & Oriogu, 2018; Gorraiz, Melero-Fuentes, et al., 2016). The added value of integrated CRIS data is that it can provide well-structured and curated metadata of all publications, whether they are included in WoS and Scopus or not, are in printed or digital format, have a DOI or not, and are openly available on the internet or not. Indeed, the Finnish VIRTA Publication Information Service, a national solution for integrating publication data from diverse local CRISs, has already been tested to integrate CRIS data from four European countries (Puuska et al., 2018).

Challenges with Using CRIS-based Data
The national bibliographic databases also have their own challenges of data coverage and quality. If assessed based on included publication types and languages, types of research organizations and organizational units, seniority and job positions of authors, fields of science, intended audience of publications, and peer-review status, most CRIS-based national databases can be described as very comprehensive (Sīle et al., 2017, 2018). Several studies, referred to above, have indeed demonstrated the substantially larger coverage of national publication data compared to WoS and Scopus. Similarly, a study of a single Dutch university showed a substantially larger coverage of outputs, especially in the SSH, in the local CRIS compared to WoS (van Leeuwen, van Wijk, & Wouters, 2016). Further studies are needed, however, to investigate to what extent publications included in WoS or Scopus may be missing from the CRIS-based data (e.g., due to researchers' failure to report). Especially in the case of national databases supporting PRFS, comprehensiveness is promoted by the fact that universities not only have a considerable financial incentive to secure as complete reporting of publications as possible, but the reporting is also legally mandated (Sīle et al., 2018; Sivertsen, 2018a, 2019). CRISs are needed, among other things, to provide comprehensive, reliable, comparable, and transparent information on research activities (Science Europe, 2016). Completeness is an important aspect of data quality, in addition to correctness, consistency, and timeliness of data (Azeroual & Schöpfel, 2019; Sīle, Guns, et al., 2019). A major challenge is the variety and complexity of publication information (e.g., OA status of publications) and the diversity of data providers and sources (e.g., researchers, data-collection personnel, external databases).
Diversity of practices between fields and publication types can increase ambiguity over definitions, such as peer-review status of publications (Kaltenbrunner & de Rijcke, 2016). In national databases supporting PRFS, standardization and interoperability of data are promoted by means of national level data-collection guidelines with definitions and requirements for reported publications (Sivertsen, 2019). Nevertheless, research-performing organizations (e.g., universities) have different procedures for maintaining records about the publications that affiliated researchers have authored (van Leeuwen, van Wijk, & Wouters, 2016). The quality of the data stored in national bibliographic databases has not yet been extensively researched, but Azeroual and Schöpfel (2019) shed some light on how representatives from 17 European institutions perceive the aspect of data quality in their CRISs. The survey showed that the institutions have several ways of supporting and improving the quality of the data stored in their CRISs, both through internal validation processes and by matching entries to external data.
The growth of OA both in terms of uptake and weight in science policy has introduced a need for new data fields and functions for publication data stored in CRISs. What makes recording of OA information in CRIS systems challenging is the variety of ways in which content can be made available OA, where mechanisms are not necessarily mutually exclusive. The OA status is also likely to change over time, with overlapping access mechanisms, and is not clearly or uniformly understood by all information providers (Ilva, 2017a). In addition to journals that publish all their content OA immediately, many subscription-based journals allow individual papers to be made OA on the publisher's website for a one-time fee: so-called hybrid OA. Subscription-based journals that allow self-archiving of manuscript versions of published articles may impose embargoes for the peer-reviewed postprint and publisher version, making them noncompliant with, for example, the Plan S requirements. It has also been observed that publishing in journals that allow self-archiving does not automatically mean that publications are actually deposited in OA repositories, highlighting a gap between potential and uptake (Björk, Laakso, et al., 2014; Laakso, 2014). OA versions of articles can also be provided on, for example, personal websites or academic social networks that do not guarantee persistent access.
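Because the mechanisms described above can overlap, a single "OA type" field is not enough to record them. The following is a minimal sketch of a record with independent flags, assuming hypothetical field names that are not taken from VIRTA or any actual CRIS schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OAStatus:
    """Hypothetical OA fields for one publication (not an actual CRIS schema)."""

    full_oa_journal: bool = False  # journal publishes all content OA (gold)
    hybrid: bool = False           # single article opened in a subscription journal
    self_archived: bool = False    # copy deposited in a repository (green)
    embargo_months: int = 0        # delay before the deposited copy becomes open

    def mechanisms(self) -> set:
        """Return every mechanism that applies; the result may hold several."""
        flags = {
            "gold": self.full_oa_journal,
            "hybrid": self.hybrid,
            "green": self.self_archived,
        }
        return {name for name, applies in flags.items() if applies}

    def immediately_open(self) -> bool:
        """True if open at publication: publisher-side OA or an unembargoed deposit."""
        publisher_side = self.full_oa_journal or self.hybrid
        repository_side = self.self_archived and self.embargo_months == 0
        return publisher_side or repository_side


gold_and_green = OAStatus(full_oa_journal=True, self_archived=True)
embargoed_green = OAStatus(self_archived=True, embargo_months=12)

print(sorted(gold_and_green.mechanisms()))   # ['gold', 'green']
print(embargoed_green.immediately_open())    # False
```

Keeping the flags independent lets the same record be counted under gold, green, or both, depending on how a given analysis defines its categories. The embargo field is one simple way to represent status that changes over time.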

LITERATURE REVIEW
This study focuses on the context of OA measurements at the country level. There are a number of earlier studies that have contributed to this line of research, where the goal has been to cover publication records for an individual country, or multiple individual countries but each country reported separately, and study OA from some perspective. The central methodological variation in earlier studies concerns mainly (a) the data source(s) used for the baseline publication records and (b) how the identification and classification of various OA mechanisms enabling access to these publications is implemented. No studies summarized here include content that might be retrievable from Sci-Hub, a pirate website running since 2011 containing 85% of articles published in subscription journals (Himmelstein, Romero, et al., 2018).
The written summaries, ordered chronologically, provide details on how each study has approached the two central factors of data source selection and OA identification. The summaries of research focusing on country-level OA measurement are divided into two subsections depending on whether the studies are based on WoS or Scopus data or whether they use publication data from CRIS. A third subsection is reserved for studies that do not provide country-level OA measurement but are in other ways relevant to the study. A fourth and final subsection is dedicated to sources describing the Finnish environment for academic publishing and research.

Studies Using Web of Science or Scopus
The United Kingdom has been a pioneer in implementing science policy measures facilitating OA, which has also led to many reports concerning monitoring the development over time. The most recent report by Research Information (2017) presents an analysis of the 2016 scholarly journal output by UK-affiliated authors, utilizing Scopus as the source of baseline publication data. To identify OA publications, the study adopted various methods which are documented in a methodological annex, making use of DOAJ, information on publisher websites, and manual sampling to estimate shares. Some 36% of articles were available from publishers as either gold OA, hybrid OA, or delayed OA, with a further 16% as green OA through online postings in line with journal policies. Although the study is efficient in differentiating between various OA mechanisms, being based on only Scopus-indexed outputs limits the level of insight it can provide about the entire scholarly publishing landscape. The data is also not made openly available.
In a broad study, Martín-Martín, Costas, et al. (2018) studied the OA status of 2,269,022 journal articles (including reviews) recorded in the three central WoS citation indexes for the years 2009 and 2014. For identification of OA availability, and provision mechanisms, the authors queried Google Scholar for each article in conjunction with matching to data from the DOAJ, CrossRef, OpenDOAR, and ROAR. The study found the world average of OA provided through publishers or repositories to be 35.8% of all articles published in 2014, with an additional 20% of articles being available through other freely accessible pages on the web indexed by Google Scholar. There was considerable variation in the OA levels among countries, where each publication was assigned to a country if at least one author was affiliated with an organization in that country. Focusing on the OA share provided by either publishers or repositories, the lower end of the spectrum was represented by Iran (18.6%), Russia (20.3%), and India (23.1%). The highest end of the spectrum was populated by Scotland (56.6%), England (50.9%), and Sweden (50.2%). Finland was not included among the 25 countries in the study. While the study incorporates one of the broadest lenses yet for identifying various OA mechanisms and reporting on them separately (breakdown is provided for gold OA, hybrid OA, delayed OA, bronze OA, green OA, and other free availability) it is limited by restricting the set of publications to those indexed in WoS and by only incorporating journal articles as publication type. The categories of bronze OA and other free availability consist of content to which access might be revoked at any time and their terms and licensing for redistributed openness are often unclear.
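The country assignment rule described above (a publication counts for a country if at least one author is affiliated there) can be sketched as follows. The records, country codes, and function name are hypothetical and serve only to illustrate the counting rule, not the authors' actual pipeline.

```python
from collections import defaultdict

# Hypothetical records: each article lists the countries of its authors'
# affiliations and whether an OA copy was found.
articles = [
    {"countries": ["FI", "SE"], "oa": True},
    {"countries": ["FI"], "oa": False},
    {"countries": ["SE", "SE", "NO"], "oa": True},
    {"countries": ["NO"], "oa": False},
]


def oa_share_by_country(records):
    """Count each article once per affiliated country and return OA shares."""
    totals = defaultdict(int)
    open_counts = defaultdict(int)
    for rec in records:
        # set() ensures a country with several authors is counted only once
        for country in set(rec["countries"]):
            totals[country] += 1
            if rec["oa"]:
                open_counts[country] += 1
    return {country: open_counts[country] / totals[country] for country in totals}


shares = oa_share_by_country(articles)
print(shares)
```

Under this rule an internationally coauthored article contributes to every participating country, so the per-country totals add up to more than the number of articles (six country counts for four articles in the example).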
Bosman and Kramer (2018) provide a study available in preprint form based on WoS journal publication data that includes longitudinal OA development for journal articles (+ reviews) in the period 2010-2017 for 76 individual countries. To identify OA content, the authors utilize oaDOI, which is a database that harvests information about OA versions available for articles based on DOI information from various openly available sources (including DOAJ and BASE; Impactstory, 2017). The oaDOI database and API have since been made part of Unpaywall. The results demonstrate a large discrepancy in OA levels between the countries. For European countries the spread was 20% (Romania) to 42% (Netherlands) for 2016. Finland had an OA share of 32% for the year 2016. The general longitudinal trend for all countries was of increasing OA share over time, outside of the most recent measurement year (2017), which the authors suggest to be due, for example, to certain time-bound OA mechanisms not being immediately in effect.
In a report incorporating research outputs spanning a decade, Science-Metrix (2018) presents a bibliometric study of the degree of articles being OA in WoS for the years 2006-2015, which includes a country-level analysis covering 20 countries. Finland was not among the studied countries. The world average was 41% for 2015, and this share included all versions of articles that can be downloaded for free from the web that have been harvested into the 1science database (which provides data for the 1findr product that is sold by Science-Metrix, which is now part of Elsevier, to help organizations discover OA content). The study also presents a table differentiating between gold OA and green OA shares between countries for WoS articles in 2014, where significant differences are present showcasing the results of different science policy approaches that countries have adopted to facilitate OA.
Demonstrating the variety of ways in which OA shares can be measured for a set of publications, van Leeuwen, Tatum, and Wouters (2018) compared the use of three different bibliographical methods to assess gold OA publishing at the national level, focusing on research output from the Netherlands, Denmark, and Switzerland for 2000-2013. The three approaches differ in how they rely on either only one or multiple of the following data sources: WoS publication data, DOAJ data, and a customized WoS database hosted at the Centre for Science and Technology Studies (CWTS) at Leiden University. While the three approaches differ in how the OA status of publication records is obtained (the first on OA journal status data in WoS, the second on DOI matching of articles to DOAJ journals, and the third based on ISSN matching of journal articles to DOAJ), they are all limited to the realm of publications included in WoS. Each of the three approaches had their individual pros and cons, with no approach being a full replacement of the others. The authors conclude with a discussion about the potential for utilizing CRIS data for similar purposes in the future to get around the limitations in publication data and OA identification.
While not an academic study, the previously mentioned European Open Science Monitor provided by the European Commission is a resource for regularly updated country-level metrics on OA development (European Commission, 2019). The use of Scopus data provided by Elsevier for the underlying journal publication data comes with limitations on the index coverage as well as potential conflicts of interest, with Elsevier being the largest scholarly journal publisher with a large ongoing influence on the OA landscape (Mongeon & Paul-Hus, 2015;Tennant, 2018). The monitor presents its results split into two categories: gold OA and green OA. The methodology documentation describes that DOAJ, ROAD, PubMed Central, Crossref, and OpenAIRE are used as the main data sources for identifying OA status (European Open Science Monitor, 2018). The monitor has recently also added Unpaywall to its list of data sources and pins the global OA share of 2017 publications to 35.7%, with 13.9% of articles being available as gold OA and 24% as green OA. Shares for 36 individual countries are only given for one period (2009-2017) as a single snapshot, which makes it hard to perceive recent developments. Most countries have similar shares of gold OA publishing in this timespan and the largest differences are based on variation in the level of green OA. The top end is populated by Switzerland (52.2%), the United Kingdom (50.9%), and Denmark (47.7%), while the lower end contains Russia (21.3%), China (22.9%), and India (30.1%). Finland was measured to have an OA share of 41.6%, with 11.2% as gold OA and 31.4% as green OA.
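The gold and green shares quoted above add up to more than the reported totals. One possible reading, which is an assumption here and not something stated by the monitor, is that the total counts each publication once while some publications fall into both categories. Under that assumption the implied overlap follows from inclusion-exclusion, as in this small sketch (the function name is hypothetical).

```python
def implied_overlap(gold: float, green: float, total: float) -> float:
    """Percentage points counted in both categories, ASSUMING total = gold OR green."""
    return round(gold + green - total, 1)


# Figures as quoted in the text (percent of publications).
global_overlap = implied_overlap(13.9, 24.0, 35.7)
finland_overlap = implied_overlap(11.2, 31.4, 41.6)
print(global_overlap, finland_overlap)
```

Read this way, about 2.2 percentage points of the global share and about 1.0 percentage point of the Finnish share would be available as both gold and green OA. Other explanations, such as rounding or differing reference sets, cannot be ruled out from the figures alone.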

CRIS-Based Studies
In a Swedish-language report by the National Library of Sweden, Kronman (2017) analyzed CRIS publication data for 2010-2016 concerning peer-reviewed articles (including articles in conference proceedings) of 42 research-performing organizations in Sweden. The OA status of 278,195 articles was assessed by augmenting the OA publication metadata found in SwePub with matching to oaDOI. Of the articles for 2010-2016, 39% could be matched through oaDOI, with 14% being in full OA journals, 22% green open access (uniquely available), and 3% hybrid OA. This study demonstrates that combining CRIS data with external sources for article-level OA identification is possible. The main drawback of utilizing oaDOI for this purpose is being limited to articles that have been assigned a DOI.
The most comprehensive analysis so far concerning OA in the Finnish publication landscape is a study by Ilva (2017b), which is available in Finnish. The author provides an overview of summarized data for the publication year 2016 of all publication types based on CRIS data reported from all universities, universities of applied sciences, and research centers in Finland. The data includes OA status information submitted by the organizations themselves, with each publication being published in an OA journal or as hybrid OA, and/or self-archived in a repository. An embargo is allowed for self-archiving but not for full OA journals, which excludes OA provided through delayed OA journals (Ilva, 2017b). The study provides a breakdown of OA availability of peer-reviewed articles published by university-affiliated authors, with 18.7% of articles being in full OA journals, 3.4% on the publisher's website but not in a full OA journal (e.g., hybrid OA), and 18.5% self-archived to a repository. This study demonstrates the viability and challenges of basing national OA measurement on CRIS data alone, avoiding many of the limitations in scope concerning which publications are included, but with some added ambiguity in OA identification, as data is self-reported in a decentralized way and can contain inconsistencies.
Mikki (2017) studied the openness of 70,882 journal articles published by Norwegian authors for 2011-2015 by analyzing data reported from the CRIS systems of Norwegian institutions and querying Google Scholar with either the DOI or name of the article to determine openness status. The study did not distinguish between OA mechanisms and found that 67.6% of all articles were openly available in some full-text form through Google Scholar. The web domains providing the most articles for download were researchgate.net and academia.edu, suggesting that a notable part of the measured OA is likely not in line with publisher policies. The study also included analysis of OA shares across the 15 largest publishers, research disciplines, and the four largest universities in Norway. While the author found the organizational variance in OA shares to be fairly even, disciplinary differences were notable (high shares for natural sciences and technology, low shares for SSH). In terms of publisher proportions of OA, the variation among the 15 largest publishers ranged from over 70% of articles with only paywalled article access available (Routledge and Universitetsforlaget) to corresponding shares of 35% and 32% for Elsevier and Springer.
A further study, built upon a Norwegian CRIS-based publication data set for journal articles similar to that in Mikki (2017), is by Mikki, Gjesdal, and Strømme (2018), which extends the analysis by one additional year of journal publications (2011-2016). In this study a comparison is made between the capabilities of Google Scholar, oaDOI, and 1findr to retrieve OA copies of articles documented in the data set describing 87,439 journal articles. Google Scholar was found to be the best at retrieving full-text OA copies of articles queried, doing so for 70% of all queries. The corresponding figures for 1findr and oaDOI were 52% and 31%.
In a recent report for The Association of Universities in the Netherlands (VSNU), Bosman and Kramer (2019) evaluated the OA status of all publications from 2017 by Dutch universities assigned with a DOI in WoS, Scopus, or Dimensions. OA status was assessed by querying each DOI to the Unpaywall database in June/July 2018, finding that the OA share of article publications varied between 45% and 55% across publications included in the three databases. The report acknowledges that relying on publications with DOIs favors article publications over other publication types, which in turn likely yields a higher OA share than would be observed if all publication types were more comprehensively included.
Sivertsen et al. (2019) explored CRIS data for Finland, Flanders, Norway, and Poland, where identification of OA status is managed by comparing journal records to those indexed in the DOAJ. The results are presented per country per discipline, where the share of articles in DOAJ-indexed journals ranged from 5.7% (Social sciences, Flanders) to 17.3% (Medical & health sciences, Norway). The study found that publishing in full OA journals was on the rise in all four countries. As disciplinary OA shares varied between countries, the authors suggest that uptake of OA should not be seen as exclusively steered by the availability of OA outlets within a discipline, but rather as also depending on local and contextual factors.
Based on this review of earlier country-level studies it can be concluded that national information sources remain underexploited in analysis of OA, and previous studies have focused predominantly on journal publishing (with the exception of Ilva [2017b]). The focus on journal articles is explained partly by OA policies and research funder mandates that have, so far, mainly concerned only this publication type. The reliance on Scopus and WoS for defining which publication outputs are considered and included is also a common limitation among many studies. Approaches to defining OA mechanisms are very heterogeneous across the studies, but a common issue is a lack of comprehensiveness. Publication types other than journal articles are often excluded in the initial stages of studies. Though DOIs are increasingly common, not all journal articles have them (see, e.g., Boudry & Chartron, 2017). The use of CRIS data in the context of national OA measurement has so far been limited to Nordic countries, where CRIS use has long been part of practice and reporting routines, with the exception of Sivertsen et al. (2019), where CRIS data for Finland, Flanders, Norway, and Poland were explored through the lens of articles being published in DOAJ-included journals.

Other Relevant Studies
In a recent preprint, Huang, Neylon, et al. (2020) thoroughly evaluate the coverage discrepancies between WoS, Scopus, and Microsoft Academic. The results show that the databases differ considerably in coverage, which suggests that any bibliometric evaluation of organizational or country output aiming to be comprehensive should include publication data from several sources rather than relying on just one. An interesting finding was that Microsoft Academic contained the most unique DOIs of the three databases, including, in particular, more book chapters and conference proceedings than the other two. Given the lack of standardized CRIS data across institutions and countries, it thus seems that Microsoft Academic is the best current solution regarding output comprehensiveness.
The findings and implications of Huang et al. (2020) have much in common with those of the report by Bosman and Kramer (2019), which focuses mainly on comparing nationally aggregated Dutch publication data with the indexing coverage of WoS, Scopus, and Dimensions. Coverage varied considerably between the databases, each often showing strengths in indexing specific publication types, and the study found that only 43% of publications with a DOI were identifiable in all three databases. Also, when any of the international indexes was compared to the nationally aggregated data, nonarticle output in particular, and even very substantial shares of Arts/Humanities journal articles, were left out of the population if inclusion was restricted to items with DOIs. The report also provides a brief inquiry into the comprehensiveness of LENS, BASE, NARCIS, and OpenAIRE, but further insight into the comprehensiveness of these databases is limited due to lack of reliable affiliation identification. The reports provide evidence for strong disciplinary differences in publication types, which together with the knowledge that international bibliometric indexes are limited and skewed in their comprehensiveness, should have implications for using bibliometric databases for assessing and potentially influencing publication behavior with policy interactions.
In a study looking into facilitating factors for consistent institutional use of CRIS systems and OA policies in three countries (Italy, the Netherlands, and Germany), Biesenbender, Petersohn, and Thiedig (2019) concluded that such practices are particularly facilitated if national evaluation or quality assessment policies are in place. As the next section will describe, this is very much the case in the context of Finland. The authors highlight that the role of CRIS data is often overlooked in the context of open science development, even though such data, in conjunction with self-archiving in repositories, already play a major role, with a lot of future potential for growth. Crawford (2019) provides an extensive analysis of all journals included in the DOAJ, including annual publication volumes for each journal and detailed breakdowns of differences between research areas and regions of the world. The study includes thorough analysis of journals and articles that are published in journals that are free for authors, and the pricing levels for journals that charge APCs. Based on Crawford (2019), there were 11,465 active journals in 2018 across all major disciplines, most of which were free for authors. There are often accessible OA journals available for researchers to publish in, but there might be incentives other than openness guiding their publishing preferences.
Despite not including a country analysis, the most recent and robust measurement of OA availability of journal articles provided by Piwowar et al. (2018) warrants highlighting. The study is relevant by demonstrating how a wide breadth of various OA mechanisms can be classified and studied by using the Unpaywall API for articles with DOIs. As we pointed out in the introduction, the main limitation of using Unpaywall for OA measurement is the reliance on publications having DOIs. Important information in OA measurement studies is the breakdown of which mechanisms OA is being provided through, but arguably equally important is to look at the share and likely reasons why certain parts of the literature have not been made available. Laakso (2014) provides an analysis of the maximum potential for self-archiving journal articles among the 100 largest journal publishers indexed in Scopus. While the results of this study are already outdated, the methodological concept of calculating article-level realized and unused potential based on publisher self-archiving policies is something that the current study will carry forward.

The Context of Finland
As this study concentrates on the publication output of researchers at Finnish universities, it is beneficial to briefly describe the national science policy environment, and in particular how OA has become an important part of it over time. Like many European countries, Finland has been at the forefront of developing national strategies for advancing OA. In 2014-2017, the Ministry of Education and Culture funded a national project, the Open Science and Research Initiative, which set ambitious national targets for the share of open access research publications: 65% in 2017, 75% in 2018, and 100% in 2020 (Ilva, 2017b).
Finnish universities and universities of applied sciences receive a substantial part of their public performance-based funding on the basis of their publication activities, which is one of the reasons that CRIS data in Finland is so comprehensive compared to, for example, WoS or Scopus. Like in Norway and Denmark, a nationally constructed rating based on evaluation by panels of experts, referred to as the Publication Forum (in Finnish Julkaisufoorumi, or JUFO), is used in Finland to categorize publication channels (i.e., journals, book publishers) into four different levels, which determines the weight of individual peer-reviewed outputs for calculating public funding (Pölönen, 2018). Based on calculations from realized funding from 2016, a top-ranked article generated approximately €17,000 for institutions with an affiliated author or coauthor on such a publication, while the three lower levels were approximately €12,600, €4,200, and €420 respectively (Seuri & Vartiainen, 2018). Because of this, there is a strong motivation for the organizations to provide comprehensive data on their publications on time. Universities have reported the OA status of their publications since 2011, but the data fields used in the collection of publication data were changed from 2016 onwards to give a more comprehensive picture of both OA journal publishing and/or self-archiving through the data (Ilva, 2017b; VIRTA Wiki, 2018). Recently, the Finnish government approved a revised funding model for allocating core funding annually to universities in 2021-2024, which incorporates an extra 20% weight for the funding contributed by each publication if it is reported as being available OA (Ministry of Education and Culture, 2019a), accepting gold, green, and hybrid OA.
In Finland, university contracts with international journal publishers are mostly handled centrally by FinELib, a consortium of Finnish universities, research institutions, and public libraries. Finland has been among the pioneers in making the costs of all publisher agreements publicly available since 2016 (Etsin, 2018). FinELib is a signatory of the OA2020 initiative and has included OA elements as part of the negotiated contracts since at least 2015, aiming to include substantial OA publishing elements into all new agreements (FinELib, 2019). Given that the five largest international commercial publishers account for more than half of the global journal output indexed in WoS (Larivière, Haustein, & Mongeon, 2015), most attention at both the international and national levels is focused on negotiating with these publishers to enable OA options.
The Academy of Finland, the major national research funder, has been mandating OA for funded research projects since 2015, accepting both green OA and gold OA as viable paths to fulfilling the requirement (Academy of Finland, 2019). The Academy of Finland became a signatory of Plan S soon after the initial plan was revealed.
For national journals, there have not been strong financial incentives to convert to OA (e.g., major funding mechanisms requiring it). Nevertheless, the Federation of Finnish Learned Societies allocates state subsidies annually to journals and book series, one of the criteria being an open access plan. In a recent study of Nordic peer-reviewed OA journals, which included a subset of journals published in Finland, Björk (2019) calculated that 97 out of 334 (30%) journals were published as full OA in the autumn of 2018. A centralized publishing platform, Journal.fi, is available for any national journals that are OA with a maximum delay of 12 months from publication.
A consortium-based funding model for journals' transition to OA is still being sought. Since 2018, the Federation of Finnish Learned Societies has organized national coordination for the open science agenda in Finland, which recently produced the National policy and executive plan 2020-2025 (Open Science Coordination in Finland, 2019). The agreed objective is that "no later than 2022, all new scientific articles and conference publications will be immediately openly accessible" with CC-license, and that "the research community creates a jointly funded publishing model that enables immediate open access to research articles published in Finland."

RESEARCH QUESTIONS, DATA AND METHODS
Our introduction and literature review show that national bibliographic databases provide potential but have remained an underexploited information source to study OA at the national level. Given that Finland has very comprehensive CRIS data that is aggregated nationally, with standardized OA status information being included since 2016, it is a unique opportunity to explore the most central questions concerning such CRIS data from an OA perspective. Our research questions concerning Finnish peer-reviewed outputs published in 2016-2017, and aggregated at the national level in the VIRTA publication information service, are the following:
RQ1: What is the added value of institutional data for the study of OA at the national level? First, we establish the number of different types of publication channels and outputs. Second, we establish the share of outputs published in WoS and Scopus-indexed journals, and the share of outputs that do not have DOIs. Third, we estimate what difference the additional publication data from VIRTA makes with regard to OA levels, by comparing the OA share of journal articles included in WoS and Scopus with articles not included in these databases, and by comparing the OA share of journal articles that have or do not have DOIs.
RQ2: What is the share of OA outputs across all fields, publication types, languages, and OA mechanisms? First, we establish the overall OA share of peer-reviewed outputs in different fields, and how OA share differs between journal articles, conference articles, and book publications, as well as between English, Finnish, Swedish, and other publication languages. Second, we analyze what share of journals/series and book publishers are identified in VIRTA data as gold, hybrid, and green channels. (The definition of these categories is provided below.) Third, we investigate how large a share of journals/series and book publishers have all Finnish outputs OA, have only closed outputs, or have both OA and closed outputs, and how OA level differs between gold, hybrid, and green channels.
RQ3: What is the coverage of sources for gold and green OA journals? First, we establish the total number and share of gold OA journals that can be identified based on DOAJ, the Bielefeld list, and VIRTA data. Second, we investigate the OA share of outputs in gold OA journals based on DOAJ, the Bielefeld list, and VIRTA. Third, we establish the coverage of Sherpa/Romeo color codes and the OA share of outputs in journals with different types of self-archiving policies.
RQ4: How dominant are the largest international commercial publishers? First, to establish the publishers' market shares we investigate what share of journal articles, conference articles, and book publications, as well as of outputs in different languages, are published with the six largest commercial publishing companies (Elsevier, Springer Nature, Wiley-Blackwell, Taylor & Francis, Sage, and ACS). Second, we analyze what share of outputs by these and other publishers are OA in gold, hybrid, and green publication channels. Third, we investigate the role of Finnish journal and book publishers compared to the "big" publishers.
The data consist of unique peer-reviewed outputs published in 2016-2017 that the 14 Finnish universities have reported to the Ministry of Education and Culture and that are stored in the national VIRTA publication information service (Sīle et al., 2017, 2018; Pölönen, 2018). Inclusion criteria for publications are provided by the Ministry of Education and Culture in the data collection guidelines. Universities can report all single-authored or coauthored outputs by the academic and administrative staff, including doctoral students in their service or having another contractual relationship with them (Ministry of Education and Culture, 2019b).
In VIRTA, copublications of Finnish universities appear as duplicates. However, duplicates are automatically identified on the basis of publication information and indicated in the data. In this study, we use deduplicated publication counts. For each publication, the reporting university has indicated the publication type, OECD field of science, peer review status, and open access. This study includes peer-reviewed articles in journals, books, and proceedings, as well as monographs and edited works from all fields of science.
The data for publication years 2016 and 2017 was downloaded in July 2018 from the website https://wiki.eduuni.fi/display/cscvirtajtp/Vuositasoiset+Excel-tiedostot, where the data sets for each publication year used as the basis of the performance-based research funding system (PRFS) are openly available in Excel format. CSC - IT Center for Science exports these data sets from VIRTA after the data collection needed for the calculation of performance-based funding is complete and makes them available on the website. Each reported output needs to be associated with information concerning the publication being openly available immediately on the publisher's website in either a gold or hybrid OA publication channel. Publication channel is used as an umbrella term for serials with an ISSN as well as book publishers with ISBN roots: journals, proceedings series, book series, and imprints. Further, information regarding the output being openly available in an OA repository is also included for each publication record. Embargoed outputs are allowed as long as a stable URL to the resource is provided. Detailed information on embargo length or OA licenses, however, is not available in the data. Consequently, it is possible to establish if a peer-reviewed publication is openly available in a gold OA or hybrid channel, deposited in a repository, or both.
Based on the VIRTA OA information we classify outputs into five exclusive categories:
• VIRTA gold: outputs indicated as being immediately openly available in a gold OA channel where all outputs are OA
• VIRTA hybrid: outputs indicated as being immediately openly available in a hybrid OA channel, including both OA and closed outputs
• VIRTA gold and hybrid: outputs with authors from more than one Finnish university that indicated the same output differently as being immediately openly available in a gold OA or hybrid OA channel
• VIRTA green: outputs indicated as being openly available in an OA repository and not indicated as being openly available in a gold OA or hybrid OA channel
• VIRTA closed: outputs not indicated as being openly available in a gold OA or hybrid OA channel, or in an OA repository
These categories broadly correspond to the existing OA categories as defined, for example, by Piwowar et al. (2018), with the exception that VIRTA gold includes outputs in any channel where all outputs are immediately OA, not only outputs in DOAJ indexed journals. Thus, VIRTA gold also includes bronze OA, as well as diamond/platinum OA channels that do not charge authors article processing charges (APCs). VIRTA hybrid and green quite closely correspond to Piwowar et al.'s (2018) definitions, and VIRTA has a similar definition of closed (this includes outputs available via Academic Social Networks and Sci-Hub). In addition to analyzing the OA share of outputs, we also use VIRTA OA information to assess the OA status of publication channels (journals/series and book publishers).
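The five-way classification can be illustrated with a minimal sketch. The function name, the flag values ("gold", "hybrid", "none"), and the input layout are assumptions made for illustration only; they are not the actual field names or codes of the VIRTA data sets.

```python
from typing import Iterable


def classify_output(channel_flags: Iterable[str], in_repository: bool) -> str:
    """Assign one deduplicated output to one of the five exclusive categories.

    channel_flags: publisher-site OA value reported by each university for
        the same output; hypothetical codes "gold", "hybrid", or "none".
    in_repository: True if any report indicates a copy in an OA repository.
    """
    flags = set(channel_flags)
    has_gold = "gold" in flags
    has_hybrid = "hybrid" in flags

    if has_gold and has_hybrid:
        # Universities reported the same output differently.
        return "VIRTA gold and hybrid"
    if has_gold:
        return "VIRTA gold"
    if has_hybrid:
        return "VIRTA hybrid"
    if in_repository:
        # Green only applies when no gold or hybrid availability is reported.
        return "VIRTA green"
    return "VIRTA closed"


print(classify_output(["gold", "hybrid"], in_repository=True))
print(classify_output(["none"], in_repository=True))
```

Checking the publisher-site flags before the repository flag is what makes the categories exclusive: an output that is both gold and self-archived is counted once, as gold.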
Universities take responsibility for the OA status indicated for publications they report to the ministry. The identification of OA publications takes place at the universities and involves both researchers' self-reporting and validation by the data collection personnel from the university libraries. We know from the outset that there are some discrepancies in the identification of OA categories in the VIRTA data, as two Finnish universities may have reported the same output differently as being immediately OA in a gold or hybrid channel (the category VIRTA gold and hybrid). As Ilva (2017a) has noted earlier, self-reported data can contain some inconsistencies that would warrant future study in detail; however, in this study we use the registered data as-is in order to obtain an unmodified baseline measurement.
In VIRTA, the publication channel (journal/series or book publisher) of each peer-reviewed output has been identified by matching the publication's bibliographic metadata to the Publication Forum authority list of publication channels. The authority list covers all journals/series and book publishers actually used by researchers affiliated with the 14 Finnish universities. Journals/series include mostly journals, but also some book series with ISSNs, as well as some conferences without ISSNs. Book publishers mostly have a registered ISBN prefix. For journals/series with ISSNs, the Publication Forum channel register contains the name of the publisher retrieved from the International ISSN Centre. We have complemented the ISSN Centre data with publisher information from the Scopus journal list. It is also indicated if the channel is included in DOAJ (DOAJ.org, 2019), the Bielefeld list of OA journals (Rimmert, Bruns, et al., 2017), and what the self-archiving policy is according to Sherpa/Romeo color codes (Sherpa.ac.uk, 2019).

The Added Value of Institutional Data
In 2016-2017, the 14 Finnish universities published 48,177 unique peer-reviewed outputs in 10,342 publication channels, of which 91.9% are journals/series and 8.1% are book publishers. Of the outputs, 83.5% are associated with journals/series, and 16.5% with the book publishers (Table 1). Of all outputs, 71.6% are journal articles, 13% proceedings articles and 15.3% are book publications. Practically all journal articles, 57.9% of proceedings articles, and 28.4% of book publications are associated with journals/series. 71.6% of the book publications are associated with book publishers.
Only 62% of the 48,177 peer-reviewed outputs are published in journals indexed in Scopus and 52% in WoS journals (Figure 1). We find that VIRTA brings added value in terms of coverage compared to WoS and Scopus in all fields, but the differences are most important in SSH fields. We also looked at DOI availability. Two-thirds (67%) of the peer-reviewed outputs have a DOI reported in VIRTA; however, DOIs are available more often for articles in journals (77%) and proceedings (60%) than for articles in books and monographs (22%). We also discovered that DOI availability is much more limited in the case of Finnish and Swedish language outputs (2.2%) than outputs in English (74.4%) and other languages (15.2%). In all, 69.6% of all OA outputs in VIRTA have a DOI. Note, however, that DOI is not a mandatory field in the data collection (as not all outputs have DOIs), so some outputs may have been reported to VIRTA without a DOI even if they might have one.
According to VIRTA data, the OA share among the 24,832 journal articles published in WoS indexed journals is 33%, while among 28,366 articles in Scopus indexed journals the OA share is 35%. Among 9,675 articles published in journals not indexed in WoS and 6,141 articles in journals not indexed in Scopus the OA share is 52%. This result suggests that studies based on WoS and Scopus data may underestimate the OA share of journal articles. Comparison of OA shares between the 26,705 journal articles with DOI (38%) to the share of 7,802 articles without a DOI (37%) suggests that the availability of DOIs does not seem to make a difference with regard to OA levels.
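As a rough consistency check, the subgroup figures above can be recombined into an overall OA share for journal articles. The sketch below uses only the counts and rounded percentages quoted in this paragraph; because the percentages are rounded, the results are approximate, and the helper name `combined_share` is ours.

```python
def combined_share(groups):
    """Weighted OA share over groups given as (n_articles, oa_share) pairs."""
    total = sum(n for n, _ in groups)
    return sum(n * share for n, share in groups) / total


# (number of journal articles, reported OA share) for each split of the data
by_wos = combined_share([(24_832, 0.33), (9_675, 0.52)])
by_scopus = combined_share([(28_366, 0.35), (6_141, 0.52)])
by_doi = combined_share([(26_705, 0.38), (7_802, 0.37)])

for label, value in [("WoS split", by_wos), ("Scopus split", by_scopus), ("DOI split", by_doi)]:
    print(f"{label}: {value:.1%}")
```

Each split covers the same 34,507 journal articles, and all three recombine to roughly 38%, which indicates that the subgroup figures are mutually consistent.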

OA Levels Across Fields, Publication Types, Languages, and OA Mechanisms
Of all 48,177 peer-reviewed outputs published in 2016-2017, one-third are reported in VIRTA as being OA (33.6%) and two-thirds are reported as being closed (66.4%; Figure 2). Overall, the differences between fields are not great. Nevertheless, Natural sciences (39.2%) and Medicine (37.2%) have the largest, while Social sciences (31%), Humanities (29.8%), and especially Engineering (26%) have the smallest shares of OA outputs.
The differences between fields are at least partly explained by differences in OA levels between publication types: The share of OA outputs is larger among journal articles (38.2%) than among conference articles (28.6%) and book publications (16.5%). The differences between the two dominant publication languages of Finnish researchers also play a role: A larger share of English (34.2%) than Finnish (28.3%) language publications are OA. The numbers of publications in Swedish (OA share 41.1%), which is the other national language in Finland, and in other languages (OA share 26.6%) are much smaller. Across all fields and publication types, gold OA is the most common OA type, followed by green OA, while hybrid OA is the least common type (Table 2). There are, however, some differences in relative share of different OA types between fields, publication types, and languages.
Of the 10,342 publication channels the Finnish researchers used in 2016-2017, 21.9% are identified in VIRTA as gold OA channels, 2.8% are identified as both gold and hybrid channels (an indication that their OA status is ambiguous), 11% as hybrid OA channels, and 14.6% as neither gold nor hybrid but have self-archived OA outputs (Table 3). In the case of journals/series, there are relatively small differences in the use of different OA channel types between fields: Humanities has the largest share of gold OA channels and the smallest share of hybrid OA channels.
For book publishers there is no comprehensive source on OA status, such as DOAJ for journals, but VIRTA data can shed some light on the OA categories of 842 book publishers used by the Finnish researchers (Table 3). According to the VIRTA data, 21.1% of these publishers are gold OA channels, 0.7% have been identified as both gold and hybrid OA channels, and 0.5% have been identified as hybrid channels. Furthermore, 10.6% of the book publishers have outputs indicated as being self-archived in an OA repository. Our analysis (below) of OA levels among book publishers identified in VIRTA with different types of OA mechanisms suggests, however, that application of OA categories (gold, hybrid, and green) is very problematic in the case of book publishers.

OA levels of publication channels and OA categories
In the VIRTA data there is some evidence of OA of outputs for about half of the 10,342 publication channels that Finnish researchers have used in 2016-2017 (Figure 3). But there is considerable variation in the share of Finnish outputs that are reported as being openly available in different channels. In roughly one-fourth of the channels (24.7%), all Finnish outputs in VIRTA are indicated as being OA; however, in one-fourth (25.5%) of the channels, the OA of the published outputs from Finland is only partial (less than 100% but more than 0% of outputs are OA). Half (49.8%) of the publication channels do not have any publications reported in VIRTA as being OA via the gold, hybrid, or green routes. This pattern is observed, more or less, in all the main fields, although the share of channels with no reported open access is somewhat larger in SSH. This is likely due to OA being more restricted in the case of book publishers than journals/series.
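The three-way grouping of channels used above (all outputs OA, partial OA, no OA) can be sketched as follows. The input format, a list of (channel, is_oa) pairs, and the function name are illustrative assumptions rather than the structure of the VIRTA export.

```python
from collections import defaultdict


def channel_oa_levels(outputs):
    """Group channels by the share of their outputs reported as OA.

    outputs: iterable of (channel_id, is_oa) pairs, one per output.
    Returns a dict mapping channel_id to "all OA", "partial OA", or "no OA".
    """
    counts = defaultdict(lambda: [0, 0])  # channel -> [OA outputs, all outputs]
    for channel, is_oa in outputs:
        counts[channel][0] += int(is_oa)
        counts[channel][1] += 1

    levels = {}
    for channel, (n_oa, n_total) in counts.items():
        if n_oa == n_total:
            levels[channel] = "all OA"
        elif n_oa == 0:
            levels[channel] = "no OA"
        else:
            levels[channel] = "partial OA"  # more than 0% but less than 100%
    return levels


# Toy data with made-up channel identifiers
sample = [("A", True), ("A", True), ("B", True), ("B", False), ("C", False)]
print(channel_oa_levels(sample))
```

Note that a channel with a single Finnish output can only fall into the "all OA" or "no OA" group, which is worth keeping in mind when interpreting the channel-level shares.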
There is also a considerable difference in the share of openly available outputs according to the OA status of the channel based on VIRTA, as well as according to publication channel type (Figure 4). The share of outputs indicated as being openly available in VIRTA is largest in the identified gold OA channels, followed by hybrid OA channels, and is smallest in green channels with only self-archived outputs. The same is observed in the case of both journal and book publishers, but the overall share of OA outputs is much smaller among book publishers.
In principle, all outputs published in gold OA channels should be immediately openly available (this is also the VIRTA definition of gold OA). The results, according to which there are gold OA channels with outputs that are not indicated in VIRTA as being openly available, suggest that some outputs have not been correctly identified as being OA, or that some of the channels have not been gold OA during the whole period of 2016-2017. The low share of OA outputs for book publishers identified as gold OA suggests that identifying OA categories is problematic for book publications (monographs and articles in books).

Coverage of Information Sources for Gold and Green Journals
DOAJ, Bielefeld, and VIRTA as sources of gold OA journals
DOAJ-indexed journals cover 12.5% of all peer-reviewed outputs, and 35.6% of outputs that are OA according to VIRTA. However, DOAJ does not cover all OA journals. Of all 9,500 journals/series used by Finnish researchers, 1,237 are gold OA journals indexed in DOAJ (Table 4). Furthermore, 372 journals/series are included in the Bielefeld list but are not indexed in DOAJ. In addition, 752 journals/series can be identified as gold OA channels based on the VIRTA data (including gold/hybrid OA journals). Combining all three information sources it is possible to identify 2,553 potential gold OA journals, of which 48% are based on DOAJ, 15% are based on the Bielefeld list, and an additional 37% are based on VIRTA (Figure 5). This finding suggests that neither the DOAJ nor the Bielefeld list cover all gold OA journals. It is important to note, however, that it has not been possible for us to manually verify the OA status of the additional 752 journals identified as gold OA channels in VIRTA. We do not know how many of them, if any, would fulfil all the DOAJ inclusion criteria.
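The way the three sources are combined, with each journal attributed to the first source that lists it (DOAJ, then the Bielefeld list, then VIRTA), can be sketched with plain set operations. The function name and the journal identifiers below are invented for illustration and do not come from the data.

```python
def attribute_gold_sources(doaj, bielefeld, virta_gold):
    """Attribute each potential gold OA journal to one source by priority.

    Each argument is a set of journal identifiers (e.g., ISSNs).
    Priority order: DOAJ first, then Bielefeld, then VIRTA.
    """
    bielefeld_only = bielefeld - doaj
    virta_only = virta_gold - doaj - bielefeld
    return {"DOAJ": set(doaj), "Bielefeld": bielefeld_only, "VIRTA": virta_only}


# Made-up identifiers, not real ISSNs
doaj = {"j1", "j2", "j3"}
bielefeld = {"j3", "j4"}
virta_gold = {"j1", "j4", "j5", "j6"}

groups = attribute_gold_sources(doaj, bielefeld, virta_gold)
total = sum(len(journals) for journals in groups.values())
for source, journals in groups.items():
    print(f"{source}: {len(journals)} journals ({len(journals) / total:.0%})")
```

Because the groups are disjoint by construction, their sizes add up to the size of the union of the three sources, so the source shares sum to 100%.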
Analysis of VIRTA OA data suggests that the inclusion of journal/series in DOAJ is the best indicator of gold OA journals and a good predictor of OA level, as 95.9% of outputs published in DOAJ-indexed journals are actually indicated in VIRTA as being openly available (Table 4). For the Bielefeld listed journals the OA share of outputs in VIRTA is also high (77.9%), but not as high as attested in the case of DOAJ journals. The OA share of outputs published in journals/series as gold OA channels based only on VIRTA is only 54%. The OA share of outputs is considerably lower for journals identified based on VIRTA as hybrid OA (35.6%) or green OA (27.3%).

Sherpa/Romeo color codes
Sherpa/Romeo codes indicating self-archiving policies cover 7,537 journals/series (79% of all journals/series) used by Finnish researchers (Table 5). Sherpa/Romeo includes almost all DOAJ journals (95%), and a considerable share of Bielefeld-listed journals (43%). The self-archiving policy as indicated by the color codes does not, however, make a great difference with regard to the OA share of outputs published in journals, especially if we look at journals/series not included in DOAJ or the Bielefeld list (Figure 6). This is because the share of OA outputs is much larger for the gold OA journals included also in DOAJ and the Bielefeld list, than for the other channels included in Sherpa/Romeo, in which OA is more dependent on self-archiving. This result is likely also valid with regard to the recently launched new version of Sherpa/Romeo, which was introduced after the analysis of this study.

Dominance of the Largest Commercial Publishers
Publication channels owned by Elsevier account for 19.4% of the 14 Finnish universities' journal outputs in all fields of science counted together (Table 6). Next come Springer Nature (13.2%), Wiley-Blackwell (9%), and Taylor & Francis (6.7%). Sage and the American Chemical Society (ACS), which are often also considered among the "big" commercial publishers, account for 2.7% and 1.9% respectively. Taken together, these publishers account for 53% of peer-reviewed journal output. In the case of peer-reviewed book publications and conference articles their dominance is weaker: 29.5% and 12% of all outputs respectively. If we take into account all publication types, the big publishers' joint share of Finnish output diminishes to less than half (44.1%).
VIRTA data also suggest that the commercial publishers included in this study are most dominant in Medicine and Agriculture, and least dominant in the Social sciences and especially the Humanities. Thus, our study corroborates the findings of Larivière et al. (2015) concerning the Humanities being the field least dominated by the big publishers. In our analysis, however, Social sciences is among the least, not the most, dominated fields (this holds true even if we limit our analysis to journal articles).
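The joint share of the six large publishers in journal output follows directly from the individual shares quoted above. The short sketch below only re-adds the rounded percentages from the text; the variable names are ours.

```python
# Share of journal outputs per publisher (%), as quoted in the text
journal_shares = {
    "Elsevier": 19.4,
    "Springer Nature": 13.2,
    "Wiley-Blackwell": 9.0,
    "Taylor & Francis": 6.7,
    "Sage": 2.7,
    "ACS": 1.9,
}

joint_share = sum(journal_shares.values())
top_four = sum(sorted(journal_shares.values(), reverse=True)[:4])

print(f"Six largest publishers: {joint_share:.1f}%")
print(f"Four largest publishers: {top_four:.1f}%")
```

The four largest publishers alone account for just under half of journal output, so adding Sage and ACS is what brings the joint share above 50%.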
The dominance of big publishers is limited to English-language publications (Table 6), whereas in SSH research results are also communicated in languages other than English (in Finland notably in the national languages, Finnish and Swedish). VIRTA data shows the important role of Finnish journal and book publishers for scholarly communication at the national level: They account for almost 12% of Finnish universities' peer-reviewed publication output (Figure 7). They are practically the only publishers providing outlets for, and access to, research results in Finland's national languages. Their role is also particularly important in book publications (monographs, edited volumes, chapters).
The share of OA outputs is smaller for the big commercial publishers, with the exception of Springer Nature, than for the other publishers (Table 7). Outputs published with ACS and Taylor & Francis have the lowest OA levels. Among the other publishers and Springer Nature, gold OA is the most common OA type, while hybrid and green OA are less important. For the big publishers other than Springer Nature, green OA is the most common type. The Finnish publishers taken together are quite comparable to Springer Nature in terms of output size as well as OA share and type of output: They account for 11.6% and 12.9%, respectively, of the Finnish universities' peer-reviewed publication output, and 28% of their outputs are published via the gold OA route. Among the Finnish publishers, however, hybrid and green OA play a less important role. Overall, the OA share among Finnish publishers' peer-reviewed outputs (36.3%) is close to the average among all publishers (33.6%).

DISCUSSION
In this paper we show that it is possible to base a national estimate of the number and share of OA publications on data from institutional CRIS providing comprehensive coverage of the universities' peer-reviewed output, including all publication types and languages. In addition to the national OA estimate, it is important to investigate the institutional CRIS data also from the international perspective, because it potentially contributes to a large-scale data infrastructure needed for assessment and monitoring of OA across countries, for example at the European level. Our data source is the VIRTA Publication Information Service, which integrates at national level publication data from the different types of commercial and noncommercial CRIS solutions of the Finnish universities. Our data set consists of 48,177 unique peer-reviewed outputs (articles in journals, proceedings, and books, as well as edited volumes and monographs) authored at the 14 Finnish universities published in 2016-2017. Based on the VIRTA data we investigated the following aspects:
(a) the added value of institutional data compared to WoS and Scopus in terms of publication output coverage
(b) the OA share of outputs across different fields, publication types and languages
(c) coverage and information value of international sources for gold and green OA journals
(d) the dominance of the largest international commercial publishers

What Is the Added Value of Institutional Data for OA Monitoring?
According to our analysis, Scopus journals cover only 62%, and WoS journals 52%, of Finnish universities' peer-reviewed outputs registered in VIRTA, ranging from 89% and 77%, respectively, in Medicine to 21% and 14% in the Humanities. This demonstrates the main added value of institutional CRIS data: It is practically the only existing source that is able to provide close to complete criteria-based coverage of peer-reviewed publications of an institution across all fields, and, if integrated at the national level, of a country's higher education institutions. All the alternative information sources, notably international databases like WoS, Scopus, Google Scholar, Microsoft Academic, OpenAIRE, Crossref, and Dimensions, have a more or less restricted or biased coverage of publications (Aksnes & Sivertsen, 2019; Martín-Martín et al., 2020; Visser et al., 2020). Notably, institutional data provides comprehensive coverage of peer-reviewed publications that are very difficult to cover in other sources: journal articles in regional and local journals, conference and book publications (chapters, edited volumes, and monographs), as well as outputs in languages other than English. Further research is needed to investigate the coverage of different information sources, including institutional data.
In addition, institutional CRIS data can provide criteria-based OA status information for all peer-reviewed publications, including outputs not included in the international databases. In the case of VIRTA data, OA status is self-reported by researchers and validated by data collection personnel at universities for all peer-reviewed outputs. Even if self-reported OA status is susceptible to inaccuracy due to individual interpretation of complicated OA mechanisms and variation in institutional data collection and validation procedures, CRIS data can offer an alternative and complementary methodology for determining OA status of publications. This is important, as only 67% of the peer-reviewed outputs in VIRTA had a reported DOI, which is a requirement for inclusion in Unpaywall and, hence, analyses based thereon. Our analysis shows that DOI availability is particularly limited in the case of book publications (22%) and those in Finland's national languages (3%). In all, DOIs cover 69.6% of all OA outputs as reported in VIRTA, and DOAJ-indexed journals only 35.6% of the OA outputs identified based on VIRTA, so these two methodologies, which are frequently used to identify OA outputs, would lead to a partial picture of OA in Finland. In our view, it is important that national scholarly publishers of journals and books operating in languages other than English also seek inclusion in the DOAJ and Sherpa/Romeo services, make use of DOIs, and submit as rich metadata as possible to Crossref.
In addition to providing a more complete picture of OA at the national level, institutional CRIS data can also be used to study and understand representativeness and bias in the OA measurements based on less comprehensive international sources and different methodologies for OA assessment and monitoring. According to VIRTA data, 33% of the Finnish articles published in WoS journals and 35% in Scopus journals are OA. This result, based on researcher self-reports validated by the data-collection personnel at universities, is fairly close to OA levels established for Finland in some previous studies: 32% in 2016 based on WoS in Bosman and Kramer (2018) and 41.6% in 2017 based on Scopus in the European Open Science Monitor. The higher OA share in the European Open Science Monitor could be due to the fact that OA shares are rapidly increasing in Finland via the hybrid and green routes (Ilva, 2020), and not all outputs that are currently OA in repositories had been self-archived or openly available at the time our data was reported to VIRTA (see section 3). Nevertheless, our analysis shows that the OA share is much higher, 52%, among peer-reviewed articles published in journals not indexed in WoS and Scopus. This finding suggests that OA monitoring based on WoS and Scopus is not only based on a limited subset of publications, but may also underestimate the OA share of journal articles at country level.
What Is the Share of OA Outputs across all Fields, Publication Types, Languages, and OA Mechanisms?
Taking all peer-reviewed outputs published in 2016-2017 into account, the share of OA at the national level in Finland based on VIRTA data is 33.6%, ranging from 39.2% in the natural sciences to 25.5% in engineering. It is difficult to compare this result directly with international studies because OA levels can change quite rapidly at the country level due to national and institutional policies, incentive structures, and services for promoting OA. According to a recent analysis by Ilva (2020), which is also based on VIRTA data, the OA share of journal, conference, and book articles has more than doubled from 28% in 2016 to 65% in 2019. Another challenge is that OA definitions may differ between studies using different methodologies, and the selection of peer-reviewed publications may also differ according to the data source used.
Our analysis also shows that OA shares differ considerably between different publication types and to a lesser extent between publication languages. Overall, the share of OA is larger among journal articles (38.2%) than among conference articles (28.6%) and book publications (16.5%). Our analysis also shows that the two dominant publication languages of the Finnish universities are English and Finnish (covering 88.8% and 8.9% of all outputs), and that a larger share of English (34.2%) than Finnish (28.3%) language publications are OA. Thus, the somewhat lower OA share in Engineering is explained at least partly by the importance of conference articles, while book publications and Finnish language publications contribute to a lower OA share in the SSH fields.
VIRTA data also contain information about the OA mechanism based on the publication channel: For each publication, it is reported whether it is openly available immediately on the publisher's website in either a gold or hybrid OA channel, and whether it has been self-archived in an OA repository. Overall, the gold OA channel is the dominant OA route across all fields, accounting for 19.3% of peer-reviewed outputs. In addition, 5.4% of outputs are OA in a hybrid OA channel, and 8.9% are OA only via repositories. The lower OA share in Engineering and SSH fields may also be partly explained by the gold and hybrid OA channels appearing to be less used there, perhaps because gold OA journals are considered less prestigious or perhaps due to limited resources for APCs and additional OA fees.
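The route breakdown described above can be illustrated with a minimal Python sketch. It assumes each record carries two hypothetical fields, `publisher_oa` ("gold", "hybrid", or None) and `self_archived` (True or False); these names are illustrative only and are not the actual VIRTA schema. The sketch counts a record as green only when it is not already open at the publisher, mirroring the "OA only via repositories" category.

```python
from collections import Counter

ROUTES = ("gold", "hybrid", "green", "closed")

def oa_route(record):
    """Classify one record by OA route (field names are hypothetical)."""
    if record["publisher_oa"] in ("gold", "hybrid"):
        # Open at the publisher takes precedence over self-archiving.
        return record["publisher_oa"]
    if record["self_archived"]:
        # Open only via a repository.
        return "green"
    return "closed"

def route_shares(records):
    """Return the share of records per route as fractions of all records."""
    counts = Counter(oa_route(r) for r in records)
    total = len(records)
    return {route: counts[route] / total for route in ROUTES}

# Toy data for illustration only; not taken from VIRTA.
sample = [
    {"publisher_oa": "gold", "self_archived": False},
    {"publisher_oa": "gold", "self_archived": True},
    {"publisher_oa": "hybrid", "self_archived": False},
    {"publisher_oa": None, "self_archived": True},
    {"publisher_oa": None, "self_archived": False},
]

print(route_shares(sample))
```

Because the categories in such a scheme are mutually exclusive, the route shares add up to the overall OA share, as they do in the reported figures: 19.3% + 5.4% + 8.9% = 33.6%.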
One of the advantages of the institutional CRIS data is that it shows the complete picture of journal and book publishing profile as well as the role of OA journals and book publishers. During 2016-2017, researchers at the Finnish universities used 10,342 different publication channels as outlets for their research, including 9,500 journals/series and 842 book publishers. In 25% of the channels used, all Finnish outputs are reported as being OA in VIRTA, in 25% of the channels only part of the outputs are OA, and in 50% of the channels no OA outputs via gold, hybrid, or green routes were reported in VIRTA. The same pattern is observed, more or less, across all fields.
While countries strive to achieve national and international OA targets, the strategies often involve changing the publishing landscape by means of replacing currently used closed channels with gold, hybrid, and green OA channels, or by making those channels allow different OA routes. Different OA routes indeed lead to quite different OA levels, as the OA share of articles is 79.9% in journals identified in VIRTA as gold OA channels, while being only 33.3% in hybrid and 27.6% in green OA journals. The same pattern is visible also in the case of book publishers; however, the overall share of OA outputs is much smaller than among journals: only 31.6% in the case of book publishers identified in VIRTA as gold OA, and 15% and 9% respectively for hybrid and green book OA publishers.
As we pointed out above, the correctness of the self-reported OA status of outputs is open to doubt. This is because a large number of researchers reporting OA status may interpret and understand OA mechanisms differently, and validation of the reported OA information by data-collection personnel may work differently in different organizations and units (Azeroual & Schöpfel, 2019; van Leeuwen et al., 2016). Our analyses indeed highlight certain inconsistencies in the self-reporting and validation of the OA information to VIRTA. There is, first, uncertainty about the OA mechanisms of the publication channels, as some channels have been identified inconsistently as supporting gold or hybrid OA routes. Second, all outputs published in channels categorized as gold OA in VIRTA should be immediately openly available. Yet our analysis shows that there are channels identified as gold OA with outputs that are not indicated in VIRTA as being openly available. Further research is needed to investigate the accuracy of self-reported OA status of outputs, and to compare results for example with Unpaywall.
There are several possible explanations for the observed discrepancies in the OA status of publications and channels: Some outputs have simply been incorrectly identified as OA, some hybrid channels have been mistaken for gold OA channels, and some channels may not have been gold OA during the entire period of 2016-2017. The low share of OA outputs for book publishers identified as gold OA suggests that identifying OA categories is particularly problematic for book publications (monographs and articles in books). One important aspect to consider is also that while OA policies have mostly focused on journals, OA mechanisms and definitions for book publications remain underdeveloped and there are no comprehensive international information sources that researchers and data-collection personnel could use to identify gold, hybrid, and green OA book publishers. Our findings highlight the need for an international register of academic/scholarly book publishers that would contain information, like DOAJ and Sherpa/Romeo, on their peer-review practices, as well as open access status and self-archiving policies (Giménez-Toledo, 2020).
What Is the Coverage of Information Sources for Gold and Green OA Journals?
The DOAJ and Sherpa/Romeo are information sources frequently used by researchers, libraries, and policy-makers to identify gold and green OA journals. Indeed, research funders behind the Plan S initiative also rely on DOAJ and Sherpa/Romeo as international services to identify high-quality gold and green OA channels. The Bielefeld list is also increasingly used in libraries as an information source for gold OA journals. According to our analysis, however, neither DOAJ nor the Bielefeld list provides a complete picture of gold OA publishing (see also Björk, 2019; Bruns et al., 2019). Among the journals used by the Finnish researchers we identified 2,553 potential gold OA journals, of which DOAJ covers 48%, the Bielefeld list an additional 15%, and 37% are identified based on VIRTA data. Most gold OA journals identified based on the Bielefeld list and especially VIRTA may not, however, fulfil all the DOAJ inclusion criteria that are set to qualify journals following the best international standards and practices of gold OA publishing.
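The source breakdown above (DOAJ, then the Bielefeld list as an addition, then VIRTA only) can be read as a priority-ordered attribution, where each candidate journal is credited to the first source that lists it. The Python sketch below illustrates that reading under stated assumptions: the function names, the journal identifiers, and the priority order itself are illustrative and are not taken from the study's actual matching procedure.

```python
from collections import Counter

def attribute_gold_source(journal_ids, doaj, bielefeld, virta_gold):
    """Credit each candidate gold OA journal to the first source listing it.

    Assumed priority: DOAJ, then the Bielefeld list, then VIRTA only.
    Journals found in none of the sources are left out of the result.
    """
    attribution = {}
    for jid in journal_ids:
        if jid in doaj:
            attribution[jid] = "DOAJ"
        elif jid in bielefeld:
            attribution[jid] = "Bielefeld"
        elif jid in virta_gold:
            attribution[jid] = "VIRTA only"
    return attribution

def source_shares(attribution):
    """Return each source's share of the attributed journals."""
    counts = Counter(attribution.values())
    total = len(attribution)
    return {source: n / total for source, n in counts.items()}

# Toy identifiers for illustration only.
journals = ["A", "B", "C", "D"]
result = attribute_gold_source(
    journals,
    doaj={"A", "B"},
    bielefeld={"B", "C"},   # "B" is already credited to DOAJ
    virta_gold={"A", "D"},  # "A" is already credited to DOAJ
)
print(source_shares(result))
```

With this kind of attribution the shares are mutually exclusive and sum to 100%, which matches the form of the reported figures (48% + 15% + 37%).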
Our analysis of the VIRTA OA data suggests that inclusion of journals in DOAJ is the best indicator of gold OA journals and a good predictor of high OA level, as 95.9% of outputs in DOAJ-indexed journals are OA also in VIRTA. For the Bielefeld-listed journals the OA share of outputs is also high, 77.9%, while for the journals identified as gold OA based only on VIRTA it is only 54%. This finding indeed suggests that identification of gold OA journals based on self-reported OA information in VIRTA is less reliable than DOAJ and the Bielefeld list (probably some hybrid journals are mistakenly identified as gold OA journals), or that researchers and data-collection personnel rely mostly on DOAJ for identification of gold OA status of journals.
The majority of journals/series used by the Finnish researchers (79%) have a self-archiving policy registered in Sherpa/Romeo. Analysis of outputs published in these journals shows that only a relatively small share is indicated as OA in VIRTA, irrespective of the self-archiving policy indicated by color-coding, unless the journal also provides OA via the gold route (DOAJ-indexed or Bielefeld-listed journals). This is likely due in part to color codes having become less useful for summarizing journal self-archiving policies, as a lot of additional restrictions have been introduced by publishers (Gadd & Troll Covey, 2016), and in part to unused potential for permitted self-archiving (Björk et al., 2014; Laakso, 2014). Our results confirm that there is indeed considerable potential for advancing OA via the green route. It remains to be seen if OA incentives, such as the extra weight for open access publications in the Finnish universities' core funding model, might help to increase self-archiving activity. This development can be comprehensively monitored across all higher education institutions, fields, and publication types only by using the VIRTA data (see Ilva, 2020). The new Sherpa/Romeo service, in which the color codes have been replaced with information on the availability and conditions of the different OA routes, provides an information source for the identification of gold, hybrid, and green journals.

How Dominant Are the Largest International Commercial Publishers?
Most attention in the national and international OA policies is focused on the large international commercial publishers (Elsevier, Springer Nature, Wiley-Blackwell, Taylor & Francis, Sage, and ACS) that, according to recent analyses, cover more than half of the international journal publishing indexed in WoS (Larivière et al., 2015). Our analysis based on the VIRTA data shows that the "big" publishers also play a dominant role in Finland, accounting for 53% of peer-reviewed journal output and 44% of all outputs, including conferences and book publications (cf. Guns, 2018). Thus, our analysis of the VIRTA data suggests that WoS data, focused on an international subset of journal articles, somewhat exaggerates the role played by the big publishers. This is seen most clearly in the case of the social sciences, which according to Larivière et al. (2015) is among the fields most dominated by the big publishers. According to VIRTA data, based of course only on outputs from Finland, the social sciences are, together with humanities, the fields least dominated by the big publishers.
In this study we were able to contrast, because of the comprehensive coverage of peer-reviewed outputs in VIRTA, the output of big publishers with that of the small-scale and not-for-profit journal and book publishers operating in Finland (Late, Korkeamäki, et al., 2020): Their combined output amounts to 11.6% of the Finnish universities' peer-reviewed publications. Thus, the Finnish publishers' share is comparable in size to some of the largest international commercial publishers, such as Elsevier (14.4%), Springer Nature (12.9%), Wiley-Blackwell (6.8%), and Taylor & Francis (6.6%). The national publishers are used in all fields, but their role is especially important in the SSH. In the humanities, the share of outputs published with Finnish publishers (35.5%) is even larger than that of the "big" publishers (18.4%). They play a unique role in the scholarly communication of Finnish researchers by publishing peer-reviewed research in the national languages, and they also play a major role in publishing scholarly books.
Transformative "read-and-publish" agreements with the largest international publishers can significantly advance OA at the national level, including in Finland. Yet, it is important to remember that these are only a partial solution. In all fields, and especially in the SSH, the advancement of OA also requires that gold, hybrid, and green OA publishing models are adopted by a large variety of relatively small journal and book publishers operating in international and national contexts. In Finland, most OA journals do not charge authors APCs. A new Diamond Open Access study commissioned by cOAlition S is creating a global overview of this OA publishing model. Our analysis of the publisher shares strongly suggests that the Finnish research community cannot meet the international and national OA targets if immediate open access to peer-reviewed content is not secured in a sustainable way for journals and books published in Finland.

Contribution of Institutional CRIS Data to International Publication Infrastructure
Finally, we discuss the potential for using institutional data in cross-country comparisons, notably in monitoring publication activities and open science at the European level. As the European Commission has noted, the current conditions for constructing the European Open Science Monitor, based on the data provided by Elsevier (Tennant, 2018), are nonoptimal: "Overall, the Commission wishes to have an as comprehensive Monitor as possible. … as long as there is in the European Union no fully open and transparent data-infrastructure, we are dependent on a fragmented data infrastructure and data sources from private operators" (cited in Waltman, 2019, p. 5).
Our study confirms that Elsevier's Scopus provides only limited and biased coverage of the publications of Finnish universities. OpenAIRE and Crossref are in our view important building blocks of a comprehensive large-scale European infrastructure for publication information that is independent of private operators. In this study, we did not directly compare coverage of the VIRTA data with OpenAIRE or Crossref, but our findings strongly suggest that their coverage of peer-reviewed outputs across fields, publication types, and languages is far from comprehensive. This is because OpenAIRE mainly depends on the availability of documents in OA repositories, and Crossref on publishers using DOIs. The added value of integrated CRIS data is that it can provide well-structured and curated metadata, including OA information, of all peer-reviewed publications even if they are not included in WoS and Scopus, are not in digital format, do not have DOIs, and are not openly available on the internet.
Institutional publication data, which in many countries is already integrated at the national level in services such as VIRTA, is the only source of publication information to complement OpenAIRE and Crossref with a comprehensive picture of European research and open access development across all fields, publication types, and languages. According to recent surveys, over 20 European countries already have national publication databases that go beyond WoS and Scopus (Sīle et al., 2017, 2018), and hundreds of universities and research organizations have institutional CRIS systems, from which publication information could be integrated into an international infrastructure. Ideally, an international infrastructure should also offer countries and institutions without CRIS a service for inputting their publication information (Puuska, Nikkanen, et al., 2020). In addition to being comprehensive, institutional data is independent of private operators, and governments, institutions, and researchers across Europe already invest much time, effort, and resources in producing it.
A comprehensive European infrastructure for publication data has been called for during the past decade in several policy documents (European Commission, 2010; Lauer, 2016; see discussion in Sivertsen, 2019). In 2014, a report to the European Parliamentary Research Service recommended "the development of a European integrated research information system … having features of a distributed infrastructure, inter-connecting the existing national research information systems" (Mahieu et al., 2014). A proof of concept of a European publication infrastructure integrating data from six institutions across four different countries has already been carried out in the framework of EU COST-Action ENRESSH (www.enressh.eu). Nevertheless, there is still a lot of work to be done to improve the standardization and interoperability of CRIS data to build large-scale international solutions that can compete with commercial bibliometric databases (Puuska et al., 2018). It is an additional challenge to produce comprehensive and comparable OA information on all types of outputs. Self-reports by researchers and validation by data-collection personnel, such as used in VIRTA, offer one possible solution to complement other information sources, such as DOAJ, Sherpa/Romeo, Unpaywall, and OpenAIRE.
Yet there are important policy reasons for European stakeholders to further invest in the development of comprehensive publication data. "Open Science" in the title of the European Monitor entails a broad understanding of research impact. The main international responsible metrics statements endorsed by the European Open Science agenda, namely DORA (https://sfdora.org/), The Leiden Manifesto (Hicks, Wouters, et al., 2015), and Metric Tide (Wilsdon, Allen, et al., 2015), call for diversity of outputs to be taken into account in research evaluation (European Commission, 2018). EU policies for Responsible Research and Innovation (RRI) promote broad access to research, interaction between science and society, and public understanding of science (Gerber, Forsberg, et al., 2020; Novitzky, Bernstein, et al., 2020). This requires that many different output types and languages are used in the dissemination of research results to all sectors of society (Sivertsen, 2018b). As the European University Association (EUA) states in support of the Helsinki Initiative on Multilingualism in Scholarly Communication (www.helsinki-initiative.org), "Multilingualism is particularly relevant for Europe, as its research is characterized by geographic, cultural and linguistic diversity and the common principle of excellence" (EUA, 2019; Kulczycki et al., 2020). We argue that the large-scale data infrastructure for monitoring Open Science at the European level should reflect its geographic, cultural, and linguistic diversity. Only institutional publication data, integrated at the national and international levels, can provide the needed comprehensiveness.

CONCLUSIONS
In this paper we show that institutional publication data provides an invaluable information source in terms of output coverage for assessing the number and share of OA publications at the national level (in our case, Finland). We also argue that institutional data should be used to complement other information sources, such as OpenAIRE and Crossref, in OA monitoring across countries (e.g., at the European level). This is important for two reasons. First, institutional publication data, integrated at the national and international levels, are the only source that can provide a comprehensive picture of European research and OA development across all fields, publication types, and languages. Second, such data can also be used to analyze and test the representativeness of OA assessments based on less comprehensive international sources, such as WoS, Scopus, Google Scholar, Microsoft Academic, Dimensions, Crossref, and OpenAIRE.
Compared to earlier studies contributing towards national-level OA measurement, the methodology of this study is unique: It avoids the limitations of using only WoS or Scopus-indexed journal publications like van  and Martín-Martín et al. (2018), and at the same time includes investigations of article-level OA mechanisms, through either self-reported information or matching to external OA information sources, that have been missing from earlier CRIS-data based studies (e.g., Mikki, 2017; Mikki et al., 2018). Beyond this there is still unused potential for future research and practice to improve the flexibility, fidelity, and reliability of the self-reported OA data, as well as to explore the use of additional external data sources for OA detection, such as Unpaywall. With the publication data environment still being fragmented and under constant development, the best OA data can likely be produced by matching top-down and bottom-up approaches to identification.
We conclude that national publication data provide valuable and unique information on OA of peer-reviewed outputs. To enhance comprehensive and comparable monitoring of OA we recommend the development of well-structured and comprehensive national and international publication information sources, something which should be seen as integral to working towards open science in both policy and practice (Biesenbender et al., 2019).