Research Data Management Implementation at Peking University Library: Foster and Promote Open Science and Open Data

Research Data Management (RDM) has become increasingly important for academic institutions. Using the Peking University Open Research Data Repository (PKU-ORDR) project as an example, this paper reviews a library-based, university-wide open research data repository project and the related RDM services implementation process, including project kickoff, needs assessment, partnership establishment, software investigation and selection, software customization, and data curation services and training. The review also discusses issues revealed during the stages of the implementation process, such as awareness of research data, demands from data providers and users, data policies and requirements from the home institution, requirements from funding agencies and publishers, collaboration between administrative units and libraries, and concerns from data providers and users. The significance of the study is that it offers an example of creating an Open Data (OD) repository and RDM services for other Chinese academic libraries planning to implement RDM services for their home institutions. The authors have also observed that, since the PKU-ORDR and RDM services were implemented in 2015, the Peking University Library (PKUL) has supported numerous researchers throughout the entire research life cycle, enhanced Open Science (OS) practices on campus, and influenced the national OS movement in China through various national events and activities hosted by the PKUL.


users, data policies and requirements from home institution, funding agencies, and publishers, the collaboration between administrative units and libraries, and concerns from data providers and users.
The significance of the study is that the PKU-ORDR offers a successful example of creating an OD repository and RDM services for other Chinese academic libraries planning to implement RDM services for their home institutions. The authors have also observed that, since the PKU-ORDR and RDM services were implemented in 2015, the PKUL has supported numerous researchers throughout the entire research life cycle, enhanced OS practices on campus, and influenced the national OS movement in China by hosting various national events and activities.

RELATED WORKS
The related works will focus on some aspects that this paper will address such as RDM and academic libraries & librarians, RDM and open research data repositories and systems, collaborations between research units and libraries, service support and promotions, repository implementation, data curation, and research support staff's or librarians' skills training.
Tenopir et al. [6] pointed out that as science becomes more collaborative, data-intensive, and computational, academic researchers face a series of data management needs. Meanwhile, Moon's study [14] shows that research funding agencies require researchers to provide data management plans (DMPs) when they apply for a grant, and publishers also require researchers to provide data when publishing research results. Curdt's study [15] indicated that science conducted in cross-institutional, interdisciplinary, and long-term research projects requires active sharing of data, documents, and further information. Thus, RDM services should be established to support researchers throughout their entire research studies.
Tenopir et al. [6] also claimed that academic libraries may be ideal centers for RDM service activities on campuses. Cox et al. [10] reported an international study of RDM activities, services, and capabilities in higher education libraries. Their study found that libraries have provided leadership in RDM, particularly in advocacy and policy development. However, services provided by libraries are still limited, focused especially on advisory and consultancy services. Tripathi et al. [16] studied the RDM services implemented by different university libraries in India for managing, organizing, curating, and preserving research data generated at their universities' departments and laboratories for data reuse and sharing and suggested a model for the university libraries to follow for actually deploying RDM services.
Johnston et al. [17] compared six institutions' RDM support levels within the Data Curation Network project and developed a shared staffing model for data curation across multiple institutions to support their researchers to meet their data-sharing goals through library-based data repository and curation services. Lee et al. [18] interviewed some American university institutional repositories (IRs) staff and then provided a rich, qualitative description of research data curation and use practices in IRs. In particular, Lee et al. identified data curation and use activities in IRs, as well as IR structures, roles played, skills needed, contradictions and problems exposed, solutions sought, and workarounds applied.
Curdt and Hoffmeister [19] shared their design and implementation of RDM services for a multidisciplinary and collaborative research project. McKinney et al. [20] described how Harvard University established a diffraction data publication system, the Structural Biology Data Grid (SBDG, data.sbgrid.org), to preserve primary experimental data sets supporting scientific publications. All data sets published through the SBDG are freely available to the research community under a public domain dedication license, with metadata compliant with the DataCite Schema (schema.datacite.org). They also described how the SBDG collaborated with the Institute for Quantitative Social Science at Harvard University to extend the Dataverse (dataverse.org) open-source data repository system to structural biology data sets.
Mannheimer et al. [21] described how data repositories and academic libraries can partner with researchers to address some of the challenges associated with ethical and lawful qualitative data sharing. Dovidonytė's study [22] described the Lithuanian landscape of OS policies and institutional involvement in OS practices. The author also discussed prerequisites for sustainable and consistent OS implementation such as OS infrastructure, incentives for researchers, research assessment, and repositories' compliance with the European Council requirements on a national level.
Pontika [23] found that academic libraries have created new academic librarian positions to support OS, OD, scholarly communication, and RDM on their campuses. However, researchers are still unfamiliar with RDM best practices, and research support staff, including librarians, are faced with the difficulty of providing support to researchers across different disciplines and career stages [24]. Alonso-Arévalo [25] agreed that the management of research data is one of the major challenges facing scientific and research libraries in the coming years. Already half of the American universities have a work plan on this issue, and all trend reports agree that RDM will be one of the priorities and future issues to be taken up by research libraries. Söderholm et al. [26] found that a network-based collaboration model that fosters individuals' interconnectedness is crucial for coping with the built-in dynamism of RDM. Tang and Hu emphasized in their study [27] that for growing RDM services, institutional commitment to resources and training opportunities is crucial. As an emergent profession, data librarians need to be nurtured, mentored, and further trained.
All of these studies have provided useful theoretical and practical insights and examples for us, and they also showed some of the challenges and issues that researchers, academic libraries, and librarians face in the RDM implementation process.

Kick-Off and Needs Assessment
To kick off the PKU-ORDR project, the PKUL conducted a campus-wide survey in 2013 to better understand the RDM needs and requirements of researchers and research teams. The purpose of the survey was to identify the real needs of researchers and collect data from them so that the PKUL could create a strategic roadmap and steps toward a framework or platform to meet those RDM needs. The analysis results were summarized and published in the Journal of Library and Information Service [28]. The survey focused on the following aspects: awareness and current practices of RDM, including data preservation, sharing, and reuse; description and features of research data; the current state of RDM; and expectations of RDM services. The survey results showed that 87.5% of respondents were willing to share research data under certain conditions. Their biggest motivation for sharing was that they recognized the value of sharing data, the positive relation between data use and citations, data visibility, and the credit awarded to data providers. However, the biggest concern for researchers was plagiarism.
The PKUL also interviewed 23 research teams from multiple disciplines on the campus. The face-to-face communication with the teams helped uncover more valuable information about the current state of RDM, including long-term preservation, data sharing, and data reuse. Zhu et al. [28] summarized three major findings from the interviews: (1) Research data sharing behavior is significantly influenced by discipline. For example, biology is a data-driven and data-intensive discipline in which open access is already a common best practice and data sharing standards and norms are already well established.
(2) An embargo period with data sharing is generally expected and required. Almost all researchers being interviewed emphasized that their data should be shared after their results are formally published, which addressed the concern of possible plagiarism. (3) Data sharing behavior is more spontaneous and passive than active and lacks proper incentives and necessary maintenance, as well as a well-established mechanism for data citation, recognition and credits, and feedback from data users.
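The embargo expectation in finding (2) can be expressed as a simple release rule. The following is a minimal sketch, not the PKU-ORDR implementation; the function name, its parameters, and the idea of a fixed embargo length in days are assumptions made for illustration.

```python
from datetime import date, timedelta
from typing import Optional


def is_openly_available(publication_date: Optional[date],
                        embargo_days: int,
                        today: date) -> bool:
    """Return True once the related results are published and the embargo has passed.

    Hypothetical rule: data stay closed until the paper is formally
    published, then open after an additional embargo period.
    """
    if publication_date is None:
        # Results have not been formally published yet, so data stay closed.
        return False
    release_date = publication_date + timedelta(days=embargo_days)
    return today >= release_date
```

Under this sketch, an embargo of zero days releases the data on the publication date itself, which matches the researchers' minimum requirement that sharing follow formal publication.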
The interviews also revealed that the data management needs of researchers in different disciplines vary greatly. Bioinformatics researchers need very large data storage so that the large amounts of process data generated from their experiments can be preserved. Researchers from Computer Science are willing to share their data; however, Computer Science data sets are often very large, which makes data sharing more expensive. For example, the volume of the Chinese Web data set collected by the Institute of Network Computing and Information Systems in the past ten years is above 100TB. Researchers from Business hope they can obtain more valuable enterprise and government data that can be used in their classes and research. Researchers from the Institute of Social Science Survey (ISSS) hope to maximize the value of their survey data as much as possible through data sharing; however, the ISSS established a relatively strict user application procedure for users to access data. Faced with so many different data management needs, the PKUL, as an initial attempt, analyzed the data needs based on priorities and decided to build an initial service infrastructure to meet the needs with the highest priorities. Due to the variety of process data associated with different disciplines, the PKUL decided to focus on data closer to the final state and easier to share, and to collaborate with institutions inside and outside the campus to build the PKU-ORDR, making data easier for PKU researchers to access, reuse, and share.

The Establishment of the Collaborative Model
As one of the most active advocates of RDM at the University, the PKUL made numerous efforts to convince various university research administrative units to invest in and support the creation of a library-based RDM framework. The PKUL also sought potential partners within the University, since cooperation and collaboration with administrative units and other units on campus are vital to the success and sustainability of the project.
The PKUL finally selected the Institute of Social Science Survey (ISSS) as a working partner to cooperate and collaborate on the development of the RDM services. The ISSS was created to act as a social science data survey coordinator and interdisciplinary empirical research platform that enables Peking University as well as other research institutions around the world to study China's social problems and conduct social science research, mainly through undertaking large-scale social survey projects and sharing the survey data openly. So the ISSS was an ideal collaborative candidate for the Library. The ISSS also plays a leading role on campus to provide workshops and training classes in data access, curation, and methods of analysis for the social science research community.
In 2014, Peking University was awarded a grant by the National Natural Science Foundation of China for the China Survey Data Archive (CSDA) project, which aimed to develop a data repository administered by the University Management Science Data Center (MSDC), a department within the ISSS. This grant provided an opportunity for the PKUL to build a more collaborative relationship with the ISSS. With the assistance of research administrative units such as the Office of Science Research and the Office of Social Science Research, the PKUL and the ISSS decided to work together on this project. Initially, the responsibilities were split as follows: the MSDC, supervised by the ISSS, was responsible for research data collection and cleaning, standardization and analysis, data repository platform testing, and feedback. The PKUL was responsible for requirements analysis, functional design, software selection, as well as the development and maintenance of the data repository, data storage, classification and metadata, systems administration, and associated technical services.
However, through the analysis of the data collected from the survey and interviews, the ISSS and the PKUL soon discovered an opportunity to build a strong national showcase for OD, because only a very limited number of subject-specific or research team-oriented data storage facilities and data services were available at either the institutional or the national level. The initiative was named the PKU-ORDR (PKU Open Research Data Repository) project and became a sub-project of the CSDA project. Its goal was to develop an infrastructure that supports PKU researchers in managing their data more effectively and efficiently and to provide RDM services ranging from storage to consultation.
The strategic objectives of the PKU-ORDR are summarized below:
· To publish high-quality research data and disseminate academic outputs through an open platform;
· To promote OS, facilitate data sharing and reuse, and encourage the reproduction of research;
· To enable and track data citations and usage metrics;
· To explore data publishing and long-term preservation solutions;
· To foster innovation and cross-disciplinary integration.
In addition to the ISSS, the PKUL also cooperated with other internal units and external organizations to enrich the data content of the PKU-ORDR. Through collaboration with the Center for Bioinformatics of Peking University, the PKUL created linked data in the PKU-ORDR linking to the Bioinformatics database. Through cooperation with the Beijing Information Resources Management Center, the PKU-ORDR interoperated with the Beijing Government Data Resource System (BGDRS) so that the registered users in the PKU-ORDR can download data from the BGDRS directly. Through cooperation with the National Information Center, the PKU-ORDR collected some valuable enterprise data sets across the country. All these collaborations have greatly enriched the data content and expanded the disciplines' scope of the PKU-ORDR.

The Establishment of the Open Research Data Repository
The first step of the PKU-ORDR was to create an open research data repository to meet the needs of data storage and data sharing. The establishment of the data repository includes selecting software as a framework and customizing the software.

Software Selection
There were some types of RDM software available at that time, including various institutional repositories. The implementation team adopted a software metrics tool and created criteria to evaluate and assess these software solutions. Some general criteria were considered, such as business and industry expertise, market knowledge, program/project management capabilities, methodology, communications, and independence and objectivity. In addition, as shown in Table 1, four specific criteria were particularly considered: ① Metadata standard and interoperability; ② Permissions management and access control; ③ DOI identifier and version management; ④ Online analysis and visualization. It is noted that the Dataverse metadata schema consists of a compulsory citation metadata block and multiple optional discipline metadata blocks that can be easily customized. The default discipline metadata block is DDI for Social Sciences, and the Dataverse also provides several other discipline metadata blocks, such as Biomedical, Geospatial, and Astronomy and Astrophysics. Therefore, the Dataverse metadata schema is flexible enough to adapt, in theory, to any discipline. After systematic comparison and assessment, the Dataverse solution was finally chosen as the development tool. The Dataverse was originally developed by Harvard's Institute for Quantitative Social Science (IQSS), along with many collaborators and contributors worldwide. As of August 7, 2020, it had 59 installations worldwide.
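The block structure described above, one compulsory citation block plus any number of optional discipline blocks, can be modeled in a few lines. This is a minimal sketch for illustration only: the class, the function, and the list of required citation fields are hypothetical and are not taken from the Dataverse schema definition.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical set of required fields for the compulsory citation block.
REQUIRED_CITATION_FIELDS = ("title", "author", "description", "subject")


@dataclass
class DatasetMetadata:
    # The citation block is always present.
    citation: Dict[str, str]
    # Discipline blocks are optional and keyed by block name,
    # for example "social_science" or "geospatial" (names are illustrative).
    discipline_blocks: Dict[str, Dict[str, str]] = field(default_factory=dict)


def missing_citation_fields(meta: DatasetMetadata) -> List[str]:
    """List required citation fields that are absent or blank."""
    return [
        name for name in REQUIRED_CITATION_FIELDS
        if not meta.citation.get(name, "").strip()
    ]
```

A record with no discipline blocks is still valid in this model as long as its citation block is complete, which reflects why such a schema can serve disciplines that have no dedicated block.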

Software Customization
Although the Dataverse was chosen as the framework for the open research data repository, customization was a challenge. The development of the software and version release phases is shown in Figure 1. Here are the highlights of our local customization: (1) user management, (2) bilingual interface, (3) usage statistics, (4) data contests, (5) other functions such as DataCite DOI registration and data set related publications, and (6) a custom home page.

To enhance the user management function, the PKUL integrated the PKU-IAAA single sign-on system into the platform so that users can authenticate quickly and securely and gain instant access to the OD repository; the relevant patron information can also be carried into the data repository through the PKU-IAAA. Furthermore, the PKUL enabled a group download function so that users can download multiple files within one data set with one request, whereas the original Dataverse only allowed users to download one file per request. Also, the PKUL created two types of user account: a regular user account and an advanced user account. A regular user account can be upgraded to an advanced user account, with more privileges, when the user submits an application and provides the additional required information.
A bilingual interface is essential for our users since our repository is open to anyone in the world. The original Dataverse provided only unilingual descriptions. Researchers often publish their research outputs in English to increase the visibility of their research. From this perspective, English is an ideal candidate for the user interface. However, the majority of our users come from China, and Chinese is their native language and more comfortable for them to use. So the PKUL decided to make the Dataverse repository interface and metadata support both English and Chinese. The user interface can be switched between Chinese and English, and search results can be displayed in both English and Chinese. The notifications sent to users are also customized in a bilingual format.
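The group download function mentioned above amounts to bundling several files of one data set into a single response. The sketch below shows one common way to do this with a ZIP archive built in memory; it is an assumption about the general technique, not the PKUL's actual code, and the function name and the file names in the usage example are hypothetical.

```python
import io
import zipfile
from typing import Dict


def bundle_files(files: Dict[str, bytes]) -> bytes:
    """Pack several files of one data set into a single ZIP archive.

    `files` maps a file name to its raw content. The returned bytes can be
    sent back as the body of one download response.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, mode="w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    # The archive is finalized when the `with` block exits.
    return buffer.getvalue()
```

For example, `bundle_files({"codebook.txt": b"...", "survey.csv": b"..."})` yields one archive containing both files, so the user issues one request instead of two.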
Regarding usage statistics, the original Dataverse system only tracked the number of downloads, which fell far short of the statistical requirements of the PKU data providers. Therefore, the PKUL enabled log records of user applications, administrator verification, user browsing, downloads, and similar events. Elasticsearch is used to index the logs so that the data provider can query and download real-time data. Meanwhile, Baidu Analytics was implemented in the data repository pages to analyze data such as user sources, devices used, search keywords, and pages visited. Furthermore, the PKUL hosted two national data contests to promote open research data repositories. A contest module was added to the Dataverse repository to facilitate user enrollment and data use. Participants were allowed to form teams to enroll in the contest, submit their papers, and access research data directly by using their user accounts of the data repository. The contest module also provided functions such as the contest homepage and a paper display gallery.
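The kind of per-data-set statistics that such logs make possible can be illustrated without a search engine. The following minimal sketch aggregates log events in plain Python; it does not use Elasticsearch, and the event fields ("dataset", "action") and the sample identifiers are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable


def summarize_events(events: Iterable[Dict[str, str]]) -> Dict[str, Counter]:
    """Count logged actions (browse, apply, download, ...) per data set."""
    summary: Dict[str, Counter] = {}
    for event in events:
        # Create a counter for a data set the first time it appears.
        counter = summary.setdefault(event["dataset"], Counter())
        counter[event["action"]] += 1
    return summary


# Hypothetical log records, one dictionary per logged user action.
sample_log = [
    {"dataset": "dataset-a", "action": "browse"},
    {"dataset": "dataset-a", "action": "download"},
    {"dataset": "dataset-a", "action": "download"},
    {"dataset": "dataset-b", "action": "apply"},
]
usage = summarize_events(sample_log)
```

In a production setting an index would answer the same question over far more records, but the shape of the result, a count per action per data set, is what a data provider would query.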
Additionally, the PKUL added many other functions to Dataverse. Dataverse 4.0 only provided Handle identifier registration, so the PKU-ORDR adopted DataCite DOI to register data. Our module was later adopted by Harvard University and other institutions that are using Dataverse. Some data sets within the PKU-ORDR are of high quality; for example, the China Family Panel Studies data set has been cited by numerous research papers. The PKUL therefore used an API to interoperate with the PKU-IR, retrieving papers from the PKU-IR and displaying those associated with the data sets on the PKU-ORDR platform. Also, Dataverse 4.0 did not support homepage customization, but the PKUL developed a custom homepage, and this homepage customization technology has since been adopted by Harvard University. As shown in Figure 2, numerous efforts were made between 2014 and 2019, and key milestones are highlighted in the diagram.

Usage
The PKU-ORDR enhanced the PKUL's infrastructure for data storage and sharing. Through collaborating with academic departments on campus, the PKU-ORDR has collected numerous high-quality data sets. Examples include the China Family Panel Studies, the China Health and Retirement Longitudinal Study, the Beijing Area Study, the Comprehensive Language Knowledge Base, and AutismKB, an Evidence-based Knowledge Base of Autism.
As of August 2020, the PKU-ORDR has released 66 Dataverses, 305 data sets, and 2,036 data files. The total number of downloads has exceeded 620,000. The average number of daily visitors is about 500, and the average number of page views is 2,700. In recent years, visitors from more than 89 countries have visited the repository, and the top five countries are China, the United States, the United Kingdom, Japan, and South Korea. The number of registered users has reached 32,000.

Data Curation and Skills Training
To cultivate past research data for future consumption, the PKUL offered several data curation services. Collaborating with the ISSS, the PKUL hosted an RDM Seminar in 2015. The PKUL invited two experts, one from the Inter-University Consortium for Political and Social Research (ICPSR), USA, and one from the Data Archive, UK, to deliver data management training. The trainees were teachers and students from PKU, as well as from other peer universities in China. The PKUL also sent librarians to participate in relevant training activities hosted by other universities to improve their RDM curation skills. To improve students' data search skills, the PKUL offered one-hour workshops to teach students to identify data sources and use scientific methods to acquire relevant research data, statistical data, and Internet data. The PKUL also provided a series of lectures to teach students how to use data analytics tools.

Service Promotion
To improve the visibility of its data services, the PKUL promoted the use of the PKU-ORDR through various channels, including the PKU homepage, the public social media accounts of student groups, annual conferences of the ISSS, and various RDM-related domestic and international conferences. To improve data accessibility, the PKU-ORDR provided metadata to re3data.org, an international registry of data repositories; to DataCite Search and the Data Citation Index, which are data discovery systems; and to search engines such as Baidu and Google. Additionally, collaborating with other units on campus and the National Information Center, the PKUL hosted two national contests entitled "National Data-Driven Research Contests for Colleges and Universities" in 2018 and 2019 to promote the use of the PKU-ORDR and RDM services and to train students' data searching and acquiring skills.
The contest proceeded through the following stages: training workshops and lectures, enrollment, paper submission, paper evaluation, and oral defense. During the first stage, the organizers provided training on contest rules, data analysis and mining, data management and sharing, and data resources and acquisition. During the enrollment stage, the contestants registered in groups, submitted their selected topics, and applied for the research data in the PKU-ORDR. During the paper submission stage, the contestants conducted research using the data from the PKU-ORDR or original data they collected on their own, wrote essays, and submitted their papers together with the data to the organizers. During the paper evaluation stage, the organizers first conducted formal assessment and plagiarism checks for the essays, and the qualified papers were then evaluated by experts invited by the organizers. Each essay was reviewed and graded by two experts, and the papers were ranked by grade. During the on-site oral defense stage, several top-ranked teams delivered their statements and reports and were then evaluated by more than 10 experts to decide the final ranking. After the contest, the winning teams shared their research data and reported at the Jing Ling Big Data Summits, and excellent essays were published in Chinese core journals in a special topic issue.
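The paper evaluation step, in which each essay receives two expert grades and papers are then ranked, can be sketched as follows. The source does not state how the two grades were combined, so averaging them is an assumption here, as are the function name, the tie-breaking rule, and the sample scores.

```python
from statistics import mean
from typing import Dict, List, Tuple


def rank_papers(grades: Dict[str, Tuple[float, float]]) -> List[Tuple[str, float]]:
    """Rank teams by the average of their two expert grades, highest first.

    Ties are broken alphabetically by team name so the order is deterministic.
    """
    scored = [(team, mean(pair)) for team, pair in grades.items()]
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Hypothetical grades: team name -> (grade from expert 1, grade from expert 2).
ranking = rank_papers({"Team A": (80, 90), "Team B": (95, 85), "Team C": (70, 75)})
```

With these sample scores, Team B ranks first with an average of 90, followed by Team A with 85 and Team C with 72.5; the top of such a list would then proceed to the oral defense.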
The contests attracted numerous students from many other major research universities in the country. The first contest recorded an enrollment of nearly 600 teams, including about 2,000 contestants from more than 160 universities and colleges. They came from 28  Through the contests, students from different disciplines gained experience on the same OD platform. The competition also greatly promoted data-driven research paradigms and the visibility of the PKU-ORDR. Between December 2017 and May 2018 during the two contests, online visitors to the PKU-ORDR increased tenfold, registered users increased fivefold, and data downloads increased sevenfold. More and more external websites linked to the PKU-ORDR, and the ranking and exposure of the data in the PKU-ORDR improved greatly in search engines. At the same time, the original data submitted by the contestants greatly enriched the content of the data repository.

CONCLUSIONS
This paper reviewed the implementation process of the PKU-ORDR and the creation of the RDM services provided by the PKUL. Through the review, the authors found that needs assessment and collaboration are vital to the success of a library-based, university-wide RDM project. Raising researchers' awareness of OS and OD is critical. Software identification and selection is a complicated and time-consuming process, and the software must meet essential criteria such as stability and sustainability. Communication is critical throughout the whole process, particularly with administrative units and other academic units on campus. Data curation programs such as workshops, lectures, and contests can be developed to improve researchers', students', and librarians' data searching and acquiring skills and to promote services on campus and to larger research communities. RDM policies must be created and put in place. In short, this is a learning curve and a cumulative process in both theory and practice. The authors will feel rewarded if this practical paper can offer some insights to academic libraries planning to implement an OD repository and/or RDM services for their home institutions. Although the PKUL has made great efforts in RDM construction and contributed to the OS and OD communities on campus and even in China, the PKUL feels that it still has a long way to go. Many challenges and opportunities lie ahead for libraries and librarians.