Abstract
For many government departments, uncertainty aversion is a source of barriers in the advancement of data openness. A more active response to potential risks is needed and necessitates an in-depth examination of risks related to open government data (OGD). With a cross-case study in which three cases from the United Kingdom, the United States and China are examined, this study identifies potential risks that might emerge at different stages of the lifecycle of OGD programs and constructs a taxonomy model for them. The taxonomy model distinguishes the “risks from OGD” from the “risks to OGD”, which can help government departments make better responses. Finally, risk response strategies are suggested based on the research results.
1. Introduction
Government information disclosure has played an important role in the democratic development of human society since Sweden passed the Freedom of the Press Act in 1766. In particular, since the United States issued the Freedom of Information Act (FOIA) in 1966, many countries have passed laws or regulations to protect the public's “right to know”. With the advent of the era of big data, significant attention has been focused on the value of open government data (OGD) in promoting government transparency and accountability, public participation and social innovation [1, 2, 3, 4]. By 2016, more than 70 countries had joined the Open Government Partnership [5]. In China, 46 local governments had launched OGD websites as of May 2018 [6]. In January 2019, the OPEN Government Data Act of the United States was signed into law [7].
Nevertheless, along with the worldwide advance of OGD, various barriers have been reported by OGD-leading countries such as the UK [8], the Netherlands [9], and the USA [10]. A number of studies have explored these barriers. We categorize them into the following six classes according to their perspectives on OGD: (1) the data user perspective, e.g. lack of knowledge of how to access OGD data sets [11, 12]; (2) the data provider perspective, e.g. institutional barriers [13], basic resources, organizational arrangement and technical capacity [14], fear of false conclusions [9], and economic issues [15]; (3) the data perspective, e.g. fragmented data sets [16], poorly documented metadata [17], poor data quality [13, 18], poor information usability [11], poor machine-processability [19, 20], and complex data formats [21]; (4) the legislation perspective [13], e.g. difficulties in evaluating the eligibility of a data set [11], difficulties in evaluating the privacy sensitivity of a data set [10, 22], and the complexity of data copyright [23]; (5) the technology perspective [10, 13], e.g. the way in which the data are stored, obtained and used by a department [24]; and (6) the environment perspective, e.g. external pressures [25].
Some of these barriers exist in reality, while others simply derive from an insufficient understanding of the possible risks. It was reported that a database of registered usernames and email addresses of data.gov.uk had been leaked [26]. In 2018, the BBC reported that sensitive military information had been disclosed in a data visualization map published by a fitness tracking company [27]. In addition, some people were concerned that the geospatial data published on data.gov by the US Department of Agriculture might be used to locate crops targeted for eradication via infestation, or even to commit acts of biological warfare [28]. As the above cases show, the fear of potential risks may hinder the advance of OGD. The culture of risk aversion is one source of existing institutional barriers to OGD [13]. Uncertainty avoidance plays a negative moderating role in the relationship between other organizational resources and the OGD capacity of government departments [14].
The best strategy for addressing risks is understanding and managing them more effectively, rather than ignoring or avoiding them. Therefore, we aim to address the following three questions: “What kinds of risks are involved when government departments implement OGD initiatives?”, “How are these risks distributed over the lifecycle of OGD?” and “What strategies could be adopted?” To answer these questions, we conducted a cross-case study of three OGD-related cases from three countries: the discontinuation of the care.data program in the UK, the IRS data breach in the USA and the tardy progress of the OGD program in China. We then identified 14 risks and categorized them into a taxonomy model which distinguishes “risks to OGD” from “risks from OGD”. This study deepens the current understanding of OGD-related risks and brings new insights into the mechanism design for advancing OGD.
2. Literature Review and Analytical Framework
2.1 Risks Associated with OGD
Risk refers to the uncertainty of future results in a given condition and a particular period [29]. The risks related to OGD that are most frequently discussed are those pertaining to data leakage [30, 31, 32], invasion of privacy [33, 34, 35] and other information security issues. [36] summarized 11 government risks in data release: copyright, trade secret protection, privacy, the security of the infrastructure, publication of improper data or information that might lead to negative attitudes toward public institutions, inaccurate data, misinterpretation of the data, absence of data consumers, reduced willingness to cooperate, overlapping of data and an increased number of requests for data. The Office for National Statistics of the UK also proposed to better balance openness with privacy protection [37]. [38] argues that the risks posed to official statistics departments by big data are also related to mission drift, damage to reputation and the loss of public trust, inconsistent access and continuity, the fragmentation of approaches across jurisdictions, resource constraints and cut-backs, privatization and competition, etc. In addition, OGD are vulnerable to risks in terms of effectiveness, relevance and trust [39]. [15] categorized the risks that may hinder OGD as those related to governance, economic issues, licenses and legal frameworks, data characteristics, metadata, access and skills. Based on a single case study of Shanghai, [40] summarized the potential risks of OGD at the levels of legislation, management and data.
2.2 Risk Management Related to OGD
In response to risks faced by government departments, the National Audit Office (NAO) of the UK proposed embedding risk management into their core decision-making and planning management processes. The NAO put forth the notion that risk management, including identifying, evaluating, processing and reporting risks, can help governments make credible decisions and support innovation [41]. In 2013, the Committee of Sponsoring Organizations of the Treadway Commission (COSO) in the US released the Internal Control-Integrated Framework to address enterprise risk management through five components: control environment, risk assessment, control activities, information and communication, and monitoring activities [42].
[36] proposed mitigation strategies for the identified risks of OGD, including monitoring and assessment of the demand for data, proper specification of data sets to avoid duplication, compliance assessment, data anonymization and data aggregation, quality control of data publication, establishment of internal and external data catalogs, linking to data sets already published, properly formulated terms and prompts for data originating from third parties, clearly explained duties, and continuous monitoring of the impacts of OGD initiatives. [43] argues that the application of tax-related data quality control and data technology is a key element in the control of tax risks. In 2017, the British government updated the Data Protection Act promulgated in 1998 to reinforce citizens' rights concerning personal data and privacy, such as the right to be forgotten and the right to data portability, and stipulated that re-identifying individuals from anonymized data or tampering with such data would result in criminal charges [44]. In 2018, the General Data Protection Regulation of the EU came into force [45], focusing on protecting and empowering all EU citizens regarding data privacy and reshaping the way all organizations, including governments, approach data privacy.
The aforementioned regulations and studies have discussed risk management strategies associated with OGD at length, but they did not address the context in which a specific risk occurs, for example, the stage in the data lifecycle, or the source and consequence of the risk. Moreover, a single-country context may not be sufficient to reveal the variety of risks that arise across different institutional backgrounds. These shortcomings may weaken the effectiveness of the suggested strategies. In view of this, we have conducted a cross-case study in an international context from the perspective of the lifecycle of OGD.
2.3 Lifecycle Model of OGD
A lifecycle model can guide the process of opening up data [46]. Lifecycle analysis draws on the biological analysis method and divides the development process of objects (e.g. records, data, products, projects, organizations, etc.), from generation to extinction, into several stages. OGD, too, is subject to changes over its lifecycle. An analysis based on the lifecycle of OGD can help identify the risks that exist at different stages. Meanwhile, risk management itself can also be divided into three phases, namely latency, occurrence and crisis response [47], each of which necessitates different measures.
According to [2], the OGD lifecycle comprises three sections: a pre-processing section (data creation, selection, harmonization and publishing), an exploitation section (data interlinking, discovery, exploration, and exploitation), and a maintenance section (data curation). Based on an investigation conducted in the Netherlands, [46] developed a community-driven open data lifecycle model, which comprises identification of data, data preparation, data issue, data reuse and data evaluation. [48] discussed the barriers to OGD in China based on a data-centered lifecycle model, including data organization and processing, storage and distribution, discovery and acquisition, and appreciation and evaluation.
Reflecting their different research purposes, the above-mentioned OGD lifecycle models are either data-management-centered or value-realization-centered. As the risks of OGD are mainly borne by government departments, including both data providers and data users, a government-centered lifecycle model is needed. A lifecycle model of OGD should consist of at least five stages: data creation and collection, data organization, data release, data utilization and data maintenance.
2.4 Analytical Framework
Based on the literature review above and the lifecycle of OGD, we develop an analytical framework for the subsequent case study, as shown in Figure 1.
3. Research Design
3.1 Research Procedure
The present study adopts a cross-case study method. Although previous studies have discussed some OGD-related risks, these risks have neither been integrated into a whole nor detailed from the perspective of the OGD lifecycle. The case study method is adopted to bridge these theoretical gaps. Compared with a single case study, a cross-case study supports generalization and allows themes, similarities and differences to be examined across cases through quantitative or qualitative analysis. A research design involving multiple cases is generally regarded as more robust than a single-case design, as it allows a phenomenon to be observed and analyzed in several settings [49].
In this study, an analysis across three OGD cases from different countries is conducted. A content analysis method is also used to examine the lifecycle distributions of OGD-related risks. The research steps are as follows:
(1) Select three cases of OGD in the US, the UK and China based on their theoretical potential and representativeness in different stages of the OGD lifecycle;
(2) Collect case data through Internet searches and semi-structured interviews;
(3) Identify the risks associated with OGD using within-case and cross-case analysis;
(4) Categorize all identified risks into a taxonomy model;
(5) Analyze the distributions of all risks over the five stages of the OGD lifecycle;
(6) Suggest countermeasures for OGD risk management.
3.2 Case Selection
Three cases are selected for their theoretical richness and their capacity to answer the research questions and address the different stages of OGD programs. They are the UK healthcare data program “care.data,” the American Internal Revenue Service (IRS) data breach event and the Open Data initiative of Shanghai, China. The “care.data” case is typical, in that it reveals the risks in data collection and sharing. The IRS data breach is not a typical OGD case, but it reveals the mismanagement and malicious use of government data; it is therefore representative in revealing the risks that are characteristic of the OGD maintenance phase. The case of the Shanghai OGD is typical in that it reveals the risk concerns of government departments at the early stage of an OGD initiative. The three cases, which are both country- and industry-specific and focus on different phases of the OGD lifecycle, can jointly support the identification of risks in the entire lifecycle.
3.3 Data Collection
Case data are collected via Internet searches and semi-structured interviews. The former applies to the cases from the UK and the USA and covers news reports and online commentaries; the latter applies to the case of China and covers three one-to-one in-depth interviews and one focus group interview with the heads of seven government departments. The interviews lasted nine hours in total, and 55,600 Chinese words were transcribed within one week of the interviews.
3.4 Data Analysis
A bottom-up coding approach combined with a cross-case comparison is adopted to analyze the data. The data analysis is a continuous process that starts with data collection. First, all concepts or entities related to risks or real harms are identified; they are then compared and categorized within each case. Second, the results of the three cases are listed, analyzed, compared and categorized. Third, the sources and consequences of all risks are analyzed. Fourth, a taxonomy model of OGD-related risks is constructed. Finally, the distribution of the risks over the five stages of the OGD lifecycle is analyzed and strategies to address them are suggested.
4. Case Analysis
4.1 Case 1: The Discontinuation of the care.data Program in the UK
4.1.1 Case introduction
In 2013, the long-established National Health Service (NHS) of the UK and the Health and Social Care Information Centre (HSCIC) initiated a program called care.data, which aimed to improve the safety and care of patients by using information. The program also involved creating an extensive health records database whose target users included pharmacies, mental health services, opticians, dentists, and education and training institutions, and which would eventually support all healthcare facilities. The data were anonymized and covered only the patient's age range, gender and area of residence, but in exceptional circumstances, such as during a pandemic, a researcher could apply to the Minister of Health for the removal of these privacy protections. Researchers believed the data would help them develop new treatments and evaluate NHS services. However, the care.data project did not run smoothly. In February 2014, the NHS acknowledged a serious crisis of confidence regarding the project and instructed family doctors to postpone the uploading of patient data for up to six months. In the fall of 2014, the NHS decided to launch four new pilots to collect medical health data for two million patients, but the first pilot was not officially launched until June 2015. Due to poor communication with the public, such as the absence of press conferences and the failed delivery of brochures to families, the NHS's data collection was collectively denounced by patients, doctors, the British Medical Association, the privacy campaign group Big Brother Watch and the Association of Medical Research Charities; consequently, one million people withdrew from the program. Under enormous pressure, the NHS stopped care.data on July 6, 2016.
4.1.2 Risk identification of case 1
The NHS is an important source of population data, and the care.data program is a typical example of government data collection and utilization. Through the analysis of data collected from online news reports, blogs, comments, etc., the following risks are identified:
Privacy leakage risk. Health data are of a personal and sensitive nature. Mudie, an opponent of care.data, said: “The human cost to the patient whose identity and medical history are made public is potentially disastrous. Careers could be ended, jobs lost, insurance refused and relationships destroyed if sensitive medical facts are made public or used by private firms, other people or, indeed, the media” [50]. Although the patient data collected by care.data had been anonymized, data users typically combined open data with the closed and otherwise accessible data they held, and the public were worried that malicious analysts could use specific techniques to identify patients' private information [51].
Risks from implicit operational specifications. A lack of explicit operational specifications led to various difficulties in the opening and collection of data. The accusations from the public included: “The ambiguous standards for obtaining health data poses a risk of trust between doctors and patients [52].” “The regulations over data access, data inspection and balance are not yet established or implemented [53],” etc. The NHS did not systematically organize and process the collected data sets, which led to confusion in their use and to potential safety hazards when cooperating with other agencies.
Risks of improper data use, especially in cooperation with commercial organizations. The focus of public concern was that sharing sensitive medical information with commercial companies without the explicit consent of patients can be risky. The care.data program handed over coded patient data to the insurance industry to help actuaries calculate average premiums, but the public believed that the data might reveal an individual's tendency to become ill, causing applicants to be discriminated against when attempting to buy insurance. In the cooperation between the NHS and Google's DeepMind, the public were also worried that commercial organizations would use patients' private data for profit. “Any effort to ignore the use of data and only discuss open data policy will fail as they are faced with the disorder in reality and compromise in practice [53].”
Data quality risks induced by data collection methods, such as incompleteness, distortion and inaccuracy. For example, because care.data relies on general practitioners to collect patient data, data pertaining to young and healthy males may be missing. Some people obtain prescription drugs but flush them down the toilet when unmonitored, which distorts the data on the drugs' performance.
Risks from immature techniques. One of the reasons for the public's opposition to care.data is the lack of technical sophistication. Sheila Bird, a professor of statistics of Strathclyde University, stated: “Data-sharing as proposed by care.data was disastrously incompetent – both ethically and technically. Professionals rebelled and prevailed in out-casting care.data, thereby ensuring that future proposals will not succeed unless both technically proficient and in the public interest” [54].
The risk of public trust crisis prompted by the immaturity of government regulation. Of all the objections against care.data, many could be regarded as a function of the public trust crisis caused by a flawed government supervision system. For example, “when you propose to share our most confidential medical records, ambiguous promises and fictitious regulatory frameworks are disturbing to the public” [50].
The risk of poor communication. Huge external pressures on care.data stemmed from a lack of publicity and ineffective communication. While the project was valuable, it required the understanding, support and cooperation of citizens and other organizations. In November 2014, the All Party Parliamentary Group in the British Parliament investigated the care.data project, accusing it of a lack of transparency and poor publicity, which eventually led to its failure.
The risk of unsustainable funding for public communication. Between October 2013 and 2014, the NHS announced £2 million to publicize the care.data project to the general public, but the actual spending on advocacy was merely £1 million. The survey found that less than one-third of the public received publicity brochures because of insufficient investment in public communication. This eventually forced the project to cease under tremendous external pressure.
The risk of excessive external pressures. After the care.data program was announced in early 2013, some privacy organizations launched the Medical Data Confidentiality Initiative, calling attention to security risks in the use of medical data. Since then, these organizations have applied significant pressure to initiatives in the collection of medical data.
The risk of unprofessional information governance. In January 2015, the NHS's Independent Information Governance Oversight Panel released a report indicating that the initial commitment to the care.data project had not been fulfilled, partly because of the lack of experts in information governance.
4.2 Case 2: The IRS Data Breach in the US
4.2.1 Case introduction
In 2015, a data breach took place at the Internal Revenue Service (IRS) of the United States [55]. With the help of companies such as IBM, the IRS had set up a complex data platform networked with other government agencies. The site has an application called “Get Transcript” that allows citizens to easily access previous tax records. In 2015, hackers illegally accessed the tax returns of about 724,000 taxpayers via the Get Transcript app. This malicious behavior was not detected until three months later. After the investigation, J. Russell George, the Treasury Inspector General for Tax Administration, accused the IRS of not deploying its Web systems according to the requirements and recommendations, which led to serious data breaches at a time of reduced staffing and increased capital input [56].
4.2.2 Risk identification of case 2
The risk of hacking. Hacking is a constant threat to government applications, and its immediate consequences are leaks of private information, trade secrets or even security information about national infrastructure.
The risk of poorly implemented government regulations. After the implementation of the My Data project, the US Open Data Action Plan established clear requirements regarding network and data security systems for government departments, but the IRS did not follow these requirements seriously, resulting in a colossal security breach.
The risk of poor communication and cooperation between departments. After the data breach event, the IRS, the Treasury Department and other departments were at odds with each other. The poor communication and cooperation between them resulted in an inadequate response to the risk.
The risk of delayed system updates and maintenance. The maintenance and update of an open system platform must be carried out periodically; otherwise, security risks are likely to occur. In this case, some IRS applications were outdated and had many security vulnerabilities; this was coupled with poor maintenance of the systems and platforms. Together, these factors left the systems vulnerable to hacking, which remained undetected for three months.
The risk of unsustainable capital investment. According to the IRS, an increasing number of government information systems security services are being outsourced; with at least $400 million in investments delayed due to IT budget cuts, many operations, including the maintenance and replacement of old IT systems, are affected, resulting in system failures and security loopholes.
4.3 Case 3: The Tardy Progress of the OGD Program in China
During the short history of OGD in China, no major crisis has yet taken place. Therefore, no single event is available as a case that exhibits the lifecycle. Instead, we investigate the relevant risk factors by examining a representative city government. Between December 2016 and May 2017, we interviewed seven government departments in Shanghai, including the Human Resources and Social Security Bureau, the Agriculture Commission, the Food and Drug Administration, the Audit Bureau, the Health and Family Planning Commission, the Trade and Industry Bureau, and the Planning and Land Resources Administration. According to the analysis of the interview data, the risk concerns of the departments related to OGD can be summarized into the following six aspects:
Risk of data distortion. Government statistical data, though eye-catching and frequently used by the public, are prone to distortion in the layer-by-layer hierarchical reporting system; this has become a systemic problem. One interviewee said: “It is not that our attitude to OGD is inactive, but that the authenticity of data reported by the subordinate departments cannot be guaranteed”. In January 2016, the Eighth Inspectorate of the Central Discipline Inspection Commission stated in their feedback to the Party Committee of the National Bureau of Statistics: some leading cadres sought personal gain by “statistics”, with rent-seeking through power [57].
Risk of low data quality. Given that some government departments' data are collected from enterprises, reported by subordinate departments or obtained through other diverse channels, some data suffer from inconsistent formats, unstandardized metadata, incompleteness and other quality problems, which prevents some departments from engaging in OGD pilots. One respondent noted: “The data collected directly from medical institutions and health administration bureaus at county or district level were found to have countless quality problems. We've been doing data cleaning for two to three consecutive years, so a lot of data have not yet been released.”
Risk of data aggregation. Although open data have been anonymized and desensitized, some departments are still worried about the existence of security risks for collective data release. One interviewee said: “The bulk of our data are disease information or disease prevention information, if released, it implicates personal privacy.”
Risk of undefined operational norms. Some departments stated that, due to the lack of technical standards and codes of practice, they are at a loss when facing certain problems in opening data. One interviewee suggested that “the scope of disclosure also needs to be stipulated. We are willing to participate in the construction and also want to provide useful data, but the specification of the application needs to be well defined.”
Risk of imperfect mechanisms. The lack of supervision, feedback and incentive mechanisms has led to the slow progress of the OGD project. Because feedback on the use of open data is scant, the enthusiasm of some departments for OGD has been dampened. One interviewee said: “We are in want of a mechanism for feedback and supervision, we would like to know where and to what extent the data we provide has realized their value. All these require feedback, which is also a positive incentive for us.”
Risk of data value-sparseness. Some departments want to know if their data are really useful to the public. “Every year we provide a lot of open data to the community (for social services), and a great deal of work has been done, but I am still a stranger to the data they (neighborhoods) are using and how they use them.” “We'd love to know who is using the data.”
5. Risk Integration and Classification
Through iterative comparison and merging, the risks identified from the three cases are integrated into 14 types of risks: content legitimacy, data quality, data value, data management, platform support, information security, organization support, resource input, institutional support, business process, use, imperfect regulations and standards, scarce external resources and external pressures. The 14 types of risks are then further categorized based on their sources and consequences, respectively.
5.1 Classification Based on the Risk Source
According to the source, the 14 types of risks are categorized into five classes: data (related) risks, technology (related) risks, management (related) risks, utilization (related) risks and external environment (related) risks.
5.1.1 Data risks
Most of the data published on OGD websites are submitted by various government departments, while a small part are created by the OGD department itself. These data vary in their attributes according to the distinct functions of the departments. Some attributes may lead to uncertain consequences if the data are opened. For instance, some data concern citizens' privacy, and their inappropriate disclosure may bring harm to the individuals involved. These risks arising from data attributes are categorized as data risks, which are mainly related to the legitimacy of the open content, the data quality, and the data value.
The risk in the legitimacy of open content refers to the possibility that the open contents do not comply with laws, regulations, or other social norms. The risk of low data quality includes the possible distortion, error, obsolescence, or incompleteness of data contents, format chaos, absent links or bad metadata. Risks of this kind may bring inestimable losses to data users or damage the government's credibility. The main problems lie in the phase of data collection and pre-processing. Some desensitized data also suffer from quality degradation after de-identification. The risk of low data value refers to the possible loss of invested resources due to opening data that have trivial value, have no clear users, or are rarely used.
5.1.2 Technology risks
Technology risks result from inadequate technical capacity and a slow uptake of new technologies in areas such as data management, platform support and information security. Technology problems in data management may lead to unsatisfactory effects of OGD; examples include the use of outdated techniques for data collection, improper methods for data cataloging, unstandardized metadata and improper data formats that encumber analysis and utilization. The low capacity and delayed updates of an OGD platform can bring disastrous consequences, as evinced in Case 2. Because of loopholes in the information security technologies of OGD platforms, worrisome consequences, such as privacy leakage, data tampering and corruption, platform damage or falsely authorized certification, may occur.
5.1.3 Management risks
Management risks are caused by problems that exist in OGD-related organizational structures, business processes, management styles, or mechanism designs; they comprise the following four types: organizational risks, resource input risks, institutional risks and business process risks. Organizational problems may include the lack of dedicated positions, the unbalanced allocation of power and responsibility between the data provider and receiver, poor inter-departmental communication and cooperation, weak OGD promotion strategies, and poor management of business outsourcing. Insufficient investment of talent or funds in OGD-related businesses may directly hinder the promotion of OGD. Incomplete management rules, business guidelines and incentive mechanisms, or ill-conceived business processes for OGD, may also bring unwanted consequences.
5.1.4 Utilization risks
Utilization risks arise from misuse, malicious or improper use, or insufficient use of the opened data, which may stem from the insufficient literacy or capacity of OGD users. Such use may lead to erroneous decision making, invasion of privacy, or insufficient exploitation of data value.
5.1.5 Environment risks
Environment risk refers to a harmful impact upon the process, mode, or outcome of an OGD program due to the limitations of current institutions, resources, or public expectations, including: (1) Risk of imperfect regulations and standards. Some obstacles to OGD may be attributable to imperfect laws or regulations, inoperable technical standards, or rigid government administration systems. (2) Risk of scarce external resources, e.g. the lack of data governance experts, insufficient talent supply, or immature knowledge of OGD. (3) Risk of excessive external pressure. As shown in Case 1, negative online opinions from pressure groups exerted significant pressure on the OGD initiative.
5.2 Classification Based on the Risk Consequence
Judging from the consequences, the aforementioned 14 subtypes of risks can be divided into two classes: risks to OGD and risks from OGD. The risks to OGD may hinder the smooth running of an OGD program but will not undermine the legitimacy of OGD itself. These risks comprise 10 sub-categories: risks in data management, information security, platform support, organizational adjustment, resource investment, institutional provision, business process, imperfect regulations and standards, scarce external resources, and excessive external pressures. The risks from OGD refer to the possible negative effects caused by OGD, which may lead to challenges to the legitimacy of OGD; they include illegal contents, low data value, poor data quality and improper data utilization. The risks from OGD may lead to a sceptical attitude towards an OGD initiative, while the risks to OGD may affect the success of an OGD program. The fear of either may become an actual barrier to OGD.
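The two classifications can be read as two labels attached to each of the 14 subtypes. The following is a minimal Python sketch of that mapping; the identifiers `TAXONOMY`, `by_consequence` and `source_counts` are illustrative names, not part of the original study, and the labels simply restate the classes described in Sections 5.1 and 5.2.

```python
from collections import Counter

# (subtype, source class, consequence class).
# "to" = risk to OGD, "from" = risk from OGD. Names are illustrative only.
TAXONOMY = [
    ("content legitimacy", "data", "from"),
    ("data quality", "data", "from"),
    ("data value", "data", "from"),
    ("data utilization", "utilization", "from"),
    ("data management", "technology", "to"),
    ("platform support", "technology", "to"),
    ("information security", "technology", "to"),
    ("organizational", "management", "to"),
    ("resource input", "management", "to"),
    ("institutional", "management", "to"),
    ("business process", "management", "to"),
    ("imperfect regulations and standards", "environment", "to"),
    ("scarce external resources", "environment", "to"),
    ("excessive external pressures", "environment", "to"),
]


def by_consequence(label):
    """Return the subtypes whose consequence class equals `label`."""
    return [name for name, _source, consequence in TAXONOMY if consequence == label]


# Number of subtypes in each of the five source classes.
source_counts = Counter(source for _name, source, _consequence in TAXONOMY)
```

Under these assumptions, `by_consequence("to")` yields the 10 sub-categories of risks to OGD and `by_consequence("from")` the four risks from OGD, while `source_counts` gives the size of each source class.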
5.3 A Holistic Taxonomy Model of Risks Associated with OGD
Based on the analysis above, a holistic taxonomy model of OGD-related risks is constructed, as shown in Figure 2. First, all risks identified from the three cases are listed together and compared. Those that are similar are clustered and categorized into an upper class according to their sources and consequences, respectively. Next, the correspondence between the source-based and consequence-based classifications is analyzed. Finally, a holistic taxonomy model of OGD-related risks is constructed. This model can help government departments and the public better understand OGD-related risks and thus devise more efficient response strategies.
6. Risk Distribution over the Lifecycle of OGD
6.1 Distribution of Risks from Five Sources
Using content analysis of the qualitative data from the three cases, the lifecycle distributions of risks from the five sources are revealed, as shown in Table 1. The frequency with which each type of risk occurs at every stage of the OGD lifecycle is calculated. The distributions of the 14 types of risks at the different stages are shown in Figure 3. Among the five stages, the data collection stage is risk intensive, with 13 of the 14 risk types appearing there. This implies that an appropriate risk management plan should be made before the OGD program begins.
Table 1. The lifecycle distribution of OGD-related risks.
Risk . | Collection . | Organization . | Release . | Utilization . | Maintenance . | Sum . | ||
---|---|---|---|---|---|---|---|---|
Risks from OGD | Data risks | Legitimacy | ✓ | ✓ | ✓ | 3 | ||
Data quality | ✓ | ✓ | ✓ | 3 | ||||
Data value | ✓ | ✓ | ✓ | ✓ | ✓ | 5 | ||
Utilization risk | Use | ✓ | 1 | |||||
Risks to OGD | Technology risks | Data management | ✓ | ✓ | ✓ | ✓ | ✓ | 5 |
Platform support | ✓ | ✓ | ✓ | ✓ | ✓ | 5 | ||
Information security | ✓ | ✓ | ✓ | ✓ | ✓ | 5 | ||
Management risks | Organizational management | ✓ | ✓ | ✓ | ✓ | 4 | ||
Resource input | ✓ | ✓ | ✓ | ✓ | ✓ | 5 | ||
Institutional | ✓ | ✓ | ✓ | ✓ | ✓ | 5 | ||
Business process | ✓ | ✓ | ✓ | ✓ | 4 | |||
Environment risks | Imperfect regulations and standards | ✓ | ✓ | ✓ | ✓ | ✓ | 5 | |
Scarce external resources | ✓ | ✓ | ✓ | 3 | ||||
Excessive external pressures | ✓ | | ✓ | ✓ | | 3 | ||
Sum | 13 | 11 | 11 | 10 | 11 | 56 |
6.2 The Occurrence Stages of 14 Types of Risks
A radar chart is drawn to show the number of stages at which each of the 14 risks occurs. As shown in Figure 4, the risks that may occur at all five stages are those related to data value, data management, platform support, information security, resource input, institutional provision, and imperfect regulations and standards. This indicates the importance of a strong technical platform and a sound management system for the success of an OGD program. The risks that may occur at four stages are organizational and business process risks, which again underlines the importance of OGD project management. The risks that may occur at three stages are those related to content legitimacy, data quality, the scarcity of external resources and external pressures.
Among the risks that occur at all five stages, only the risk related to data value is a risk from OGD. This means that, beyond the inherent characteristics of the data, the value of open data should be improved at every stage of the OGD lifecycle. The other three types of risks from OGD emerge at three or fewer stages; the risk of data use occurs only at the utilization stage. This implies that each specific type of risk from OGD can be controlled or avoided at a few specific stages. Most of the risks to OGD could be mitigated by improving project management, and a few of them could be avoided by strengthening regulations and technical standards, e.g. through the enforcement of the EU GDPR.
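The grouping of risks by number of occurrence stages follows directly from the Sum column of Table 1. As a rough sketch (the names `STAGE_COUNTS`, `RISKS_FROM_OGD` and `group_by_stage_count` are hypothetical; only the counts are taken from the table), the tally can be reproduced in Python as follows:

```python
from collections import defaultdict

# Number of lifecycle stages (out of five) at which each risk was observed,
# copied from the "Sum" column of Table 1.
STAGE_COUNTS = {
    "content legitimacy": 3,
    "data quality": 3,
    "data value": 5,
    "data use": 1,
    "data management": 5,
    "platform support": 5,
    "information security": 5,
    "organizational management": 4,
    "resource input": 5,
    "institutional": 5,
    "business process": 4,
    "imperfect regulations and standards": 5,
    "scarce external resources": 3,
    "excessive external pressures": 3,
}

# Risks that the taxonomy classifies as "risks from OGD".
RISKS_FROM_OGD = {"content legitimacy", "data quality", "data value", "data use"}


def group_by_stage_count(counts):
    """Map each stage count to the list of risks observed at that many stages."""
    groups = defaultdict(list)
    for risk, n_stages in counts.items():
        groups[n_stages].append(risk)
    return dict(groups)


groups = group_by_stage_count(STAGE_COUNTS)
total_occurrences = sum(STAGE_COUNTS.values())
# Risks from OGD that are present at all five stages.
pervasive_from_ogd = [risk for risk in groups[5] if risk in RISKS_FROM_OGD]
```

With these inputs, seven risks fall in the five-stage group, two in the four-stage group and four in the three-stage group, and the only five-stage risk that is also a risk from OGD is the one related to data value.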
7. Risk Management Strategies for OGD Projects
Risks from OGD and risks to OGD have different sources and consequences and are distributed at different stages in the OGD lifecycle. Strategies in response to them are suggested as follows:
7.1 Response Measures for Risks from OGD
(1) Data governance strategy for the risk in content legitimacy
The legitimacy of the open content is the first concern of many government departments, and has even become the default explanation for any deficiency in data openness. In response to this kind of risk, the government must accelerate the enactment of related laws and regulations (e.g. a privacy protection act and a data security act), establish examination criteria for privacy-involved data and build an effective data governance system to ensure the legitimacy of open contents. In May 2018, the EU's GDPR became enforceable, giving people more control over their personal data collected by all companies operating in the EU [45]. In the Nanhai district of Foshan city in Guangdong province, China, a data governance committee was established to determine the legitimacy of open data [30].
(2) Value appraisal and feedback mechanism for the risk in data value
Among the departments that are reluctant to open their data, some are unconvinced of the value of the data to be released. To ensure the value of data that have yet to be disclosed, it is necessary to establish a data value appraisal mechanism. Further, a feedback mechanism should be set up to inform the data-providing departments about the results of data utilization, thus giving them sufficient information to determine the priorities for future releases.
(3) Quality management and data provenance strategy for the risk in data quality
In response to the risk in data quality, guidelines and evaluation standards for stable data formats and metadata, secure platforms, and data provenance are needed to ensure the authenticity, integrity and usability of data at each stage of the lifecycle of an OGD program. For instance, the OPEN Government Data Act of the US requires federal agencies to publish their information online based on an underlying open standard that is maintained by a standards organization [58].
(4) Legislation strategy for the risk in data utilization
In response to data utilization risks, it is necessary to speed up legislation that clearly stipulates the purpose, means, scope, and results of data utilization, and to mete out legal penalties for the malicious use of data.
7.2 Improving Risk Management of OGD Projects
(1) Through the lifecycle of the risk itself
During the different periods of the risk lifecycle, different strategies are needed to warn, evaluate and respond appropriately. [59] proposed the establishment of a data risk warning mechanism, an internal control mechanism and the fostering of risk coping ability. (1) During the period of latent risk, government departments must actively analyze and identify potential risks, make efforts to avoid and transfer them, and prepare countermeasures in reserve for use in times of crisis. In this phase, although the government of Shanghai had foreseen possible adverse consequences at the early stages of the OGD program, it was not able to formulate a complete risk response plan because it lacked an understanding of OGD-related risks. (2) In the risk occurrence phase, it is necessary to initiate the emergency plan immediately and take effective measures to minimize losses, including cutting losses in time, enhancing communication and winning public understanding. Because the 2 million in publicity funds was not invested as originally planned, measures to win public understanding were not taken in time, and the UK's care.data initiative had to be stopped under public pressure [50]. (3) After the risk occurs, an emergency plan should be initiated immediately to minimize any adverse effects. Due to its limited capacity for crisis management and poor inter-departmental communication, the IRS failed to spot and solve the problem in time after the hacker invasion, and the data theft went undetected for three months [60].
(2) Through the lifecycle of the OGD project
At all stages of the lifecycle of an OGD project, it is necessary to implement corresponding risk control strategies.
At the stage of data collection, a complete OGD plan is needed. A detailed handbook should be developed to set the scope, conditions and technical means of data opening, especially the security level of data involving secrets.
At the data organization stage, a set of operable technical standards, covering metadata, data preprocessing, data catalogues, data documentation and the like, is needed to ensure trustworthy data quality.
At the stages of data release and utilization, detailed regulations should be made to ensure proper data use. In addition, emergency plans should be formulated in response to possible privacy leakage, malicious use, hacker and virus attacks, and weak publicity.
At the maintenance stage, plans should be made to prevent resource investment from becoming unsustainable, so as to ensure the smooth progress of the project.
8. Conclusion and Discussion
Risk aversion is a potential source of resistance to OGD, and it stems largely from a lack of understanding of the risks rather than from their uncontrollability. The construction of a taxonomy model of risks can help government departments understand them better, and thus avoid conservative inaction. By conducting a cross-case analysis on three cases of OGD in the UK, the US, and China, this study identified 14 types of risks associated with OGD. According to the risk source, the 14 types of risks are classified into five classes: data, technology, management, utilization, and environmental risks. According to the risk consequence, the 14 types of risks are classified into two groups: risks to OGD and risks from OGD. A holistic taxonomy model of OGD-related risks is then constructed. Based on the model, the distribution of each type of risk across five stages of the OGD lifecycle is analyzed. It is found that the stage of data collection is risk-intensive and seven types of risks may emerge at all five stages. It is also found that most of the risks to OGD could be avoided or mitigated by improving risk management throughout the lifecycle of an OGD project. In response to the risks from OGD, it is necessary to improve data governance, data value appraisal and feedback, quality management, and data provenance, and to speed up legislation.
Among the 14 types of identified risks, some have been discussed in previous studies, such as privacy leakage [61, 62], low data quality [17, 63], and improper use [64]. Beyond these, this study also identifies several new risks associated with OGD, including those concerning content legitimacy, data value, weak data management, excessive external pressures, institutional flaws, and external resource scarcity. Further, this study makes the following contributions to this field. First, it distinguishes the risks to OGD from the risks from OGD, and thus differentiates the legitimacy of OGD from its smooth realization. This has theoretical implications for deepening the understanding of OGD-related risks and supporting the rationality of an OGD initiative. Second, this study identifies OGD-related risks at different stages of the lifecycle of an OGD program from three cases in different countries and builds a holistic taxonomy model after clustering them. This fills the gap left by previous studies that focused on a single country and presented the risks in a fragmentary way. Third, this study analyzes the distribution of each type of risk over five stages of the lifecycle of OGD and suggests specific strategies in response to them. This has practical implications for government departments that are promoting OGD.
The limitations of this study include: (1) Due to the authors' geographical limitations, the data for the cases of the UK and the US were mainly collected from online sources. Although Google offers comprehensive search results, the absence of interviews with related government officials prevented the researchers from directly assessing the attitudes of government staff on this point. To make up for this limitation, several rounds of in-depth interviews were conducted in the case of Shanghai. (2) The taxonomy model of OGD-related risks is built with a bottom-up induction approach from three cases. Although it can deepen the understanding of OGD-related risks, it might not cover all the potential risks that other government departments face. In the future, we will extend our approach to improve the taxonomy model of OGD-related risks with more cases and to empirically test the correlation between risk management and OGD performance.
Author Contributions
F. Wang (wangfangnk@nankai.edu.cn) designed the whole framework, examined the results of data analysis and wrote the final paper. A. Zhao (zhaoanstudy@163.com) collected most of the data of the first two cases and made preliminary data analysis. H. Zhao (zhaohong@mail.nankai.edu.cn) analyzed part of the data and revised the draft. J. Chu (chujun620@163.com) took part in all the interviews and transcribed the records.
Acknowledgements
We would like to express special thanks to the reviewers and editors for their valuable comments, as well as the interviewees and news commentators for their insightful viewpoints. We are grateful to Jiayue Ma and Wei Zhao, master's students, and Xiaoyu Wang, Weichong Zhang, and Jing Yang, doctoral students, of Nankai University, and to Yichen Zhang, master's student at UCSD, for their help with data collection and figure drawing.
This work has been funded by the project of National Engineering Laboratory of Big Data Application Technology for Improving Government Governance Capability: “The Large-scale Intelligent Government Document Processing Technology based on Nature Language Processing and Deep Learning” and “Improving the Governance Capability of the Government with Big Data”; the project of National Social Science Fund of China “Network Society Governance in China” (grant number: 14ZDA063), and the project of National Natural Science Fund of China: “Research on the Organization and Mode of Modern Social Governance” (grant number: 71533002).
References
Author notes
Business School of Nankai University, Tianjin 300071, China
Business School of Nankai University, Tianjin 300071, China
CETC Big Data Research Institute Co. Ltd., Guiyang 550081, China
Business School of Nankai University, Tianjin 300071, China