Abstract
Scholarly content has become more difficult to find as information retrieval has devolved from bespoke systems that exploit disciplinary ontologies to keyword search on generic search engines. In parallel, more scholarly content is available through open access mechanisms. These trends have failed to converge in ways that would facilitate text data mining, both for information retrieval and as a research method for the quantitative social sciences. Scholarly content has become open to read without becoming open to mine, due both to constraints by publishers and to lack of attention in scholarly communication. The quantity of available text has grown faster than has the quality. Academic dossier systems are among the means to acquire higher quality data for mining. Universities, publishers, and private enterprise may be able to mine these data for strategic purposes, however. On the positive front, changes in copyright may allow more data mining. Privacy, intellectual freedom, and access to knowledge are at stake. The next frontier of activism in open access scholarship is control over content for mining as a means to democratize knowledge.
1. DATA, TEXT, AND MINING
Scholarship has become datafied as text, images, sound, video, numerical observations, and other forms of intellectual materials meld together as born-digital content. While extant cultural artifacts such as older books, paper archives, and physical objects are unlikely to be replaced by digital records, the scholarly research about those materials will be published as digital objects, whether journal articles, books, “papers,” videos, data sets, or other entities.
Paradoxically, the proliferation of digital content has made scholarly information harder to find. In the days of print publication, libraries cataloged books meticulously, providing multiple points of entry by author, title, subject, and other bibliographic elements. Variant forms of author names were cross-referenced and clustered under a curated authority record. Online catalogs, starting in the late 1970s, offered Boolean search capabilities that exploited these multiple indexes. Journal articles were described by indexing and abstracting services, often providing extensive subject-analytic metadata drawn from discipline-specific thesauri. The I&A services, as they were known, offered elaborate search functions that exploited these metadata and thesauri. User interfaces were cumbersome, but in the hands of experts, these bibliographic databases could be mined with great scholarly sophistication (Borgman, 2000, 2007, 2015; Borgman, Moghdam, & Corbett, 1984).
Today’s search is dominated by keyword strings, flattening out the rich structure of earlier digital library systems. Users type a few words into a search engine, leaving the combinatorics to proprietary algorithms whose rules are known only to the companies that deploy them. Even search engine providers may be hard pressed to explain precisely how any given set of results is retrieved, given the use of machine-learning techniques that adapt continuously to changes in individual profiles, in auction algorithms that rank results by advertiser payment, and in proprietary knowledge graphs.
As a result of these and other changes in information retrieval, many scholars are finding that the best way to mine databases of text and other content with sufficient sophistication is to write their own algorithms and scripts. Searching databases, web archives, and other digital content is now known generically as text data mining (TDM), although the search may include more than text (McDonald & Kelly, 2014). When the content being searched is open, these methods may be known as open content mining (Murray-Rust, Neylon, et al., 2010).
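The scripts involved need not be elaborate. As a minimal sketch of this do-it-yourself style of TDM, the following Python fragment counts term frequencies across a folder of plain-text articles; the folder path and stop-word list are illustrative assumptions, not drawn from any particular study.

```python
# Minimal sketch of a do-it-yourself TDM script: count term frequencies
# across a local folder of plain-text articles. The corpus path "./articles"
# and the stop-word list are hypothetical placeholders.
import re
from collections import Counter
from pathlib import Path

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "that"}

def term_frequencies(corpus_dir: str) -> Counter:
    """Tokenize every .txt file in corpus_dir and tally non-stop-word terms."""
    counts = Counter()
    for path in Path(corpus_dir).glob("*.txt"):
        text = path.read_text(encoding="utf-8", errors="ignore").lower()
        tokens = re.findall(r"[a-z]+", text)
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts

if __name__ == "__main__":
    # Print the 20 most frequent terms in the corpus.
    for term, n in term_frequencies("./articles").most_common(20):
        print(f"{term}\t{n}")
```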
TDM requires as much technical sophistication on the part of researchers in the quantitative social sciences as was required of librarians in earlier days of information retrieval. TDM is gaining in popularity in the social sciences to model behavior and policy, in the sciences to extract data from publications, and in the humanities to explore history, culture, linguistics, philology, and more. Data mining can regain many of the advantages of sophisticated ontology-based tools of an earlier era by giving the searcher fine-grained and transparent control over the search process, at scale.
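To see how much of that earlier fine-grained control can be reconstructed in a few lines, the sketch below builds per-field inverted indexes over a handful of invented sample records and combines them with explicit Boolean set operations, so every step of the search is visible to the researcher rather than hidden behind a proprietary ranking algorithm.

```python
# A sketch of fielded Boolean retrieval, in the spirit of earlier
# bibliographic systems: index each record by field, then answer queries
# with transparent set algebra. The sample records are invented.
from collections import defaultdict

records = [
    {"id": 1, "author": "borgman", "subject": "information retrieval"},
    {"id": 2, "author": "harnad", "subject": "open access"},
    {"id": 3, "author": "borgman", "subject": "open access"},
]

# Build one inverted index per field: (field, term) -> set of record ids.
index = defaultdict(set)
for rec in records:
    for field in ("author", "subject"):
        for term in rec[field].split():
            index[(field, term)].add(rec["id"])

# Boolean AND is just set intersection; OR is union; NOT is difference.
hits = index[("author", "borgman")] & index[("subject", "open")]
print(sorted(hits))  # -> [3]
```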
Open access publishing is a parallel trend, where scholarly publications are available to readers without charge. A growing proportion of new scholarly articles (and, to a lesser extent, books) is publicly available immediately or within a few months of initial release. In principle, open access publishing should make much more content available for TDM, which, in turn, would facilitate open content mining. In fact, open access publishing does not appear to be advancing the scale of TDM. The failure of these two trends to converge is the subject of this article.
2. OPEN DATA, CLOSED DATA, AND MINABLE DATA
Researchers have sought technical access to proprietary databases of published materials since the earliest days of online databases in the late 1970s, yet publishers continue to write contracts with university libraries based on assumptions of human readership. By the time of Google Books and the associated author lawsuits, around 2005, we learned that publishers wished to restrict “non-consumptive use” of scholarly content (Duguid, 2007; Leetaru, 2008; Nunberg, 2010). Throughout this period, the move toward open access to journal articles accelerated, with arXiv launching in 1991 (Ginsparg, 2011) and PubMed Central in 2000 (PMC Overview, 2018). Numerous other discipline-specific preprint servers, institutional repositories, and commercial services designed to distribute or redistribute open access versions of scholarly publications have been launched since. Concurrently, open access to publications became mandatory or highly recommended by many funding agencies and universities, in the United States and internationally (Borgman, 2015; Boulton, Babini, et al., 2015; Enserink, 2016; Piwowar, Priem, et al., 2018; Rabesandratana, 2019; Willinsky, 2018).
As a consequence of open science policies and practices, a growing amount of digital content is available as open access for downloading, whether in open access journals, data archives, institutional repositories, library catalogs, preprint servers (such as arXiv, SocArXiv, and bioRxiv), government databases, social media, web portals, public agencies, or elsewhere. Open access to content does not necessarily mean that these data are minable, however. In many, if not most, cases, the user interfaces to these sources presume a human user who is capable of reading a web page, searching for content, and selecting individual items for download. The number of records that may be downloaded for local mining may also be limited. Robots may or may not be allowed to search open access databases. Scholars and libraries are pressing for greater privileges to mine journals, books, and other intellectual resources (Lammey, 2014; Senseney, Dickson, et al., 2018; Van de Sompel, 2013; Van de Sompel, Rosenthal, & Nelson, 2016; Williams, Fox, et al., 2014).
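Whether robots are permitted is typically declared in a site’s robots.txt file, which a well-behaved TDM script checks before harvesting. The sketch below, using only Python’s standard library, illustrates the check; the repository URL and user-agent string are hypothetical placeholders, not a real service.

```python
# A sketch of the "is mining even permitted?" check a TDM harvester should
# make. The repository base URL and user-agent are hypothetical.
import time
import urllib.robotparser
import urllib.request

BASE = "https://repository.example.edu"  # hypothetical open access repository
AGENT = "tdm-research-bot/0.1"

rp = urllib.robotparser.RobotFileParser()
rp.set_url(BASE + "/robots.txt")
rp.read()

urls = [BASE + f"/record/{i}" for i in range(1, 4)]
delay = rp.crawl_delay(AGENT) or 1.0  # stay polite even if no delay is declared
for url in urls:
    if rp.can_fetch(AGENT, url):
        with urllib.request.urlopen(url) as resp:  # download for local mining
            text = resp.read()
        time.sleep(delay)
    else:
        print(f"robots.txt disallows mining {url}")
```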
2.1. Open to Read vs. Open to Mine
Open science, in policy and in concept, is intended to improve transparency, accountability, and access to knowledge by providing open access to publications, data, and software; stewarding collections of scholarly resources for the long term; and making research data more findable, accessible, interoperable, and reusable (FAIR) (Borgman, 2015; Boulton et al., 2015; European Union Publications Office, 2018; Wilkinson, Dumontier, et al., 2016). While open science policies and practices have made great headway in increasing access to publications for reading and to research data for downloading, making scholarly content available for data mining is rarely a stated priority. Thus, the scholarly communication paradox: Open access to text for reading may not yield open access to text for mining.
The scholarly communication paradox can be traced to the early days of the internet and digital publishing. Activists’ goals for open access to scholarly materials were to democratize access to knowledge and to limit the ability of big publishers to control access to scholarly content via expensive contracts. Whereas open access proponents viewed digital publishing as a liberating technology, commercial publishers saw economic efficiencies and new markets (Borgman, 2007; Harnad, 1991, 1999, 2005; Suber, 2012; Willinsky, 2006).
Conflicts between democratization and publisher control intensified as open access to publications became the norm. To make articles available free of charge to readers, commercial publishers developed new business models that require authors to pay several thousand dollars (or euros) to make a single article open access. Subscription charges to university libraries continue, despite these author fees, which has led to new rounds of negotiation between publishers and universities. Several large countries and university systems recently terminated contracts with large publishers when talks broke down (Ellis, 2018; Kwon, 2017; UC and Elsevier, 2019; Yeager, 2018).
The cancellation of publisher contracts has received far more public attention than has the quieter consolidation of infrastructure for scholarly communication. A small group of large publishers is consolidating the industry by purchasing smaller publishers and by acquiring technology and content companies across the spectrum of academic services (Posada & Chen, 2018). Of particular note is the purchase of open access preprint servers such as SSRN and Bepress by commercial publishers, which rebrands community resources as corporate content. Academic authors who contributed papers to these repositories as community-based, not-for-profit enterprises are not happy (Cookson, 2016; Ellis, 2019; Elsevier, 2017; McKenzie, 2017; Pike, 2016). In sum, open access is not turning out to be the information commons that was envisioned by its pioneers (Benkler, 2004; Hess & Ostrom, 2007; Kranich, 2004; Lessig, 2001; O’Sullivan, 2008; Reichman, Dedeurwaerdere, & Uhlir, 2009; Reichman, Uhlir, & Dedeurwaerdere, 2016).
Intellectual property issues abound. Researchers who wish to mine texts, and libraries who have paid large sums for digital access to published content, often claim that text mining should fall under fair use protections of copyright. (Legal protections vary by country; “fair use” is a term specific to U.S. law.) Publishers, in turn, often claim that their contracts cover only “consumptive use” by human readers and that universities should pay additional fees for mining access. Complicating matters further, large text corpora may contain both public domain and copyrighted materials that are indistinguishable for mining purposes (Baldwin, 2014; Elkin-Koren, 2004; Elkin-Koren & Fischman-Afori, 2017; Levine, 2014; Senseney et al., 2018; Wilkin, 2017).
2.2. Mining Quantity vs. Quality
Researchers’ ability to mine text is fraught with complications, above and beyond the intellectual property and contractual challenges. User interfaces to bibliographic databases provide minimal mining capabilities and may limit the number of records that can be downloaded. Researchers report missing records and a general lack of transparency in search results when they attempt to download files for TDM (Dickson, Senseney, et al., 2018; Senseney et al., 2018).
Data quality is another complication for TDM. Original articles typically provide accurate bibliographic descriptions, and may also include “please cite as” instructions. However, references to published articles, which are essential for bibliometrics or for integrating content across databases, are inherently dirty data due to the vagaries of how authors create reference lists. A bibliography in a journal article is far from the “necessary and sufficient” set of citations that might be assumed by bibliometric evaluations. Rather, it is often an idiosyncratic list of familiar sources, compiled based on what is handy when the publication is submitted. Too few authors are bibliographic purists who verify middle initials, dates, DOIs, and page, volume, and issue numbers (Borgman, 2015, 2016). Complicating matters further is the lack of agreement on bibliographic styles. At last count, Zotero offered about 9,500 journal styles for referencing, representing about 2,000 unique bibliographic styles (Zotero Style Repository, 2019).
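To illustrate the matching problem, the sketch below compares two invented renderings of the same book in different bibliographic styles: naive string comparison fails, while reduction to a normalized bag of tokens reveals the match. Real citation-matching systems layer far more machinery (DOIs, field parsing, fuzzy matching) on top of this idea.

```python
# A sketch of why citation matching is hard: the same work, rendered in
# two different bibliographic styles, matches only after normalization.
# The reference strings are invented for illustration.
import re

ref_a = "Borgman, C. L. (2015). Big Data, Little Data, No Data. MIT Press."
ref_b = "C.L. Borgman, Big data, little data, no data, MIT Press, 2015"

def normalize(ref: str) -> frozenset:
    """Reduce a reference string to a bag of lowercase alphanumeric tokens."""
    return frozenset(re.findall(r"[a-z0-9]+", ref.lower()))

print(ref_a == ref_b)  # False: naive string comparison fails
overlap = normalize(ref_a) & normalize(ref_b)
print(len(overlap) / len(normalize(ref_a)))  # 1.0: token sets reveal the match
```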
One way to get cleaner data is to extract them from authors’ curricula vitae, as authors have a vested interest in providing accurate lists of their own oeuvre. However, CVs tend to be closely held documents in many fields. While some individuals post their CVs on web pages, few are comprehensive or current. To the extent that authors consistently submit their publications to institutional repositories, which is also rare, these could become reliable sources for bibliographic data.
2.3. Privacy and Intellectual Freedom
As universities automate academic personnel processes, faculty dossiers become high-quality sources of bibliographic data. These digital dossiers are typically isolated from the public record for privacy protection. Individuals can give informed consent for specific uses of specific data, such as a dossier for hiring or promotion. In principle, bibliographic records could be separated from confidential review letters, allowing bibliographies to become public records that could be mined. In practice, this opportunity rarely arises, even as an opt-in or opt-out mechanism.
However, these digital dossiers on academic staff are becoming rich sources to be mined by universities, publishers, and data analytics companies. When dossiers were paper files, academic personnel processes were entirely internal to universities. When they became digital files, a new market arose for data management and mining of these materials. Some of these academic analytic companies are independent or privately held; others are among the entities acquired by major publishers in recent years (Ellis, 2019; Posada & Chen, 2018). Rather than build their own infrastructure, universities are outsourcing many of their academic personnel services to these companies. Job applicants submit dossiers to websites, as do those who write their letters of reference. Candidates for tenure and promotion also upload their files to university portals on these systems. Dossier-hosting services have certain mining rights under their contracts with universities. Similarly, universities may mine these data for strategic purposes beyond the personnel action for which they were harvested. As faculty become aware of these systems and practices, concerns arise about who has access to their dossiers and how the data can be mined for making decisions about their careers, their departments, and their fields (Borgman, 2018a; Ellis, 2019).
The emerging academic analytics industry appears to be following the successful business models of Alphabet/Google, Facebook, and Amazon in aggregating vast amounts of data about people’s lives. To the consumer, they promote the advantages of improving user experience with intelligent adaptation. To their business clients and investors, they promote the advantages of predictive analytics that can be deployed to strategic advantage. In the academic community, predictive analytics are being used to assess the performance of students and faculty, departments, universities, journals, research programs, and much more. The concentration of data by a few large players gives them a “god’s eye view” of their domains, with minimal oversight or regulation (Economist, 2017).
A related concern is the ability of publishers to surveil uses of scholarly materials. Ownership of intellectual property carries a large set of rights and responsibilities, some of which are associated with privacy protection and intrusion. Corporate owners of scholarly publishing, mass media, and social media content deploy digital rights management (DRM) technologies to track uses and users in minute detail. These technologies have eroded traditional protections of privacy and intellectual freedom in libraries and other domains (Cohen, 1996; Lynch, 2017).
The ability of publishers and other database companies to surveil the uses of their content also has implications for intellectual freedom. To submit TDM queries to some of these systems, researchers may explicitly, or sometimes implicitly, be providing database owners with their research questions and methods. These constraints are of considerable concern to many researchers, who would prefer to search anonymously or to download text for local manipulation (Dickson et al., 2018). Among the motivations for the HathiTrust Digital Library to build a research center is to facilitate TDM within the constraints of copyright law, supported by a rich array of tools (HathiTrust Digital Library, 2019). Another positive development is a shift in international copyright law to allow more TDM for scholarly and other purposes, on the grounds that these constraints would limit the growth of new data-intensive commerce (Samuelson, 2019).
3. DISCUSSION AND CONCLUSIONS
As scholarly information retrieval has degraded from customized discipline-specific tools to generic search engines, TDM has become researchers’ best option for sophisticated information retrieval and content analysis. Open access publishing, despite making vastly more scholarly content available to read online, has not resulted in substantial improvements in open content mining. The lack of convergence of TDM and open access is due partly to a lack of foresight by activists who focused on human readers alone. TDM and robotic searching also democratize access to knowledge. The larger cause for the lack of convergence is the vested interests of publishers and other private stakeholders in maintaining control over intellectual property. These forms of control have proven lucrative, as more uses can be made of bibliographic data and scholarly materials through mining and combining with other intellectual assets (Posada & Chen, 2018).
Scholarly research with TDM methods, pioneered in the humanities in the 1960s, has benefited from advances in computation and data science. Researchers have deployed these methods, alone or in combination with other analytical tools, across academe. TDM is among the methods on which the quantitative social sciences depend. The irony is that scholars produce the content that is valuable to mine and build many of the tools on which these methods depend, yet encounter ever more barriers in their efforts to exploit those texts in new ways. Universities collectively, and academic authors individually, have led the fight for more open access to knowledge for readers. University contracts with publishers are changing. Authors have more control over where they submit their work, and more opportunity to post their work in open access repositories. Copyright law is allowing more data mining. Now is the time for activism on uses of our scholarly content. By enabling TDM on our works, individually and collectively, readers and researchers can make fuller use of scholarly knowledge. Scholars are overdue in asking, “Whose text, whose mining, and to whose benefit?”
COMPETING INTERESTS
The author has no competing interests.
FUNDING INFORMATION
No funding was received for this research.
ACKNOWLEDGMENTS
This paper is an expanded version of a discussion paper written for Data Mining with Limited Access Text: National Forum in 2018 (Borgman, 2018b; Dickson et al., 2018). Thank you to the organizers of the forum for the invitation, and to Michael Scroggins and Morgan Wofford of UCLA for comments and discussion on earlier drafts.
AUTHOR NOTES
Handling Editors: Loet Leydesdorff, Ismael Rafols, and Staša Milojević