Abstract

We detect ongoing innovation in empirical data about human technological innovations. Ongoing technological innovation is a form of open-ended evolution, but it occurs in a nonbiological, cultural population that consists of actual technological innovations that exist in the real world. The change over time of this population of innovations seems to be quite open-ended. We take patented inventions as a proxy for technological innovations and mine public patent records for evidence of the ongoing emergence of technological innovations, and we compare two ways to detect it. One way detects the first instances of predefined patent pigeonholes, specifically the technology classes listed in the United States Patent Classification (USPC). The second way embeds patents in a high-dimensional semantic space and detects the emergence of new patent clusters. After analyzing hundreds of years of patent records, both methods detect the emergence of new kinds of technologies, but clusters are much better at detecting innovations that are unanticipated and undetected by USPC pigeonholes. Our clustering methods generalize to detect unanticipated innovations in other evolving populations that generate ongoing streams of digital data.

1 Introduction

Patented technology has been recognized as a fertile context for study of evolutionary properties of a system that is producing an ongoing stream of innovations [2, 3, 11, 14, 17, 26, 27, 29]. Each patent may be regarded as representing an innovation, as this is an explicit requirement for an invention to be granted a patent. However, it can be difficult to detect innovations if patents are classified with predefined technology pigeonholes such as those used by the United States Patent and Trademark Office (USPTO), especially if the innovations were unanticipated when the pigeonholes were defined. Here we demonstrate a different classification paradigm, using methods from natural-language processing and machine learning. While some others view technology as complex combinations of units that may be represented by patents [2, 3, 27], we focus on the intrinsic properties of individual patents, obtained directly from the patent texts (titles and abstracts). Every patent may be regarded as representing a technological innovation, but some innovations are small new twists on existing technology, while others open a path to a stream of subsequent innovations. To identify significant high-level patterns of innovation, we group patents into clusters and observe signatures of large-scale innovation in the relation between clusters. The result is a method for detecting significant open-ended technological innovation in an ongoing stream of new patents. In addition to patent data, our methods apply to an ongoing stream of empirical data generated either by models or by physical systems in the real world.

2 The Problem of Detecting Innovations

New technologies are entities, and ongoing technological innovation involves generating new kinds of entities—one of the forms of open-ended evolution (OEE) cataloged by [28]. Examples of OEE are typically biological, and living organisms differ in many ways from human technologies, but both evolve in a way that is open-ended. Both change and evolve over time, both learn from experience and adapt to their environment through incremental improvements, and both often increase in complexity over time. Understanding the open-endedness in either might illuminate it in the other.

To study technological innovations we use a common convenient proxy: patented inventions. Patent records are detailed, accurate, digital descriptions of each patented invention, and they provide the raw data in which we will try to detect significant innovations. Contemporary methods in statistics and machine learning can detect subtle patterns in huge piles of patent records, and we use those tools to detect the generation of new technological innovations. Our methods easily generalize to detecting innovations in other domains that produce sufficient digital data.

Open-ended evolution is an enigmatic topic [28], and one reason is the emergence problem: the difficulty of detecting entities that are so novel that we have no distinctive descriptions for them [6]. This problem is especially acute for those aiming to detect the emergence of new technological innovations, armed only with a predefined classification created by human experts. The problem is not just the biases, preconceptions, and other epistemic shortcomings of any human expert. A more specific challenge is to detect innovations that are unanticipated; perhaps nanotechnology and genomic medicine are recent examples. A classification with blind spots will be unable to see certain innovations; blind spots raise the probability of undetected innovations, that is, false negatives.

We compare two methods for detecting innovations, linked to two methods of classifying entities. One method classifies entities into a fixed and finite list of pigeonholes and detects innovations by observing when pigeonholes are first exemplified. The second method classifies entities into threads of clusters of patents located in an abstract technology feature space, and detects when each cluster thread is first exemplified.

A pigeonhole must be predefined before anyone can classify something with it. Pigeonholes are typically defined by human experts who are guided by the history of technological innovations, and this historical grounding is a mixed blessing; it helps those looking backwards but can hinder those looking forwards. Well-defined pigeonholes can be apt for describing certain historical patterns of innovation, but below we will see an example of their ineffectiveness at detecting unanticipated future innovations.

When classifying with either pigeonholes or clusters, for simplicity here we consider only flat (i.e., non-hierarchical) classifications with non-overlapping classes. Formally we construe a classification as a set of classes of patents, where each class may be labeled with an integer, so the classification is a map from the set of patents $P$ to the integers $I$, $C : P \to I$. We will assume that there is a finite number of different classes of patents. Class i is the set of all patents mapping to i, or $C^{-1}(i)$. For classifications consisting of pigeonholes, i is the integer indexing the pigeonhole on a predefined list. When we classify patents with clusters, i is the integer indexing a thread of clusters on a predefined list of cluster threads.

Evolutionary activity statistics have been used to measure the rate of adaptive innovation in a number of computer models [4, 6] as well as in biological populations [5] and cultural populations [11]. So, our search for technological innovations includes looking for a characteristic statistical signature: sustained high evolutionary activity. We search for this signature in patent data, using both pigeonholes and clusters.

An evolutionary analysis requires the data to have a temporal order, so we put patent records into data frames that each contain N successively issued patents (here, N = 50,000). Flipping through the temporal sequence of frames shows a “movie” of the evolution of newly patented technology. We define the evolutionary activity of pigeonhole or cluster i at time t as the cumulative sum of the number of instances of i from the time of its first exemplification to t. If $C_i^t$ is the number of patents in category i at time t, then the activity is
$$A_i^t = \sum_{t' \le t} C_i^{t'}$$

New successful innovations have unusually high evolutionary activity statistics, because innovations with many instances amplify the cumulative sum statistic. When evolutionary activity is plotted as a function of time, steeply rising activity waves indicate new successful innovations. The density of new waves at a particular time is one measure of the rate at which innovations are occurring at that time [5].
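
The cumulative-sum definition of evolutionary activity can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names `evolutionary_activity` and `first_exemplified`, and the toy per-frame counts, are hypothetical.

```python
from itertools import accumulate

def evolutionary_activity(counts_per_frame):
    """Cumulative activity of one class, frame by frame.

    counts_per_frame[t] is the number of patents of the class in frame t.
    Frames before the first exemplification contribute zero, so the
    running sum is 0 until the class first appears.
    """
    return list(accumulate(counts_per_frame))

def first_exemplified(counts_per_frame):
    """Index of the first frame containing the class, or None if never seen."""
    for t, count in enumerate(counts_per_frame):
        if count > 0:
            return t
    return None

# Hypothetical toy data: patents per frame for two classes.
counts = {
    "old_class": [5, 3, 0, 0, 0],   # flatlines after frame 1
    "new_class": [0, 0, 4, 9, 12],  # new wave starting at frame 2
}
activity = {name: evolutionary_activity(c) for name, c in counts.items()}
```

In this toy case the activity of `old_class` goes flat once it stops receiving patents, while `new_class` produces a rising wave that starts at its first exemplification.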

To study the challenge of detecting unanticipated innovations in an important practical context, we measure evolutionary activity in one hundred and seventy years of USPTO patent records, and we identify newly instantiated pigeonholes in the USPC classification and newly emerging clusters in an abstract technology feature space inferred from patent records.

3 Detecting Innovations with Technology Pigeonholes

The United States Patent and Trademark Office (USPTO) classifies each patent by putting it into one (or more) classes chosen from the United States Patent Classification (USPC). Those classes are further subdivided into codes and subcodes, and the classes are further collected into categories and subcategories; but we can ignore those details here. The USPTO revises the USPC from time to time, and then it reclassifies all earlier patented inventions into the revised classes. But the USPTO always aims to make each USPC revision correct and complete. The classification is revised only to correct errors or misclassifications. The historical record of USPC revisions is a list of past and now defunct classifications that have been abandoned by the USPTO. Here we ignore those discarded classifications and focus on the USPC classification that is currently accepted and used by the USPTO today.

Figure 1 depicts the evolutionary activity of each class in the USPC. The middle panel figure blows up the activity scale on the y axis by two orders of magnitude, in order to highlight the new activity waves caused when USPC classes are first exemplified. The blowup clearly shows new waves of evolutionary activity continually sweeping up through the figure—the signature of the ongoing generation of new significant technological innovations.

Figure 1. 

Top panel: time series of the evolutionary activity of each class in the USPC, computed from 1845 to 2015. The frame containing a class's first exemplification is indicated by its color, from oldest (blue) to newest (red). A handful of specific classes are given other specific colors: Three shades of green are used to color the top four classes in years 1875 (yellow-green), 1950 (grass-green), and 2015 (moss-green), respectively, discussed in Table 1. The golden waves are the lowest waves with at least 100 patents per class, listed in Table 2. The black waves indicate the final group of heavily instantiated classes, listed in Table 3. Middle panel: blowup of the bottom 0.5% of the evolutionary activity scale. The special classes are listed in Tables 1, 2, and 3. Bottom panel: The data in the middle panel are presented with time shown in chunks of 50,000 successively issued patents; compare with Figure 6.

Figure 1 shows that many classes were first exemplified early on in the 1850s, but the density of newly exemplified classes generally lessens over the first hundred years. After 1945 the rate picks up again, and a large number of classes are first exemplified at the creation of the modern-day USPTO in 1976. But then the rate quickly drops to zero and remains there through the final twenty years.

The most heavily instantiated USPC classes are those corresponding to activity waves at the top of the figure. Table 1 shows how the most active classes changed over the past century and a half (shades of green). The most instantiated USPC classes in 1875 (yellow-green) concerned earth working, stoves and furnaces, harvesters, and cutting—all technologies of high practical value over a century ago, if not today. The most active classes in 1950 (grass-green) covered machine parts, fasteners, fluid handling, and buildings—more important today than stoves and cutting, but still secondary today. In 2015 the most active topics (moss-green) included drugs and body-treating compositions, semiconductor manufacturing, and solid-state devices—all of primary importance today. The fourth most active class (stock material or miscellaneous articles) is a grab bag of miscellaneous inventions that fall into no other existing class.

Table 1. 
The most heavily instantiated USPC classes in 1875, 1950, and 2015. At those times the evolutionary activity waves of these classes are the top waves (i.e., the waves having the top four values of $A_i^t$ for those times) in the top panel of Figure 1. Activity waves for the classes listed in this table appear in shades of green.
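Most instantiated USPC classes (shades of green in Figure 1)

| Year (color in Figure 1) | ID | Class name | No. of patents |
| --- | --- | --- | --- |
| 1875 (yellow-green) | 172 | Earth working | 4741 |
| 1875 (yellow-green) | 126 | Stoves and furnaces | 4301 |
| 1875 (yellow-green) | 56 | Harvesters | 3764 |
| 1875 (yellow-green) | 83 | Cutting | 2807 |
| 1950 (grass-green) | 74 | Machine element or mechanism | 36965 |
| 1950 (grass-green) | 24 | Buckles, buttons, clasps, etc. | 33446 |
| 1950 (grass-green) | 137 | Fluid handling | 31933 |
| 1950 (grass-green) | 52 | Static structures (e.g., buildings) | 29113 |
| 2015 (moss-green) | 514 | Drugs, bio-affecting, and body-treating compositions | 126417 |
| 2015 (moss-green) | 257 | Active solid-state devices | 107974 |
| 2015 (moss-green) | 438 | Semiconductor device manufacturing | 103468 |
| 2015 (moss-green) | 428 | Stock material or miscellaneous articles | 100902 |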

The least-instantiated USPC classes include button making, whips and whip apparatus, and needle and pin making; see Table 2. The activity waves of these classes are golden colored and are visible as three of the four nearly horizontal waves in the bottom panel of Figure 1. These three flatlined classes were first exemplified long ago and are now archaic, but the fourth (class 260) concerns the chemistry of carbon compounds. Class 260 has flatlined because it was reclassified so often that it now contains only 288 patents.

Table 2. 
The least-instantiated USPC classes in 2015 with at least 100 patents. In the lower panel of Figure 1, the evolutionary activity waves of these classes (illustrated in a golden color) are nearly horizontal, indicating very few recent additions to these classes.
Least-instantiated USPC classes (golden in Figure 1)

| ID | Class name | No. of patents |
| --- | --- | --- |
| 79 | Button making | 474 |
| 231 | Whips and whip apparatus | 350 |
| 260 | Chemistry of carbon compounds | 288 |
| 163 | Needle and pin making | 177 |

Figure 1 shows many heavily instantiated USPC classes, and their activity waves shoot up through the top of the blowup (bottom). Before the end of the last century the activity waves for the last remaining classes in the current USPC went shooting up with new instances. The recently active classes (black in Figure 1), shown in Table 3 (e.g., inter-program communication, combinatorial chemistry, and information security), all remain important technologies today.

The continual production of new activity waves like those in Figure 1 provides one window on the contingent and open-ended process by which the technology pigeonholes in USPC were exemplified. But pigeonholes share an obvious limitation with any fixed and finite classification: As more and more pigeonholes have been exemplified, the classification becomes increasingly blind to new innovations. This drives up the number of false negatives and worsens the emergence problem. After the last pigeonhole has been exemplified, no further innovations can be detected—not without creating new pigeonholes—and every new innovation becomes another false negative.

The increasing blindness of fixed and finite classification apparently explains the general decline in the rate of newly exemplified classes in Figure 1 and especially the striking lack of any new activity waves in the last twenty years. This figure shows that any genuine innovations that arose after 1995 failed to first exemplify a USPC class; hence, they are exactly the sort of false negatives created by the emergence problem.

These false negatives are dramatized in the patent time perspective shown in the bottom panel of Figure 1, which shows that USPC pigeonholes detect no new innovations in the final third of all of the patents issued since 1845. Many actual innovations occurred but were undetected by USPC classes.

One could define more pigeonholes by considering combinations of existing pigeonholes. For example, the number of pairs of n pigeonholes is n choose 2, and the number of triples is n choose 3. Even after every pigeonhole has been exemplified, plenty of pairs and triples of pigeonholes will still have no instances. So, one could detect innovations by looking for the first examples of pairs and triples of pigeonholes. This might detect innovations that are combinations of earlier innovations, and plenty of innovations build on and combine earlier innovations. Perhaps the first airplanes were innovations that mainly combined prior technologies from bicycles and kites. However, some innovations seem to be more than mainly combinations of earlier innovations. Some strikingly novel innovations might mainly involve fundamentally new ideas, new materials, new purposes, and new methods. The emergence of these kinds of unanticipated technological innovations is partly what forces the USPTO to revise the USPC now and then. In fact, 40% of all US patents issued in 1976 have by now been reclassified at least once [18].
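
The combinatorial idea in this paragraph can be made concrete with a short sketch. It is only an illustration: the class count `n = 400` is a hypothetical round number (not the actual size of the USPC), the function name `new_pairs_per_patent` is invented here, and each patent is represented simply as the set of class IDs assigned to it.

```python
from itertools import combinations
from math import comb

def new_pairs_per_patent(patents):
    """For each patent in issue order, return the class pairs it exemplifies first.

    patents is a sequence of sets of class IDs. A pair counts as newly
    exemplified only the first time both classes appear on the same patent.
    """
    seen = set()
    result = []
    for classes in patents:
        pairs = set(combinations(sorted(classes), 2))
        fresh = pairs - seen
        seen |= fresh
        result.append(fresh)
    return result

n = 400                   # hypothetical number of pigeonholes
num_pairs = comb(n, 2)    # n choose 2
num_triples = comb(n, 3)  # n choose 3

# Toy stream of three patents, each with one or more class IDs.
stream = [{172, 56}, {172, 56, 83}, {83, 56}]
fresh_pairs = new_pairs_per_patent(stream)
```

Even for this modest hypothetical `n`, the number of candidate pairs and triples far exceeds the number of single pigeonholes, which is why many combinations would remain unexemplified.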

Table 3. 
The final USPC classes as of 2015 that are heavily instantiated. In the bottom panel of Figure 1 these classes correspond to the last three activity waves (illustrated in black) that shoot up in the 1990s, indicating many additions to those classes in those years.
Final heavily instantiated USPC classes (black in Figure 1)

| ID | Class name | No. of patents | Issued |
| --- | --- | --- | --- |
| 719 | Electrical computers and digital processing systems: interprogram communication or interprocess communication (IPC) | 5115 | 1972 |
| 506 | Combinatorial chemistry: method, library, apparatus | 1653 | 1974 |
| 726 | Information security | 18375 | 1974 |

4 Detecting Innovations with Technology Clusters

4.1 Clusters in Technology Feature Space

There is another strategy for detecting unanticipated innovations, which ignores USPC pigeonholes and instead uses a fundamentally different kind of classification of technology, recently developed in [23]: a classification in which the classes can change and evolve over time. We illustrate this strategy with patent records from the years 1976–2014, roughly 5 million patents. We divide the data into 100 time frames with ≈50,000 patents in each time frame. Because patents are unevenly distributed over time (the number of patents issued per year is constantly increasing), the time frames have unequal temporal extents. We use patent time instead of real time for the time frames simply to have uniform statistical estimation errors across our time chunks, following [29].
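
The difference between patent time and real time can be seen in a small sketch. This is a hedged illustration rather than the authors' pipeline: the function name `patent_time_frames`, the tuple layout `(issue_date, patent_id)`, the toy dates, and the choice to keep a final partial frame are all assumptions made here.

```python
def patent_time_frames(patents, frame_size=50_000):
    """Split patents into consecutive frames of frame_size patents each.

    patents is a list of (issue_date, patent_id) tuples, with dates as
    ISO strings so that string order equals chronological order.
    The last frame may hold fewer than frame_size patents.
    """
    ordered = sorted(patents, key=lambda p: p[0])
    return [ordered[i:i + frame_size] for i in range(0, len(ordered), frame_size)]

# Hypothetical toy data: issuance accelerates over time.
toy = [
    ("1976-01-06", "A"), ("1976-03-02", "B"), ("1977-05-10", "C"),
    ("1977-06-14", "D"), ("1977-07-19", "E"), ("1977-08-23", "F"),
    ("1977-09-27", "G"),
]
frames = patent_time_frames(toy, frame_size=3)

# Calendar span (first date, last date) covered by each frame.
spans = [(f[0][0], f[-1][0]) for f in frames]
```

Each full frame holds the same number of patents, but the first frame spans over a year while the second spans about two months, mirroring the unequal temporal extents described above.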

After dividing patent data into time frames, this approach takes three steps to achieve a classification: (i) from all patents in the temporal sequence of frames, use the algorithm doc2vec to construct a semantic embedding of each patent into a 300-dimensional vector space that we will call technology feature space; (ii) perform a spherical k-means clustering of all patents in each frame; and (iii) identify temporal threads of clusters that are near to each other (within a certain distance threshold). We use cluster threads to define classes of technology, as an alternative to USPC pigeonholes. New classes are produced automatically by the procedure, when a new data frame has a new cluster that is far away from all existing clusters. The first two steps are illustrated in Figure 2, and the third step is the formation of technology classes by clustering centroids across time. The following discussion treats these three steps in more detail.

Figure 2. 

A schematic of the procedure executed for each of the time frames (100 frames covering the period 1976–2014), resulting in 25 centroids per frame positioned in the 300-dimensional technology feature space. Subsequently, the 2500 centroids across all time frames are themselves linked into temporal threads using contiguity clustering, with a distance threshold, using the cosine similarity metric in technology feature space.

The first step is the creation of the technology feature space, by embedding each patent into a 300-dimensional semantic vector space. The embedding is obtained by using publicly available topic modeling software, doc2vec [20, 25]. Construction of the embedding is driven by the empirical contextual word co-occurrence statistics amassed by analyzing millions of patent records. The specific data used by doc2vec are the title and abstract only (future work will include the body text), after a preprocessing step that constructs a vocabulary from all documents, including identification of n-grams up to length 6.

These patent descriptions are embedded in this space with doc2vec, an algorithm designed to optimize the principle that the proximity of two patent documents in the space is proportional to the semantic similarity of the two documents. The algorithm doc2vec embeds both words and documents in the technology feature space, using a neural net trained to predict words in the texts. Since our documents are descriptions of newly patented inventions, the semantic similarity of two documents roughly corresponds to the similarity of the inventions described in the two documents. A metric on the technology feature space provides a precise, quantitative measure of the similarity of any two embedded inventions. This proportionality principle enables the embedding space to function as a technology feature space, with different regions in it corresponding to different kinds of technologies. The classes in our classification are the different regions of technology feature space in which patents tend to cluster, and those regions change and evolve over time.

The doc2vec algorithm represents the current state of the art for semantic embedding, and it uses neural networks trained (on the five-million-record data set) to predict word occurrences within each document. Previous approaches to creating semantic embeddings typically do not use predictive neural networks (e.g., [13, 30]), and several specifically aim at patent analysis [1, 7, 10, 15, 17]. Comparing and evaluating semantic embedding algorithms is an ongoing endeavor [19]. Our methods easily generalize to these other embeddings.

Clustering points in technology feature space may be regarded as a form of topic modeling (each cluster is a topic); future work may use technology feature space embedding with other topic modeling techniques such as latent Dirichlet allocation methods [22].

The second step uses the embedding of each patent, now precisely located in technology feature space, to perform a clustering in each frame using the spherical k-means clustering algorithm, with k = 25. The clustering is performed by open source machine learning software [24]. Spherical k-means is used because it is based on the cosine similarity metric, which measures the distance between two points as the angle subtended by the points. Cosine similarity is the natural metric because the doc2vec algorithm projects points onto the surface of a sphere. Choice of k-means clustering is a metaparameter of the procedure. It was chosen primarily because it is fast, and it gives explicit control over cluster granularity. The clustering is clearly imperfect, because it contains visible noise (see discussion below). Other clustering methods (e.g., hierarchical clustering) were examined briefly, but they either suffered from intractable computation time or inability to identify clear cluster structure.
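
The core loop of spherical k-means can be sketched in plain Python. This is a toy illustration, not the open source implementation cited above: initialization from the first k points, the fixed iteration count, and the two-dimensional data are simplifying assumptions, whereas the analysis in the text uses k = 25 in a 300-dimensional space.

```python
import math

def normalize(v):
    """Scale a nonzero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    """Dot product; equals cosine similarity for unit vectors."""
    return sum(x * y for x, y in zip(a, b))

def spherical_kmeans(points, k, iterations=20):
    """Cluster points on the unit sphere by cosine similarity.

    Assumption: centroids are seeded with the first k points, which keeps
    the sketch deterministic; practical implementations seed differently.
    """
    unit = [normalize(p) for p in points]
    centroids = [list(p) for p in unit[:k]]
    labels = [0] * len(unit)
    for _ in range(iterations):
        # Assignment: each point joins the centroid with highest cosine similarity.
        labels = [max(range(k), key=lambda j: dot(p, centroids[j])) for p in unit]
        # Update: each centroid becomes the normalized sum of its members.
        for j in range(k):
            members = [p for p, lab in zip(unit, labels) if lab == j]
            if not members:
                continue  # keep the old centroid for an empty cluster
            total = [sum(col) for col in zip(*members)]
            if any(total):
                centroids[j] = normalize(total)
    return labels, centroids

# Toy data: two directions in the plane.
points = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
labels, centroids = spherical_kmeans(points, k=2)
```

Because both points and centroids are kept at unit length, the dot product is the cosine similarity, so only the direction of each embedded patent matters, not its magnitude.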

Clustering produces 100 successive frames of 25 clusters of newly issued patents. Each cluster occurs in a specific frame, and clusters in different frames can be in different locations. This makes it easy to observe if any clusters have moved. The 100 frames of clusters are a precise description of how each cluster in technology feature space has changed and evolved.

The third and final step of our classification procedure records how clusters change and move in technology feature space, by connecting nearby clusters into threads (temporal trajectories in the space). These threads consist of clusters of patents that are near one another in technology feature space; specifically, the nearest centroid to any centroid in a thread is some other centroid in the same thread. For example, given a network in which the nodes are the 2500 centroids in 100 frames of 25 centroids, and the edges connect any pair of nodes that are closer in technology feature space than some predefined distance threshold (dth), the network's connected components are the cluster threads (trajectories) into which we classify patents. Since threads of centroids are parameterized by dth, changing dth can change which centroids are in which threads.

Each cluster thread defines a different, non-overlapping local region of technology feature space. The regions are local in the sense that the centroids in them are always nearer than dth to some other centroids in the thread. At the same time, two centroids in a thread can be far apart if they are connected by a long enough sequence of pairs of nearby centroids. Still, centroids in a thread are always nearest in technology feature space to other centroids in the thread. On the other hand, centroids can be temporally far away from every other centroid in the thread. What connects centroids in a thread is proximity in technology feature space specifically, not necessarily proximity in time.
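
The threading step, as described, amounts to finding connected components in a graph of centroids. The sketch below uses a small union-find structure to do this. It is a hedged illustration: the function names are invented here, and taking the distance to be one minus the cosine similarity is an assumption, since the text specifies only that the cosine similarity metric and a threshold of 0.04 are used.

```python
import math

def cosine_distance(a, b):
    """Assumed distance: 1 - cosine similarity of two nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

def cluster_threads(centroids, d_th=0.04):
    """Label each centroid with the ID of its connected component (thread)."""
    n = len(centroids)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair of centroids that lies within the threshold.
    for i in range(n):
        for j in range(i + 1, n):
            if cosine_distance(centroids[i], centroids[j]) < d_th:
                parent[find(i)] = find(j)

    # Relabel roots as consecutive thread IDs in order of first appearance.
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]

def unit(degrees):
    """Unit vector in the plane at the given angle (toy helper)."""
    r = math.radians(degrees)
    return [math.cos(r), math.sin(r)]

# Toy centroids: a chain at 0, 12, and 24 degrees, plus an outlier at 90.
toy_centroids = [unit(0), unit(12), unit(24), unit(90)]
threads = cluster_threads(toy_centroids)
```

In the toy data the centroids at 0 and 24 degrees are farther apart than the threshold, yet they land in the same thread because the centroid at 12 degrees links them, which is the chaining behavior noted above.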

Every patent issued between 1976 and 2014 is located in technology feature space nearest to one of the 25 clusters in one of our 100 frames of data, and that cluster is in one of the threads of nearby clusters of patents. We classify a patent into the thread that contains its nearest cluster.
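
Classifying a single patent then reduces to a nearest-centroid lookup followed by a thread lookup. The following sketch assumes that the nearest cluster is sought among the centroids of the patent's own frame, which is an interpretation of the text; the function names, the toy centroids, and the thread IDs are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def classify_patent(vector, frame_centroids, centroid_thread):
    """Return the thread ID of the centroid nearest to the patent vector.

    frame_centroids: centroids of the patent's frame (assumed scope).
    centroid_thread: thread ID of each centroid, in the same order.
    """
    nearest = max(
        range(len(frame_centroids)),
        key=lambda j: cosine_similarity(vector, frame_centroids[j]),
    )
    return centroid_thread[nearest]

# Toy frame with three centroids; two of them belong to thread 7.
frame_centroids = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
centroid_thread = [7, 3, 7]

thread_a = classify_patent([0.2, 0.9], frame_centroids, centroid_thread)
thread_b = classify_patent([-0.8, 0.1], frame_centroids, centroid_thread)
```

Note that two different centroids in one frame can map to the same thread, so the thread, not the individual cluster, is the class.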

Note that our three-step procedure of embedding, clustering, and threading involves choosing various metaparameters, such as the number of patents per time chunk (50,000 here), the dimension of technology feature space (300 here), the number of clusters (25 here), and the distance threshold, dth, defining centroid connectedness (0.04 here). In addition there are several smaller-scale metachoices made in construction of the vocabulary (stemming rules, stop words, threshold frequencies, etc.). Larger-scale metachoices include the choice of clustering algorithm (spherical k-means here), and even the choice of the semantic metric to measure distance between documents (cosine distance on doc2vec embedding here). We have varied all metaparameters enough to be confident that the results we discuss hold broadly over variation of metaparameters. Details of particular keywords associated with clusters, connectedness of clusters, and precise timing of new innovative clusters may change, but the overall structure of evolving technological categories typically remains quite similar.

4.2 Results

We analyzed one hundred frames of 50,000 successively issued US patent documents published from 1976 to 2015, first clustering the documents in each frame, and then connecting nearby clusters into temporal threads. Those threads define distinct regions of technology feature space, and those technology subspaces define our classification of technology.

Figure 3 shows all the centroids in the thirty largest threads of clusters, with the columns corresponding to frames of data and the rows corresponding to threads. Each column contains the twenty-five centroids from a given time frame, and each row contains the centroids in a given thread. Centroids are assigned to the same temporal thread if they are nearby (within the distance threshold, dth = 0.04). The first frame contains the centroids in the leftmost column, and successive frames of twenty-five centroids are added to the right. The subregion of technology feature space defined by a thread can change and grow as more and more nearby centroids are added to the thread. Some threads contain many centroids, while others contain only a few, and threads have been colored from largest (blue) to smallest (red). Only threads with at least four centroids are shown. Centroids in any shorter threads are collected in the black unnumbered thread at the bottom. In our technology classification the black thread is an “other” category that collects all otherwise unclassified patents.

Figure 3. 

A lineup of all of the temporal threads of clusters used to classify technology, showing one hundred successive time frames on the x axis, each with twenty-five centroids (colored cells in each column), spanning the years 1976–2014. Centroids in a thread appear in the row(s) identified with a cluster ID. All centroids in each thread have the same color, with a hue (blue to red) indicating the number of clusters in the thread (many to few). Light gray cells indicate temporal gaps in threads, that is, frames with no cluster assigned to that thread. Only threads with at least four centroids are shown; centroids in any shorter threads are collected in the black thread at the bottom.

A number of patterns are evident in the thread lineup. Many cluster threads in the lineup persist through most or all of the data frames. This includes the first eleven threads in the lineup, as well as some that only appear in later frames, such as thread 27. Some threads are so large that more than one centroid in a given data frame is assigned to the same thread. For example, threads 4 and 6 are each assigned two centroids in the first frame, and thread 1 is assigned three centroids in the last frame. Threads vary in size, that is, in the number of centroids in the thread. We also observe that some threads generally grow in successive data frames (e.g., thread 1), some wax and wane in size (e.g., thread 2), and others generally shrink (e.g., thread 6). Threads can have gaps (shown in light gray) when none of the twenty-five centroids in a frame is assigned to the thread, and some threads contain many gaps (e.g., threads 9, 10, and 18). Gaps can become so prevalent that a thread peters out and is eventually assigned no more centroids, and the thread is considered to have died. For example, thread 11 is diminishing, and thread 16 has entirely died out. Finally, we also see many examples of the emergence of new threads. These are technological innovations: clusters of new inventions that are relatively distant from all clusters in previous data frames. Technological innovations evident in the lineup include new threads like 36 and 45 that are dense with centroids, as well as new threads like 42 in which centroids are more sparse.

All of the patterns evident in Figure 3—persisting threads of clusters, some shrinking and going out of existence, others emerging and growing—appear to be robust; even if metaparameters like size of data frames or number of clusters are changed, the generic pattern remains the same. As successive data frames are analyzed, the thread lineup continues to grow in size as old threads die and new threads emerge. Over time the threads of clusters in technology feature space evolve unpredictably, and the emergence of new clusters appears to be open-ended.

To understand the distinctive nature of each technology cluster thread, we take advantage of the fact that words and patent documents are both embedded in the same technology feature space. We filter the dictionary of words and n-grams in the patent documents to find words that occur in the patent corpus at least N times (N = 1000), locate the M nearest words to each centroid (M = 30), and select the L most common words among all the centroids in a thread (L = 11). Table 4 shows the nearest shared words for ten of the threads in Figure 3, along with each thread's ID number and a short verbal tag, written by a human, that summarizes the central category of technology described by the shared words. The word lists include a number of n-grams and acronyms; for example, the communications thread's nearest shared words include the bigram “instant_messag” and the acronym “PSTN” (public switched telephone network). The word lists also include some puzzles, such as the trigram “relat_hybrid_soybean” listed under the thread for medical treatments (thread 9) and the “sausag” and “shrimp” stems listed under thread 16 for sheets and folds. A number of other threads have word lists with related puzzles.
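The word-selection procedure just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `nearest_shared_words`, the dictionary inputs, the toy two-dimensional vectors, and the choice of cosine distance are all hypothetical. Only the thresholds N = 1000, M = 30, and L = 11 come from the text.

```python
import math
from collections import Counter

def cosine_distance(u, v):
    """Cosine distance between two non-zero vectors (metric is an assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def nearest_shared_words(centroids, word_vectors, word_counts,
                         min_count=1000, m_nearest=30, l_shared=11):
    """Return the l_shared words that most often appear among the
    m_nearest words of each centroid in one thread."""
    # Step 1: keep only words that occur at least min_count times (N).
    vocab = [w for w in word_vectors if word_counts.get(w, 0) >= min_count]
    tally = Counter()
    for c in centroids:
        # Step 2: find the m_nearest vocabulary words to this centroid (M).
        ranked = sorted(vocab, key=lambda w: cosine_distance(c, word_vectors[w]))
        tally.update(ranked[:m_nearest])
    # Step 3: keep the words shared by the most centroids in the thread (L).
    return [w for w, _ in tally.most_common(l_shared)]

# Toy data (hypothetical): four words and a thread with two centroids.
word_vectors = {"pstn": (1.0, 0.1), "isdn": (0.9, 0.2),
                "shrimp": (0.0, 1.0), "rare": (1.0, 0.0)}
word_counts = {"pstn": 5000, "isdn": 3000, "shrimp": 2000, "rare": 10}
thread = [(1.0, 0.0), (0.95, 0.15)]
shared = nearest_shared_words(thread, word_vectors, word_counts,
                              m_nearest=2, l_shared=2)
print(shared)
```

In the toy data the word "rare" lies closest to the first centroid but is filtered out by the frequency threshold, which is the role the N = 1000 cutoff plays in the procedure above.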

Table 4. 
Information about some of the cluster threads shown in Figure 3: ID number, a short verbal tag produced by a human, and the most common words and n-grams among those closest to the centroids in each thread.
Cluster threads (10 examples)

| ID | Human tag | Nearest shared words |
| --- | --- | --- |
| 1 | communication | persist requestor call_parti callback pstn publish affili instant_messag subscript isdn anonym voic_mail |
| 2 | semiconductors | oxynitrid hard_mask interlay_dielectr epi silicon_oxynitrid epitaxi_grown ono dope_polysilicon gaa inp shallow_trench_isol |
| 9 | medical treatments | vertebr canin dementia prophylaxi_treatment arthriti psoriasi multipl_sclerosi relat_hybrid_soybean basal rand_rare_independ pig |
| 10 | optics | monochromat beamsplitt infra_red achromat confoc dichroic meniscu_len circularli_polar blue linearli_polar near_infrar |
| 11 | fossil fuels | crude ethan propanediol aliphat_alcohol monohydr alkanol butanol sulfit trimethyl mother_liquor glycerin |
| 16 | sheets and folds | newspap windrow stalk cop sausag stacker bank_note booklet cooki carton_blank shrimp |
| 22 | drugs | quinolin cyclohexyl psoriasi propylen_glycol pyrazol interleukin terpen methoxi exemplifi anti_inflammatori_agent propanediol |
| 36 | imaging | black_white grayscal pictori binar handwritten grai panoram oct stereoscop blur monochromat |
| 42 | molecular genetics | leukemia interleukin nucleotid_sequenc_encod reductas polypeptid_polynucleotid polynucleotid_encod polypeptid_nucleic_acid lectin dna_encod polynucleotid_polypeptid rat |
| 45 | programming hardware | microinstruct instruct_fetch ecc dma prefetch microprogram microcod specul snoop semaphor fifo |

These keywords and verbal tags suggest explanations for the thread patterns visible in Figure 3. Well-known major categories of technologies tend to produce persisting threads, such as communications (thread 1), semiconductors (thread 2), medical treatments (thread 9), optics (thread 10), and fossil fuels (thread 11). The figure also indicates significant growth in communications technologies, while semiconductor technologies first waxed and then waned, fossil fuel technologies generally diminished, and technologies concerning sheets and folds (thread 16) became completely absent in the final two-thirds of the frames.

Similarly, the keywords and verbal tags of the newly emerging threads in Table 4 also suggest why those threads emerged: The threads on imaging (thread 36), molecular genetics (thread 42), and programming hardware (thread 45) all emerge only after many frames of data, and all of these technologies are commonly listed among the major technological innovations in the past forty years.

The thread lineup in Figure 3 reveals the emergence of many new cluster threads, and Table 4 describes ten examples. A limitation of our methodology is that a new thread in our lineup might not indicate when a new technological innovation first arises.

Our method reveals the emergence of new clusters only after they have become one of the twenty-five largest clusters in some data frame. So new clusters can remain undetected, below the radar, for many years. Furthermore, since our data start in 1976, they will not detect clusters of inventions from earlier years.

Each cluster thread marks a distinct subregion of technology feature space, and each of these subregions has a distinct precise location in this 300-dimensional space. Since it is difficult to visualize high-dimensional structures like clusters, Figure 4 shows a two-dimensional projection of the 2500 cluster centroids in 300-dimensional feature space, produced with the t-SNE algorithm [21]. This algorithm does an especially good job of reflecting the local structure in very high-dimensional spaces. Each dot in the projection shows the location of one of the 2500 centroids in our lineup, and the centroids in each of the ten threads in Table 4 are given distinctive colors (the remaining centroids are white).

Figure 4. 

The 2500 centroids in 300-dimensional technology feature space are projected to two dimensions using t-SNE, with 25 centroids in each of 100 time frames spanning 1976–2014. The ten threads identified in Table 4 are colored and numbered.

The t-SNE projection confirms that cluster threads are located in technology feature space in readily identifiable groups, corresponding to distinct subregions of the space. The ten threads analyzed in Table 4 match ten of those groups, and the other threads all correspond to other readily identifiable groups in Figure 4. In the t-SNE projection the centroids within a thread are closer to each other than to other centroids not in the thread. The largest threads (e.g., 1 and 2) are indicated by the largest colored centroid groups, and the smallest threads are indicated by the smallest groups (42 and 16).

In Figure 5 the 2500 centroids in the t-SNE projection are displayed against a background of contour lines and a landscape color gradient that indicate the kernel density estimate of the centroids. In order to reveal temporal patterns like births and deaths of cluster threads, we color each centroid in the t-SNE projection to encode its time frame in grayscale, from black (earlier frames) to white (later frames). This enables the temporal sequence of centroids within a cluster thread to be identified from the shades of gray coloring the centroids in a thread. If the centroids within a thread move randomly in technology feature space, then the dots within a group will appear with randomly distributed shades of gray. On the other hand, if the centroids are moving nonrandomly in 300-dimensional technology feature space, a t-SNE projection of the thread's trajectory could be revealed by an identifiable two-dimensional trajectory of successively lighter shades of gray. In particular, a t-SNE projection with such a trajectory is hard to explain except as a nonrandom, directional trajectory in feature space.

Figure 5. 

A two-dimensional t-SNE projection of the centroids of clusters of patents issued during the years 1976–2014. Contour lines and the landscape color gradient indicate the kernel density estimation for the centroids. One hundred frames of 25 centroids are overlaid. Centroids in each frame have the same color, and grayscale represents time; earlier (later) centroids are darker (lighter).

The thread dynamics seen in the thread lineup (Figure 3) are also evident in Figure 5: Some threads persist, some become extinct, some new ones come into existence. We can see that the threads on communications and semiconductors (threads 1 and 2) persist through all of our frames of data, while thread 16 on sheets and folds dies out after some black and dark gray centroids. By contrast, innovations like thread 36 for imaging and thread 42 for molecular genetics appear as groups of lighter-colored centroids, and many further innovations are also evident. The trajectories in technology feature space seen in Figure 5 indicate the ongoing emergence of new clusters in the space, which is a form of OEE.

Figure 5 also reveals another kind of change: nonrandom movement over time of the threads. The t-SNE projection of the centroids within thread 36 on imaging, for example, shows clear directional movement, as does thread 10 on optics. The t-SNE projection of thread 9 for medical treatments shows a distinctive kind of circular movement that is evident in a number of other groups of white centroids in Figure 4. Figure 5 reveals certain further kinds of dynamics within some of the largest threads. For example, thread 1 on communications shows directional movement over time, and it splits into three subgroups. Similarly, thread 2 on semiconductors has two distinct subthreads that each move nonrandomly.

Figure 6 (top) shows the evolutionary activity of each thread of technology clusters that arose during and after 1976 and eventually contains at least four centroids. The same figure (middle) clearly shows the ongoing generation of new innovations, as pioneering clusters of patents arise in unexplored regions of technology feature space and new activity waves start accumulating exemplifications. By construction, any centroid in a thread of centroids is farther away than the threading distance threshold $d_{\mathrm{th}} = 0.04$ from all centroids outside the thread. The distance distributions in Figure 6 (bottom) show that many of those innovations arise well beyond this threshold: about a quarter of them arise in remote regions of technology feature space, farther away than $4 \times d_{\mathrm{th}} = 0.16$ from any centroid in any other thread.
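The remoteness test behind the "about a quarter" figure can be illustrated with a short sketch. It computes the minimum, mean, and maximum distance from the first centroid of a new thread to a set of other centroids, then checks whether the nearest one lies beyond four times the threshold. The function names, the toy two-dimensional points, and the use of Euclidean distance are assumptions made for illustration; the source gives only the threshold value 0.04 and the factor of 4.

```python
import math

D_TH = 0.04  # threading distance threshold quoted in the Figure 6 caption

def euclidean(u, v):
    """Euclidean distance (the metric used in the paper is not stated here)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def novelty_stats(first_centroid, other_centroids, dist=euclidean):
    """Min, mean, and max distance from a thread's first centroid
    to the centroids it is compared against."""
    distances = [dist(first_centroid, c) for c in other_centroids]
    return min(distances), sum(distances) / len(distances), max(distances)

def is_remote(first_centroid, other_centroids, factor=4, d_th=D_TH):
    """True if the nearest other centroid is farther than factor * d_th."""
    nearest, _, _ = novelty_stats(first_centroid, other_centroids)
    return nearest > factor * d_th

# Toy example (hypothetical points in a 2-d feature space).
earlier = [(0.0, 0.0), (0.3, 0.0)]
far_start = (0.0, 0.2)    # nearest earlier centroid is about 0.2 away
near_start = (0.0, 0.1)   # nearest earlier centroid is about 0.1 away
print(is_remote(far_start, earlier), is_remote(near_start, earlier))
```

In this toy case both starting centroids are farther than 0.04 from every earlier centroid, so both would begin new threads, but only the first clears the stricter cutoff of 4 × 0.04 = 0.16.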

Figure 6. 

Top: Evolutionary activity of cluster threads that are first exemplified after the first frame and eventually contain at least four centroids, over the 100 time frames covering 1976–2014. Middle: Blowup of the bottom 10% of the y axis above, to see the emergence of activity waves that indicate new innovative threads. Bottom: Temporal sequence of distributions of distances (min, mean, and max) between the first centroid in a thread and all other centroids in the same or earlier frames. A black dashed line is shown at $d_{\mathrm{th}} = 0.04$, the distance threshold used to connect centroids into threads of centroids. Three specific examples of innovative threads in Table 4 are given distinct colors in these plots: thread 36 on imaging (black), thread 42 on molecular genetics (orange), and thread 45 on programming hardware (green).

4.3 Discussion

Together, Figures 3, 5, and 6 present detailed empirical evidence for the ongoing generation of new and persisting clusters of technology, a clear example of technology's open-ended evolution. Figure 6 (middle and bottom) shows dozens of significant innovations that arise after frame 30, which contains clusters of patents issued after 1995. But recall that Figure 1 shows that no USPC classes were first exemplified after 1995; that is, the 450 predefined classes in the USPC detect none of the innovations in Figure 6 (middle and bottom) that arise after frame 30. Those innovations are all false negatives for the USPC.

Threads of clusters grow and change over time, and so does our classification. We can observe threads of clusters persisting, changing, and moving nonrandomly in technology feature space. Threads come into existence, split, merge, and go out of existence, and at different times different threads are exemplified. Because the threads of clusters automatically fit each new crop of innovations, the classification automatically adapts and evolves to cover innovations no matter where they emerge in technology feature space.

A weakness of classifications based on pigeonholes can be the biases, preconceptions, and other epistemic shortcomings of the human experts that define the pigeonholes, but classifications grounded in statistically defined technology clusters have many fewer such problems. The emergence of technology clusters presupposes only a suitable technology feature space, constructed from word patterns in millions of patent records. Those patent records were authored by millions of different human beings, and those authors no doubt each have individual epistemic limitations. But the construction of technology feature space averages out those individual quirks. On the other hand, if there are biases or other epistemic limitations shared by millions of patent authors, our technology clusters should reflect them, because semantic vectors built from standard corpora of text from the web have been shown to retain imprints of many known human biases [12].

5 A General Method for Detecting Unanticipated Innovations

Attempts to detect OEE empirically have been vexed by the emergence problem: how to detect innovations when you do not know their distinctive features. This problem is especially acute if you classify and describe technologies using a predefined list of pigeonholes, such as those used by the USPC. We have described an evolving classification of technology that enables a natural method for detecting unanticipated innovations, and we have illustrated this method by detecting the ongoing generation of new technological innovations in millions of patent records amassed since 1976. Some of those innovations are not detected by newly exemplified USPC classes; these undetected innovations are the false negatives expected with predetermined pigeonholes. The number of false negatives will typically grow until the pigeonholes are revised and adjusted.

We have proposed a process to analyze an evolving stream of patent data that largely eliminates the problem of detecting unanticipated innovations. Our process for classification begins by first embedding successive frames of patents in technology feature space, then clustering each frame of patents, and finally threading centroids of nearby clusters. This is a practical and feasible solution to the problem of detecting and classifying new and emerging technological innovations. The resulting classification is pragmatic, and it will continually and automatically adapt to unanticipated changes in the incoming stream of data. Significantly new kinds of inventions will produce new threads of patents that can be revealed by the process of embedding, clustering, and threading. The process has a handful of metaparameters that can be fine-tuned further for best results—for example, the number of clusters used (currently k = 25), the size of each temporal frame of data (currently 50,000 patents), and the dimension of the embedding space (currently 300).
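The final threading step can be sketched as follows, assuming each frame has already been embedded and clustered into centroids. The sketch treats threads as connected components of a graph that links any two centroids no farther apart than the threshold, which is one simple way to guarantee that every centroid in a thread is farther than the threshold from all centroids outside it. The function name `thread_centroids`, the Euclidean metric, the decision to link centroids across any pair of frames, and the toy data are assumptions; the authors' actual threading rule may differ.

```python
import math
from itertools import combinations

def euclidean(u, v):
    """Euclidean distance (an assumed metric for this sketch)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def thread_centroids(frames, d_th=0.04, dist=euclidean):
    """Group centroids into threads, defined here as connected components
    of the graph linking any two centroids at distance <= d_th.
    Returns a list, parallel to frames, of lists of thread IDs."""
    points = [(t, c) for t, frame in enumerate(frames) for c in frame]
    parent = list(range(len(points)))  # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Link every pair of centroids that lie within the threshold.
    for i, j in combinations(range(len(points)), 2):
        if dist(points[i][1], points[j][1]) <= d_th:
            parent[find(j)] = find(i)

    # Number the threads in order of first appearance.
    ids = {}
    result = [[] for _ in frames]
    for i, (t, _) in enumerate(points):
        root = find(i)
        ids.setdefault(root, len(ids))
        result[t].append(ids[root])
    return result

# Toy example (hypothetical): three frames of two centroids each.
frames = [
    [(0.00, 0.0), (1.0, 1.00)],
    [(0.01, 0.0), (1.0, 1.02)],
    [(0.02, 0.0), (0.5, 0.50)],
]
threads = thread_centroids(frames)
print(threads)
```

In the toy data the first centroid of each frame drifts slowly and stays in one thread, the second thread has no centroid in the last frame (a gap or a death), and the isolated centroid at (0.5, 0.5) starts a new thread, which is how an innovation would show up. The pairwise loop is quadratic in the number of centroids, which is manageable for the 2500 centroids discussed here.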

Our method of detecting innovations has been presented in the context of the stream of data produced by patents, and the innovations are technological innovations. The process of embedding documents, clustering, and threading connected clusters is rather specific to the particular context of the stream of patent data. More generally, for any process that produces ongoing innovations, we assume that sequentially in time the process produces data, $\{\ldots, D_{t-2}, D_{t-1}, D_t\}$, that somehow characterize the innovations. With each new frame of data we apply in parallel a learning algorithm that produces a classification of all of the most recent data, $\{\ldots, A_{t-2}, A_{t-1}, A_t\}$. The classification at time $t$ may depend on all previous data: $A_t = A_t(\ldots, D_{t-2}, D_{t-1}, D_t)$. In the case of the patents, $D_t$ is the set of 50,000 patents in temporal data frame $t$, and $A_t$ is the classification produced by embedding, clustering, and threading.
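The general scheme, in which each classification may depend on the whole history of data frames, can be written as a small generic loop. This is only a sketch of the abstract formulation: the name `classify_stream`, the type variables, and the toy learner that simply collects the categories seen so far are hypothetical stand-ins for whatever data and learning algorithm suit a given system.

```python
from typing import Callable, Iterable, Iterator, List, Sequence, TypeVar

D = TypeVar("D")  # one frame of data, D_t
A = TypeVar("A")  # one classification, A_t

def classify_stream(frames: Iterable[D],
                    learner: Callable[[Sequence[D]], A]) -> Iterator[A]:
    """Yield A_t = learner(..., D_{t-1}, D_t) as each new frame D_t arrives."""
    history: List[D] = []
    for frame in frames:
        history.append(frame)
        yield learner(history)

def categories_so_far(history):
    """Toy learner (hypothetical): the sorted set of labels seen in any frame."""
    return sorted({label for frame in history for label in frame})

# Toy stream of three data frames, each a list of category labels.
stream = [["a", "b"], ["b"], ["b", "c"]]
classifications = list(classify_stream(stream, categories_so_far))

# Categories that appear in A_t but not in A_{t-1} are the detected innovations.
new_at_last_step = set(classifications[-1]) - set(classifications[-2])
print(classifications, new_at_last_step)
```

For the patent data, the learner would be the embedding, clustering, and threading process, and a new thread would play the role of the new category.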

The method is, however, a very general way to solve the emergence problem for a wide range of other processes that produce innovations in an open-ended way, in biology, chemistry, culture, and beyond. In other contexts, the nature of the data $D_t$ will change, and the learning algorithm $A_t$ appropriate for finding structure in those data will also change. But for any data with underlying categorical structure, application of a learning algorithm should be possible. Up until now it has been difficult for observers of dynamically evolving systems to detect innovations before their distinctive characteristics have been discovered and mapped. Our methods now enable this problem to be addressed with empirical pragmatism in all these areas. The learning algorithm $A_t$ may also be used for additional purposes, such as forecasting changes in the stream of technological innovations [23].

References

1. Alstott, J., Triulzi, G., Yan, B., & Luo, J. (2017). Mapping technology space by normalizing patent networks. Scientometrics, 110(1), 443–479.
2. Arthur, W. B. (2009). The nature of technology: What it is and how it evolves. New York: Simon and Schuster.
3. Arthur, W. B., & Polak, W. (2006). The evolution of technology within a simple computer model. Complexity, 11(5), 23–31.
4. Bedau, M. A., & Packard, N. H. (1991). Measurement of evolutionary activity, teleology, and life. In C. Langton, C. Taylor, D. Farmer, and S. Rasmussen (Eds.), Artificial life II (pp. 431–461). Boston: Addison-Wesley.
5. Bedau, M. A., Snyder, E., Brown, C. T., & Packard, N. H. (1997). A comparison of evolutionary activity in artificial evolving systems and the biosphere. In P. Husbands and I. Harvey (Eds.), Proceedings of the Fourth European Conference on Artificial Life, ECAL97 (pp. 125–134). Cambridge, MA: MIT Press.
6. Bedau, M. A., Snyder, E., & Packard, N. H. (1998). A classification of long-term evolutionary dynamics. In C. Adami, H. Belew, H. Kitano, and C. Taylor (Eds.), Artificial life VI (pp. 228–237). Cambridge, MA: MIT Press.
7. Benson, C. L., & Magee, C. L. (2013). A hybrid keyword and patent class methodology for selecting relevant sets of patents for a technological field. Scientometrics, 96(1), 69–82.
8. Bergeaud, A., Potiron, Y., & Raimbault, J. (2017). Classifying patents based on their semantic content. PLoS ONE, 12(4), e0176310.
9. Boyack, K. W., Grafe, V. G., Johnson, D. K., & Wylie, B. N. (2002). Patent data mining method and apparatus. US Patent 6,389,418.
10. Boyack, K. W., Wylie, B. N., & Davidson, G. S. (2002). Domain visualization using VxInsight® for science and technology management. Journal of the American Society for Information Science and Technology, 53(9), 764–774.
11. Buchanan, A., Packard, N. H., & Bedau, M. A. (2011). Measuring the drivers of technological innovation in the patent record. Artificial Life, 17(2), 109–122.
12. Caliskan, A., Bryson, J. J., & Narayanan, A. (2017). Semantics derived automatically from language corpora contain human-like biases. Science, 356, 183–186.
13. Choi, S., Kim, H., Yoon, J., Kim, K., & Lee, J. Y. (2013). An SAO-based text-mining approach for technology roadmapping using patent information. R&D Management, 43(1), 52–74.
14. Hall, B. H., Jaffe, A. B., & Trajtenberg, M. (2001). The NBER patent citation data file: Lessons, insights and methodological tools (Technical report). National Bureau of Economic Research.
15. Heng, L., Yuan, Z., & Yufei, L. (2017). Technology hot spots and frontier identification support to 2035: The technology list adjustment method, using robot technology as an example. Strategic Study of Chinese Academy of Engineering, 19(1), 124–132.
16. Hill, F., Reichart, R., & Korhonen, A. (2015). Simlex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4), 665–695.
17. Kelly, B., Papanikolaou, D., Seru, A., & Taddy, M. (2018). Measuring technological innovation over the long run (Technical Report 25266). National Bureau of Economic Research.
18. Lafond, F., & Kim, D. (2017). Long-run dynamics of the U.S. patent classification system. arXiv:1703.02104v1.
19. Lau, J. H., & Baldwin, T. (2016). An empirical evaluation of doc2vec with practical insights into document embedding generation. arXiv:1607.05368.
20. Le, Q., & Mikolov, T. (2014). Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning (ICML-14) (pp. 1188–1196).
21. Maaten, L. van der, & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov), 2579–2605.
22. Moody, C. E. (2016). Mixing Dirichlet topic models and word embeddings to make lda2vec. arXiv:1605.02019.
23. Packard, N. H., Gigliotti, N., Nambiar, A., Janssen, T., & Bedau, M. A. (2018). Classification and prediction of evolving technology. [In preparation].
24. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
25. Rehurek, R., & Sojka, P. (2010). Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. Citeseer.
26. Solé, R. V., Valverde, S., Casals, M. R., Kauffman, S. A., Farmer, D., & Eldredge, N. (2013). The evolutionary ecology of technological innovations. Complexity, 18(4), 15–27.
27. Strumsky, D., Lobo, J., & Van der Leeuw, S. (2012). Using patent technology codes to study technological change. Economics of Innovation and New Technology, 21(3), 267–286.
28. Taylor, T., Bedau, M., Channon, A., Ackley, D., Banzhaf, W., Beslon, G., Dolson, E., Froese, T., Hickinbotham, S., Ikegami, T., McMullin, B., Packard, N., Rasmussen, S., Virgo, N., Agmon, E., Clark, E., McGregor, S., Ofria, C., Ropella, G., Spector, L., Stanley, K. O., Stanton, A., Timperley, C., Vostinar, A., & Wiser, M. (2016). Open-ended evolution: Perspectives from the OEE workshop in York. Artificial Life, 22(3), 408–423.
29. Valverde, S., Solé, R., Bedau, M., & Packard, N. (2007). Topology and evolution of technology innovation networks. Physical Review E, 76(5), 056118.
30. Wang, S., & Koopman, R. (2017). Clustering articles based on semantic similarity. Scientometrics, 111(2), 1017–1031.