Abstract
With the increasing rate of patent application filings, automated patent classification is of rising economic importance. This article investigates how patent classification can be improved by using different representations of the patent documents. Using the Linguistic Classification System (LCS), we compare the impact of adding statistical phrases (in the form of bigrams) and linguistic phrases (in two different dependency formats) to the standard bag-of-words text representation on a subset of 532,264 English abstracts from the CLEF-IP 2010 corpus. In contrast to previous findings on classification with phrases in the Reuters-21578 data set, for patent classification the addition of phrases results in significant improvements over the unigram baseline. The best results were achieved by combining all four representations, and the second best by combining unigrams and lemmatized bigrams. This article includes extensive analyses of the class models (a.k.a. class profiles) created by the classifiers in the LCS framework, to examine which types of phrases are most informative for patent classification. It appears that bigrams contribute most to improvements in classification accuracy. Similar experiments were performed on subsets of French and German abstracts to investigate the generalizability of these findings.
1. Introduction
Around the world, the patent filing rates in the national patent offices have been increasing year after year, creating an enormous volume of texts, which patent examiners are struggling to manage (Benzineb and Guyot 2011). To speed up the examination process, a patent application needs to be directed to patent examiners specialized in the subfield(s) of that particular patent as quickly as possible (Smith 2002). This preclassification is done automatically in most patent offices, but substantial additional manual labor is still necessary. Furthermore, since 2010, the International Patent Classification[1] (IPC) has been revised every year to keep track of recent developments in the various subdomains. Such a revision is followed by a reclassification of portions of the existing patent corpus, which is currently done mainly by hand by the national patent offices (Held, Schellner, and Ota 2011). Both preclassification and reclassification could be improved, and a higher consistency of the classifications of the documents in the patent corpus could be obtained, if more reliable and precise automatic text classification algorithms were available (Benzineb and Guyot 2011).
Most approaches to text classification use the bag-of-words (BOW) text representation, which represents each document by the words that occur in it, irrespective of their ordering in the original document. In recent decades much research has gone into expanding this representation with additional information, such as statistical phrases[2] (n-grams) or some form of syntactic or semantic knowledge. Even though (statistical) phrases are more representative units for classes than single words (Caropreso, Matwin, and Sebastiani 2001), they are so sparsely distributed that they have limited impact during the classification process. Therefore, it is not surprising that the best scoring multi-class, multi-label[3] classification results for the well-known Reuters-21578 data set have been obtained using a BOW representation (Bekkerman and Allan 2003). But the limited contribution of phrases in addition to the BOW representation does not seem to hold for all classification tasks: Özgür and Güngör (2010) found significant differences in the impact of linguistic phrases between short newswire texts (Reuters-21578), scientific abstracts (NSF), and informal posts in usenet groups (MiniNg): Especially the classification of scientific abstracts could be improved by using phrases as index terms. In a follow-up study, Özgür and Güngör (2012) found that for the three different data sets, different types of linguistic phrases have most impact. The authors conclude that more formal text types benefit from more complex syntactic dependencies.
In this article, we investigate whether similar improvements can be found for patent classification and, more specifically, which types of phrases are most effective for this particular task. We examine the value of phrases for classification by comparing the improvements that can be gained from extending the BOW representation with (1) statistical phrases (in the form of bigrams); (2) linguistic phrases originating from the Stanford parser (see Section 3.2.2); (3) aboutness-based[4] linguistic phrases from the AEGIR parser (Section 3.2.3); and (4) a combination of all of these. Furthermore, we will investigate the importance of different syntactic relations for the classification task, and the extent to which the words in the phrases overlap with the unigrams. We also investigate which syntactic relations capture most information in the opinion of human annotators. Finally, we perform experiments to investigate whether our findings are language-dependent. We will then draw some conclusions on what information is most valuable for improving automatic patent classification.
2. Background
2.1 Text Representations in Classification
Lewis (1992) was the first to investigate the use of phrases as index terms for text classification. He found that phrases generally suffer from data sparseness and may actually cause classification performance to deteriorate. These findings were confirmed by Apté, Damerau, and Weiss (1994). With the advent of increasing computational power and bigger data sets, however, the topic has been revisited in the last two decades (Bekkerman and Allan 2003).
In this section we will give an overview of the major findings in previous research on the use of statistical and syntactic phrases for text classification. Except when mentioned explicitly, all the classification experiments reported here were conducted using the Reuters-21578 data set, a well-known benchmark of 21,578 short newswire texts for multi-class classification into 118 categories (a document has an average of 1.24 class labels).
2.1.1 Combining Unigrams with Statistical Phrases
For an excellent overview of the work on using phrases done up to 2002, see Bekkerman and Allan (2003), and Tan, Wang, and Lee (2002).
Because they contain more specific information, one might think that phrases are more powerful features for text classification. There are two ways of using phrases as index terms: either as the only index terms or in combination with unigrams. All experimental results, however, show that using only phrases as index terms leads to a decrease in classification accuracy compared with the BOW baseline (Bekkerman and Allan 2003). Both Mladenic and Grobelnik (1998) and Fürnkranz (1998) showed that classifiers trained on combinations of unigrams and n-grams composed of at most three words performed better than classifiers that only use unigrams; no improvement was obtained when using larger n-grams. Because trigrams are sparser than bigrams, most of the subsequent research has focused on optimizing the combination of unigrams and bigrams using different feature selection techniques.
2.1.2 Feature Selection
Obviously, unigrams and bigrams overlap: Bigrams are pairs of unigrams. Caropreso, Matwin, and Sebastiani (2001) evaluated the relative importance of unigrams and bigrams in a classifier-independent study: Instead of determining the impact of features on the classification scores, they scored all unigrams and bigrams using conventional feature evaluation functions to find the features that are most representative for the document classes. For the Reuters-21578 data set, they found that many bigram features scored higher than unigram features. These (theoretical) findings were not confirmed in subsequent classification experiments, however. When the bigram/unigram ratio for a fixed number of features is changed to favor bigrams, classification performance tends to go down. It appears that the information in the bigrams does not render the unigrams redundant.
Braga, Monard, and Matsubara (2009) used a Multinomial Naive Bayes classifier to investigate classification performance with unigrams and bigrams by comparing multiview classification (the results of two independent classifiers trained with unigram and bigram features are merged) with monoview classification (unigrams and bigrams are combined in a single feature set).[5] They found that there is little difference between the output of the mono- and multiview classifiers. In the multiview classifiers, the unigram and bigram classifiers make similar decisions in assigning labels, although the latter generally yielded lower confidence values. Consequently, in the merge the unigram and bigram classifiers affirm each other's decisions, which does not result in an overall improvement in classification accuracy. The authors suggest combining unigrams only with those bigrams whose whole provides more information than the sum of their parts.
Tan, Wang, and Lee (2002) proposed selecting highly representative and meaningful bigrams based on the Mutual Information scores of the words in a bigram compared with the unigram class model. They selected only the top 2% of the bigrams as index terms, and found a significant improvement over their unigram baseline, which was low compared to state-of-the-art results. Bekkerman and Allan (2003) failed to improve over their unigram baseline when using similar selection criteria based on the distributional clustering of unigram models. Crawford, Koprinska, and Patrick (2004) were not able to improve e-mail classification when using the selection criteria proposed by Tan, Wang, and Lee.
2.1.3 Combining Unigrams with Syntactic Phrases
Lewis (1992) and Apté, Damerau, and Weiss (1994) were the first to investigate the impact of syntactic phrases[6] as features for text classification. Dumais et al. (1998) and Scott and Matwin (1999) did not observe a significant improvement in classification on the Reuters-21578 collection when noun phrases obtained with a shallow parser were used instead of unigrams. Moschitti and Basili (2004) found that neither words augmented with word sense information, nor syntactic phrases (acquired through shallow parsing) in combination with unigrams improved over the BOW baseline. Syntactic phrases appear to be even sparser than bigrams. Therefore, it is not surprising that most papers concluded that classifiers using only syntactic phrases perform worse than the baseline, except when the BOW baseline is low for that particular classification task (Mitra et al. 1997; Fürnkranz 1999).
Deep syntactic parsing is a computationally expensive process, but thanks to the increase in computational power it is now possible to use phrases acquired through deep syntactic parsing in classification tasks. Nastase, Sayyad, and Caropreso (2007) used Minipar to generate dependency triples that are combined with lemmatized and unlemmatized unigrams to classify the 10 most frequent classes in the Reuters-21578 data set. Their criterion for selecting triples as index terms is document frequency ≥ 2. The small improvement over the lemmatized unigram baseline was not statistically significant. Özgür and Güngör (2010, 2012) achieve small but significant improvements when combining unigrams with a subset of the dependency types from the Stanford parser on three different data sets, including the Reuters-21578 set. They find that separate pruning levels (based on the term frequency–inverse document frequency [TF-IDF] score of the index units) for the unigrams and syntactic phrases influence classification accuracy. Which dependency relations prove most relevant for a classification task depends greatly on the language use in the different data sets: The informal MiniNg data set (usenet posts) benefits a little from “simple” dependencies such as part, denoting a phrasal verb, for example write down, whereas classification in the more formal Reuters-21578 (newswire) and NSF (scientific abstracts) data sets benefits more from dependencies at the phrase and clause level (adjectival modifier, compound noun, prepositional attachment; and subject and object, respectively). The highest-ranking features for the NSF data set are compound noun (nn), adjectival modifier (amod), subject (subj), and object (obj), in that order. Furthermore, they observe that splitting up more generic relator types (such as prep) into different, more specific, subtypes increases the classification accuracy.
2.2 Patent Classification
It is not possible to draw far-reaching conclusions from previous research on patent classification, because there is no tradition of using a “standard” data set, and a standard split of patent corpora in a training and test set. Furthermore, there are differences between the various experiments in task definitions (mono-label versus multi-label classification); the granularity of the classification (depth in the IPC hierarchy); and the choices of (sub)sets of data. Fall and Benzineb (2002) give an overview of the work done in patent classification research up to 2002 and of the commercial patent classification systems available; see Benzineb and Guyot (2011) for a general introduction to patent classification.
Larkey (1999) was the first to present a fully automated patent classification system, but she did not report her overall accuracy results. Larkey (1998) used a combination of weighted words and noun phrases as index terms to classify a subset of the USPTO database, but found no improvement over a BOW baseline. The weights were calculated as the frequency of a word or phrase in a particular section multiplied by the manually assigned weight (importance) of that section. The weights for each word or phrase were then summed across sections. Term selection was based on a threshold for these weights.
Krier and Zaccà (2002) organized a comparative study of various academic and commercial systems for patent classification on a common data set. In this informal benchmark Koster, Seutter, and Beney (2001) achieved the best results, using the Balanced Winnow algorithm with a word-only text representation. Classification is performed for 44 or 549 categories (which correspond to different levels of depth in the version of the IPC in use at the time), with around 78% and 68% precision at 100% recall, respectively.
Fall et al. (2003) introduced the EPO-alpha data set, attempting to create a common benchmark for patent classification. Using only words as index terms, they tested different classification algorithms and found that SVMs outperform Naive Bayes, k-NN, SNoW, and decision-based classifiers. They achieved P@3 scores[7] of 73% and 59% on 114 classes and 451 subclasses, respectively. They also found that when using only the first 300 words from the abstract, claims, and description sections, classification accuracy is increased compared with using the complete sections. The same data set was later used by Koster and Seutter (2003), who experimented with a combined representation of words and phrases consisting of head-modifier pairs.[8] They found that head-modifier pairs could not improve on the BOW baseline: The phrases were too sparse to have much impact on the classification process.
Starting in 2009, the IRF[9] has organized CLEF-IP patent classification tracks in an attempt to bridge the gap between academic research and the patent industry. For this purpose the IRF has put a lot of effort into providing very large patent data sets,[10] which have enabled academic researchers to train their algorithms on real-life data. In the CLEF-IP 2010 classification track the best results were achieved by Guyot, Benzineb, and Falquet (2010). Using the Balanced Winnow algorithm, they achieved a P@1 score of 83%, while classifying at the subclass level. They used a combination of words and statistical phrases (collocations of variable length extracted from the corpus) as index terms and used all available documents (in English, French, and German) in the corpus as training data. In the same competition, Derieux et al. (2010) came second (in terms of P@1). They also used a mixed document representation of both single words and longer phrases, which had been extracted from the corpus by counting word co-occurrences. Verberne, Vogel, and D'hondt (2010) and Beney (2010) experimented with a combined representation of words and syntactic phrases derived from an English and a French syntactic parser, respectively. They both found that adding syntactic phrases to words improves classification accuracy slightly. Beney (2010) remarks that this improvement may be language-dependent. As a follow-up, Koster et al. (2011) investigated the added value of syntactic phrases. They found that attributive phrases, that is, combinations of adjectives or nouns with nouns, were by far the most important syntactic phrases for patent classification. On a subset of the CLEF-IP 2010 corpus[11] they also found a small, but significant, improvement when adding dependency triples to words.
3. Experimental Set-up
In this article, we investigate the relative contributions of different types of terms to the performance of patent classification. We use four different types of terms, namely, lemmatized unigrams, lemmatized bigrams (see Section 3.2.1), lemmatized dependency triples obtained with the Stanford parser (see Section 3.2.2), and lemmatized dependency triples obtained with the AEGIR parser (see Section 3.2.3). We will leave term (feature) selection to the preprocessing module of the Linguistic Classification System (LCS) which we used for all experiments (see Section 3.3). We will analyze the relation between unigrams and phrases in the class profiles in some detail, however (see Sections 4.2 and 4.3).
3.1 Data Selection
We conducted classification experiments on a collection of patent documents obtained from the CLEF-IP 2010 corpus,[12] which is a subset of the larger MAREC patent collection. The corpus contains 2.6 million patent documents, which roughly correspond to 1.3 million individual patents, published between 1985 and 2001.[13] The documents in the collection are encoded in a customized XML format and may include text in English, French, and German. In addition to the standard sections of a patent document (title, abstract, claims, and description section), the documents also include meta-information on inventor, date of application, assignee, and so forth. Because our focus lies on text representation, we did not include any of the meta-data in our document representations.
The most informative sections of a patent document are generally considered to be the title, the abstract, and the beginning of the description (Benzineb and Guyot 2011). Verberne and D'hondt (2011) showed that using both the description and the abstract gives a small, but significant, improvement in classification results on the CLEF-IP 2011 corpus, compared with classification on abstracts only. The effort involved in parsing the descriptions is considerable, however: Because of the long sentences and the dense word use, a parser will have much more difficulty in processing text from the description section than from the abstracts. The titles of the patent documents also pose a parsing problem: These are generally short noun phrases that contain ambiguous PP-attachments that are impossible to disambiguate without any domain knowledge. This leads to incorrect syntactic analyses and, consequently, noisy dependency triple features. Because we are interested in comparing classification results for different text representations, and not in comparing results for different sections, we opted to use only the abstract sections of the patent document in the current article.
From the corpus, we extracted all files that contain both an abstract in English and at least one IPC class[14] in the <classification-ipcr> field. We extracted the IPC classes on the document level; this means that we did not include the documents where the IPC class is in a separate file from the English abstract. In total, there were 121 different classes in our data set. Most documents have been assigned one to three different IPC classes (on class level). On average, a patent abstract in our data set has 2.12 class labels. Previous cross-validation experiments on the same document set showed very little variation (standard deviation < 0.3%) between the classification accuracies in different training-test splits (Verberne, Vogel, and D'hondt 2010). We therefore decided to use only one training and test set split.[15]
The final data set contained 532,264 abstracts, divided into two sets: (1) a training set (425,811 documents) and (2) a test set (106,453 documents). The distribution of the data over the classes is in accordance with the Pareto Principle: 20% of the classes cover 80% of the data, and 80% of the classes comprise only 20% of the data.
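The split procedure itself is simple. The sketch below is not the perl script from the LCS distribution mentioned in note [15]; it is a minimal Python illustration, under our own naming, of a shuffle-based split in which the label distribution of the training set approximates that of the whole corpus by splitting within each label combination.

    import random
    from collections import defaultdict

    def split_corpus(docs, train_fraction=0.8, seed=42):
        """docs: iterable of (doc_id, labels) pairs, where labels is the set of IPC
        classes of the document. Documents are grouped by their exact label
        combination and each group is split separately, so the class distribution
        of the train set approximates that of the corpus."""
        random.seed(seed)
        groups = defaultdict(list)
        for doc_id, labels in docs:
            groups[frozenset(labels)].append(doc_id)
        train, test = [], []
        for group in groups.values():
            random.shuffle(group)
            cut = int(round(len(group) * train_fraction))
            train.extend(group[:cut])
            test.extend(group[cut:])
        return train, test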
3.2 Data Preprocessing
3.2.1 Unigrams and Bigrams
The sentences in the abstract documents were converted to single words by splitting on whitespace and removing punctuation. The words were then lemmatized using the AEGIR lexicon. Bigrams were created through a similar procedure. We did not create bigrams that spanned sentence boundaries. This resulted in approximately 60 million unigram and bigram tokens for the present corpus.
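As a minimal sketch of this preprocessing step (the lemmatize argument is a placeholder for the AEGIR lexicon lookup, which we cannot reproduce here):

    import re

    def tokenize(sentence, lemmatize=lambda w: w):
        """Split on whitespace, strip punctuation, lowercase, and lemmatize."""
        words = (re.sub(r"[^\w-]", "", w).lower() for w in sentence.split())
        return [lemmatize(w) for w in words if w]

    def unigrams_and_bigrams(sentences, lemmatize=lambda w: w):
        """Build unigram and bigram terms; bigrams never cross sentence boundaries."""
        unigrams, bigrams = [], []
        for sentence in sentences:
            tokens = tokenize(sentence, lemmatize)
            unigrams.extend(tokens)
            bigrams.extend("_".join(pair) for pair in zip(tokens, tokens[1:]))
        return unigrams, bigrams

    # unigrams_and_bigrams(["The system will consist of four separate modules."])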
3.2.2 Stanford
The Stanford parser is a broad-coverage natural language parser that is trained on newswire text, for which it achieves state-of-the-art performance. The parser has not been optimized/retrained for the patent domain.[17] In spite of the technical difficulties (Parapatics and Dittenbach 2009) and loss of linguistic accuracy for patent texts reported in Mille and Wanner (2008), most patent processing systems that use linguistic phrases use the Stanford parser because its dependency scheme has a number of properties that are valuable for Text Mining purposes (de Marneffe and Manning 2008). The Stanford parser's collapsed typed dependency model has a set of 55 different syntactic relators to capture semantically contentful relations between words. For example, the sentence “The system will consist of four separate modules” is analyzed into the following set of dependency triples in the Stanford representation:
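Under the standard collapsed typed dependencies scheme, and after the index stripping and lemmatization described below, the triples would be approximately:

    det(system, the)
    nsubj(consist, system)
    aux(consist, will)
    prep_of(consist, module)
    num(module, four)
    amod(module, separate)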
The Stanford parser was run with a maximum memory heap of 1.2 GB. Sentences longer than 100 words were automatically skipped. Combined with failed parses this led to a 1.2% loss of parser output on the complete data set. Parsing the entire set of abstracts took 1.5 weeks on a computer cluster consisting of 60 cores (2.4 GHz, 4 GB RAM per core). The resulting dependency triples were stripped of the word indexes and then lemmatized using the AEGIR lexicon.
3.2.3 AEGIR
AEGIR[18] is a dependency parser that was designed specifically for robust parsing of technical texts. It combines a hand-crafted grammar with an extensive word-form lexicon. The parser lexicon was compiled from different technical terminologies, such as the SPECIALIST lexicon[16] and the UMLS.[19] The AEGIR parser aims to capture the aboutness of sentences. Rather than outputting extensive linguistic detail on the syntactic structure of the input sentence as the Stanford parser does, AEGIR returns only the bare syntactic–semantic structure of the sentence. During the parsing process, it effectively performs normalization at various levels, such as typography (for example, upper and lower case, spacing), spelling (for example, British and American English, hyphenation), morphology (lemmatization of word forms), and syntax (standardization of the word order and transforming passive structures into active ones).
The AEGIR parser uses only eight syntactic relators and returns fewer unique triples than the Stanford parser. The parser is currently still under development; for this article we used AEGIR v.1.7.5. The parser was constrained to a time limit of three seconds per sentence. This caused a loss of 0.7% of parser output on the complete data set. Parsing the entire set of abstracts took slightly less than a week on the computer cluster described above. The AEGIR parser has several output formats, including its own dependency format. The example sentence used to illustrate the output of the Stanford parser is analyzed as follows:
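Based on the normalizations described above (auxiliaries dropped, prepositions folded into the relator, word forms lemmatized), the output would be along the lines of the following triples; the relator names here are placeholders rather than AEGIR's exact inventory:

    [system, SUBJ, consist]
    [consist, PREP_of, module]
    [module, ATTR, four]
    [module, ATTR, separate]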
3.2.4 Lemmatization
Table 1 shows the impact of lemmatization (using the AEGIR lexicon) on the distribution of terms for the different text representations. Lemmatization and stemming are standard approaches to decreasing the sparsity of features; stemming is more aggressive than lemmatization. Özgür and Güngör (2009) showed that—when using only words as index terms—stemming (with the Porter Stemmer) appears to improve performance; stemming dependency triples did not improve performance, however.
Table 1. Impact of lemmatization (using the AEGIR lexicon) on the distribution of terms for the different text representations.

| | # tokens | # types (raw) | # types (lemmatized) | gain | token/type (lem.) |
| --- | --- | --- | --- | --- | --- |
| unigram | 48,898,738 | 160,424 | 142,396 | 1.12 | 343.39 |
| bigram | 48,473,756 | 3,836,212 | 3,119,422 | 1.23 | 15.54 |
| Stanford | 35,772,003 | 8,750,839 | 7,430,397 | 1.18 | 4.81 |
| AEGIR | 31,004,525 | – | 5,096,918 | – | 6.08 |
We opted to use a less aggressive form of generalization: Lemmatizing the word forms. We found that the bigrams gain[20] most by lemmatizing the word forms, resulting in a higher token/type ratio. From Table 1 it can be seen that there are fewer triple tokens than bigram tokens: Whereas all the (high-frequency) function words are kept in the bigram representations, both dependency formats discard some function words in their parser output. For example, the AEGIR parser does not create triples for auxiliary verbs, and in both dependency formats, the prepositions become part of the relator. Consequently, the parsers will output fewer but more variable tokens, which results in lower token/type ratios and a lower impact of lemmatization.
3.3 Classification Experiments
The classification experiments were carried out within the framework of the LCS (Koster, Seutter, and Beney 2003). The LCS has been developed for the purpose of comparing different text representations. Currently, three classifier algorithms are available: Naive Bayes, Balanced Winnow (Dagan, Karov, and Roth 1997), and SVM-light (Joachims 1999). Verberne, Vogel, and D'hondt (2010) found that Balanced Winnow and SVM-light yield comparable classification accuracy scores for patent texts on a similar data set, but that Balanced Winnow is much faster than SVM-light for classification problems with a large number of classes. The Naive Bayes classifier yielded a lower accuracy. We therefore only used the Balanced Winnow algorithm for our classification experiments, which were run with the following LCS configuration, based on tuning experiments on the same data by Koster et al. (2011):
Global term selection (GTS): Document frequency minimum is 2, term frequency minimum is 3. Although initial term selection is necessary when dealing with such a large corpus, we deliberately aimed at keeping as many of the sparse phrasal terms as possible.
Local term selection (LTS): Simple Chi Square (Galavotti, Sebastiani, and Simi 2000). We used the LCS option to automatically select the most representative terms for every class, with a hard maximum of 10,000 terms per class.[21]
After LTS the selected terms of all classes are aggregated into one combined term vocabulary, which is used as the starting point for training the individual classes (see Table 3).
Term strength calculation: LTC algorithm (Salton and Buckley 1988) which is an extension of the TF–IDF measure.
Training method: Ensemble learning based on one-versus-rest binary classifiers.
Winnow configuration: We performed tuning experiments for the Winnow parameters on a development set of around 100,000 documents. We arrived at using the same setting as Koster et al. (2011), namely, α = 1.02, β = 0.98, θ+ = 2.0, θ− = 0.5, with a maximum of 10 training iterations.
For each document the LCS returns a ranked list of all possible labels and the attendant confidence scores. If the score assigned is higher than a predetermined threshold, the document is assigned that category. The Winnow algorithm has a default (natural) threshold equal to one. We configured the LCS to return a minimum of one label (with the highest score, even if it is lower than the threshold) and a maximum of four labels for each document.
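To make the role of these parameters concrete, here is a schematic sketch of Balanced Winnow training with the promotion/demotion margins and the natural decision threshold described above. It is an illustration only, not the LCS implementation, and the initial weight values are placeholders.

    class BalancedWinnow:
        """Schematic Balanced Winnow for one class in a one-versus-rest ensemble."""

        def __init__(self, alpha=1.02, beta=0.98, theta_plus=2.0, theta_minus=0.5):
            self.alpha, self.beta = alpha, beta              # promotion / demotion
            self.theta_plus, self.theta_minus = theta_plus, theta_minus
            self.w_pos, self.w_neg = {}, {}                  # two weights per term

        def score(self, doc):
            # doc maps a term to its strength (e.g., the LTC-weighted value)
            return sum(x * (self.w_pos.get(t, 1.0) - self.w_neg.get(t, 0.5))
                       for t, x in doc.items())

        def train(self, docs, is_member, iterations=10):
            for _ in range(iterations):
                for doc, positive in zip(docs, is_member):
                    s = self.score(doc)
                    if positive and s < self.theta_plus:           # promote
                        self._update(doc, self.alpha, self.beta)
                    elif not positive and s > self.theta_minus:    # demote
                        self._update(doc, self.beta, self.alpha)

        def _update(self, doc, f_pos, f_neg):
            for t in doc:
                self.w_pos[t] = self.w_pos.get(t, 1.0) * f_pos
                self.w_neg[t] = self.w_neg.get(t, 0.5) * f_neg

        def assign(self, doc, threshold=1.0):
            # Winnow's natural decision threshold is 1; the ranked label list and
            # the minimum-one/maximum-four label constraint are applied on top.
            return self.score(doc) >= threshold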
The classification quality was determined by calculating the Precision, Recall, and F1 measures per document/class combination (see, e.g., Koster, Seutter, and Beney 2003), on the document level (micro-averaged scores).
Table 2 shows the impact of our global term selection criteria for the different text representations. This first feature reduction step is category-independent: The features are discarded on the basis of the term and document frequencies over the corpus, disregarding their distributions for the specific categories. We can see that the token/type ratio of Table 1 is mirrored in this table: The sparsest syntactic phrases lose most terms. Although the Stanford parser output is the sparsest text representation, it has the largest pool of terms to select from at the end of the GTS process.
Table 2. Impact of the global term selection (GTS) criteria for the different text representations.

| | total # of terms | # of terms selected in GTS | % of terms removed in GTS |
| --- | --- | --- | --- |
| unigram | 142,396 | 58,423 | 58.97 |
| bigram | 3,119,422 | 1,115,170 | 64.25 |
| Stanford | 7,430,397 | 1,618,478 | 78.22 |
| AEGIR | 5,096,918 | 1,312,715 | 74.24 |
The impact of the second feature reduction phase is shown in Table 3. During local term selection, the LCS finds the most representative terms for each class by selecting the terms whose distributions in the sets of positive and negative training examples for that class are maximally different from the general term distribution. We can see that in the combined runs only around 50% of the selectable unigrams (after GTS) are selected as features during LTS. This means that the phrases replace at least a part of the information contained in the possible unigrams.
Table 3. Number of terms per text representation after global (GTS) and local (LTS) term selection.

| | | # of terms after GTS | # of terms after LTS |
| --- | --- | --- | --- |
| baseline | uni | 58,423 | 69,476 |
| unigrams + bigrams | uni | 58,423 | 23,753 |
| | bi | 1,115,170 | 300,826 |
| unigrams + Stanford triples | uni | 58,423 | 26,630 |
| | Stanford | 1,618,478 | 424,204 |
| unigrams + AEGIR triples | uni | 58,423 | 29,348 |
| | AEGIR | 1,312,715 | 409,851 |
4. Results and Discussion
4.1 Classification Accuracy
Table 4 shows the micro-averages of Precision, Recall, and F1 for five classification experiments with different document representations. To give an idea of the complexity of the task we have included a random guessing baseline in the first row.[22] We found that extending a unigram representation with statistical and/or linguistic phrases gives a significant improvement in classification accuracy over the unigram baseline. The best-performing classifier is the one that combines all four text representations. When adding only one type of phrase to unigrams, the unigrams + bigrams combination is significantly better than the combinations with syntactic phrases. Combining all four representations boosts recall, but has less impact on precision.
Table 4. Micro-averaged Precision, Recall, and F1 for the five classification experiments with different document representations.

| | P | R | F1 |
| --- | --- | --- | --- |
| weighted random guessing | 6.09% ± 0.14 | 6.04% ± 0.14 | 6.06% ± 0.14 |
| unigrams | 76.27% ± 0.26 | 66.13% ± 0.28 | 70.84% ± 0.27 |
| unigrams + bigrams | 79.00% ± 0.24 | 70.19% ± 0.27 | 74.34% ± 0.26 |
| unigrams + Stanford triples | 78.35% ± 0.25 | 69.57% ± 0.28 | 73.70% ± 0.26 |
| unigrams + AEGIR triples | 78.51% ± 0.25 | 69.18% ± 0.28 | 73.55% ± 0.26 |
| all representations | 79.51% ± 0.24 | 71.11% ± 0.27 | 75.08% ± 0.26 |
The results are similar to Özgür and Güngör's (2012) findings for scientific abstracts: Adding phrases to unigrams can significantly improve classification. The text in the patent corpus is vastly different from the newswire text in the Reuters corpus. Like scientific abstracts, patents are full of jargon and terminology, often expressed in multi-word units, which might favor phrasal representations. Moreover, the innovative concepts in a patent are sometimes described in generalized terms combined with some specifier (to ensure larger legal scope). For example, a hose might be referred to as a watering device. The term hose can be captured with a unigram representation, but the multi-word expression cannot. The difference with the results on the Reuters-21578 data set (discussed in Section 2.1.1), however, may not completely be due to genre differences: Bekkerman and Allan (2003) remark that the unigram baseline for the Reuters-21578 task is difficult to improve upon, because in that data set a few keywords are enough to distinguish between the categories.
4.2 Unigram versus Phrases
In this section we investigate whether adding phrases suppresses, complements, or changes unigram selection. To examine the impact of phrases in the classification process, we analyzed the class profiles[23] of two large classes (H04 – Electric Communication Technique; and H01 – Basic electric elements) that show significant improvements in both Precision and Recall[24] for the bigram classifier compared with the unigram baseline. We look at (1) the overlap of the single words in the class profiles of the unigram and combined representations; and (2) the overlap of the single words and the words that make up the phrases (hereafter referred to as parts) within the class profile of one text representation.
4.2.1 Overlap of Unigrams
The class profiles in the baseline unigram classifier contained far fewer terms (< 20%) than the profiles in the classifiers that combine unigrams and phrases. This could be expected from the data in Tables 2 and 3.
Unigrams are the highest ranked[25] features in the combined representation class profiles (see Table 5). Furthermore, words that are important terms for unigram classification also rank high in the combined class profiles: On average, there is an 80% overlap of the top 1,000 most important words in unigram and combined representation class profiles. This decreases to 75% when looking at the 5,000 most important words. This shows that the classifier tends to select mostly the same words as important terms for the different text representation combinations. The relative ranking of the words is very similar in the class profiles of all the text representations. Thus, adding phrases to unigrams does not result in replacing the most important unigrams for a particular class, and the improvements in classification accuracy must derive from the additional information in the selected phrases.
Table 5. Phrases among the top 10, 20, 50, 100, and 1,000 ranked terms in the class profiles of the combined representations.

| | | rnk10 | rnk20 | rnk50 | rnk100 | rnk1000 |
| --- | --- | --- | --- | --- | --- | --- |
| unigrams + bigrams | bigrams | 3.0 | 4.0 | 48.0 | 45.0 | 70.5 |
| unigrams + Stanford triples | Stanford | 0.0 | 1.0 | 24.0 | 26.0 | 48.0 |
| unigrams + AEGIR triples | AEGIR | 0.0 | 0.5 | 20.0 | 25.0 | 44.9 |
| all representations | bigrams | 2.0 | 2.0 | 34.0 | 36.0 | 43.2 |
| | Stanford | 0.0 | 0.0 | 4.0 | 6.0 | 13.0 |
| | AEGIR | 0.0 | 0.0 | 2.0 | 4.0 | 18.0 |
4.2.2 Overlap of Single Words and Parts of Bigrams
Like Caropreso, Matwin, and Sebastiani (2001), we investigated to what extent the parts of the high-ranked phrases overlap with words in the unigrams + bigrams class profile. We first looked at the lexical overlap of the words and the parts of the bigrams in the H01 unigrams + bigrams class profile. Interestingly, we found a relatively low overlap between the words and the parts of the phrases: For the 20 most important bigrams, only 11 of the 32 unique parts of the bigrams overlap with the 100 most important single word terms; in the complete class profile only 56% of the 10,387 parts of the bigrams overlap with the 9,064 words in the class profile. This means that a large part of the bigrams contains complementary information not present in the unigrams in the class profile.
To gain a deeper insight into the relationship between the bigrams and their parts, we also looked at the mass of the different terms in the class profiles. The mass of a term for a certain class is the product of its TF–IDF score and its Winnow weight for that class; “mass” provides an estimate of how much a term contributes to the score of documents for a particular class. We can divide the terms into three main categories:
(a) mass(partA) ≥ mass(partB) ≥ mass(bigram);
(b) mass(partA) ≥ mass(bigram) > mass(partB);
(c) mass(bigram) > mass(partA) ≥ mass(partB).
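A small sketch of this categorization (mass computed as defined above; the function and variable names are ours):

    def mass(term, cls, tfidf, winnow_weight):
        """mass = TF-IDF score of the term times its Winnow weight for the class.
        Terms absent from a profile get a mass of zero."""
        return tfidf.get(term, 0.0) * winnow_weight.get(cls, {}).get(term, 0.0)

    def bigram_category(bigram, cls, tfidf, winnow_weight):
        """Assign a bigram such as 'mobile_station' to category (a), (b), or (c)."""
        part_a, part_b = bigram.split("_")
        m_bigram = mass(bigram, cls, tfidf, winnow_weight)
        m_a, m_b = sorted((mass(part_a, cls, tfidf, winnow_weight),
                           mass(part_b, cls, tfidf, winnow_weight)), reverse=True)
        if m_a >= m_b >= m_bigram:
            return "a"        # both parts carry at least as much mass as the bigram
        if m_a >= m_bigram > m_b:
            return "b"        # the bigram outweighs only its weaker part
        return "c"            # the bigram outweighs both of its parts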
The bigrams in category (c) (27%) are typically made up from low-ranked single words, such as mobile_station. Interestingly, most bigram parts in this subset do not occur as word terms in the unigram and bigram class profiles, but occur in the negative class profiles (a selection of terms that are considered to describe anything but that particular class). The complementary information of bigram phrases (compared to unigrams) is contained in this set of bigrams.
4.3 Statistical versus Linguistic Phrases
Results in Section 4.1 indicate that bigrams are the most important additional features, but the experiment combining all four representations showed that dependency triples do complement bigrams. In this section we examine what information is captured by the different phrases and how this accounts for the differences in classification accuracy.
4.3.1 Class Profile Analysis
We first examined the differences between the statistical phrases and the two types of linguistic phrases to discover what information contained in the bigrams leads to better classification results. We performed our analysis on the different class profiles of B60 (“Vehicles in general”), a medium-sized class, which most clearly shows the advantage of the bigram classifier compared to the classifiers with linguistic phrases.[26]
All four class profiles with phrases contain roughly the same set of unigrams (between 78% to 91% overlap) that occur quite high in the corresponding unigram class profile. The AEGIR class profile contains 10% more unigrams than the other combined representation class profiles; these are mainly words that appear in the negative class profile of the corresponding unigram classifier. As in class H01, the relative position of the words remains the same. The absolute position of the words in the list, however, does change: Caropreso, Matwin, and Sebastiani (2001) introduced a measure for the effectiveness of phrases as terms, called the penetration, that is, the percentage of phrases in the top k terms when classifying with both words and phrases.
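Penetration is straightforward to compute from a ranked class profile; a minimal sketch, assuming phrasal terms can be recognized by our underscore-joined notation:

    def penetration(ranked_terms, k, is_phrase=lambda t: "_" in t):
        """Percentage of phrases among the k highest-ranked terms of a class
        profile (terms sorted by decreasing mass)."""
        top = ranked_terms[:k]
        return 100.0 * sum(1 for t in top if is_phrase(t)) / max(len(top), 1)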
Comparing the penetration levels at the various ranks for the different classifiers, we can see that the classification results correspond with the tendency of a classifier to select phrases in the top k terms. Interestingly, we see a large disparity in the phrasal features that are selected by the combination classifier. The preference for bigrams is mirrored by the penetration levels of the unigrams + bigrams classifier which has selected more bigrams at higher ranks in the class profile than the classifiers with the linguistic phrases. This is in line with the findings of Caropreso, Matwin, and Sebastiani (2001) that penetration levels are a reasonable way to compute the contribution of n-grams to the quality of a feature set. On average, the linguistic phrases have much smaller weight in the class profiles than the bigrams and, consequently, are likely to have a smaller impact during the classification process. For the combination run, however, it seems that a long tail of small-impact features does improve classification accuracy.
Linguistic analysis of the top 100 phrases in the profiles of class B60 shows that all classifiers select similar types of phrases. We manually annotated the bigrams with the correct syntactic dependencies (in the Stanford collapsed typed dependency format) and compared these with the syntactic relations expressed in the linguistic phrases. The results are summarized in Table 6.
Table 6. Grammatical relations expressed by the top 100 phrases in the B60 class profiles.

| grammatical relation | bigrams | Stanford | AEGIR | combination |
| --- | --- | --- | --- | --- |
| noun–noun compounds | 41 | 48 | 62 | 44 |
| adjectival modifier | 11 | 8 | – | – |
| determiner | 34 | 28 | 27 | 41 |
| subject | 6 | 4 | 6 | 9 |
| prepositions | 2 | 4 | 1 | 2 |
| <other> | 7 | 8 | 4 | 4 |
It appears that noun phrases and compounds such as circuit board and electric device are by far the most important terms in the class profiles. Interestingly, phrases that contain a determiner relation (e.g., the device) are deemed equally important in all four different class profiles. It is unlikely that this is a semantic effect, that is, that the determiner relation provides additional semantic information to the nouns in the phrases; rather, it seems to be an artefact of the abundance of noun phrases in patent texts. We also looked into the lexical overlap between the parts of the different types of phrases. We found that the selected phrases encode almost exactly the same information in all three representations: There is an 80% overlap between the parts of the top 100 most important phrases. This decreases only to 75% when looking at the 10,000 most important phrases.
Given that the class profiles select the same set of words and contain phrases with a high lexical overlap, how do we explain the marked differences in classification accuracy between the three different representations? These must stem from the different combinations of the words in the phrasal features. To examine in detail how the features created through the different text representations differ, we conducted a feature quality assessment experiment against a manually created reference set.
4.3.2 Human Quality Assessment Experiment
To gain more insight into the syntactic and semantic relations that are considered most informative by humans, we conducted an experiment in which we asked human annotators to select the five to ten most informative phrases[27] for 15 sentences taken at random from documents in the three largest classes in the corpus. We then compiled a reference set consisting of 70 phrases (4.6 phrases per sentence) which were considered as “informative” by at least three out of four annotators. Of these, 57 phrases were noun–noun compounds and 11 were combinations of an adjectival modifier with a noun. None of the annotators selected phrases containing determiners.
We created bigrams from the input and extracted head–modifier pairs[28] from the parser output for the sentences in the test set. We then compared the overlap of the generated phrases with the reference phrases. We found that bigrams overlap with 53 of the 70 reference phrases; Stanford triples overlap with 62 phrases and AEGIR triples overlap with 57 phrases. Although three data points are not enough to compute a formal measure, it is interesting to note the correspondence with the number of terms kept for the three text representations after Local Term Selection (see Table 3). The fact that the text representation with the smallest number of terms after LTS and with the smallest overlap with “contentful” phrases in a text as indicated by human annotators still yields the best classification performance suggests that not all “contentful” phrases are important or useful for the task of classifying that text. This finding is reminiscent of the fact that the “optimal” summary of a text is dependent on the goal with which the summary was produced (Nenkova and McKeown 2011).
Only 15% of the phrases extracted by the human annotators contain word combinations that have long-distance dependencies in the original sentences. This suggests that the most meaningful phrases are expressed in local dependencies, that is, adjacent words. Consequently, syntactic analysis aimed at discovering meaning expressed by long-distance dependencies can only make a small contribution. A further analysis of the phrases showed that the smaller coverage of the bigrams is due to the fact that some of the relevant noun–noun combinations are missed because function words, typically determiners or prepositions, occur between the nouns. For example, the annotators constructed the reference phrase rotation axis for the noun phrase the rotation of the second axis. This reference phrase cannot be captured by the bigram representation. When intervening function words are removed from the sentences, the coverage of the resulting bigrams on the reference set rises[29] to 59 phrases (more than AEGIR, and almost as many as Stanford). Despite the fact that generating more phrases does not necessarily lead to better classification performance, we intend to use bigrams stripped of function words as additional terms for patent classification in future experiments.
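A minimal sketch of such function-word-stripped bigrams (the stop-word list here is a small illustrative one, not the list we would actually use; tokens are assumed to be lemmatized already):

    FUNCTION_WORDS = {"the", "a", "an", "of", "for", "to", "in", "on", "with", "and", "or"}

    def content_bigrams(tokens, function_words=FUNCTION_WORDS):
        """Bigrams over adjacent content words: function words are removed before
        pairing, so intervening determiners and prepositions no longer block
        noun-noun combinations."""
        content = [t for t in tokens if t not in function_words]
        return ["_".join(pair) for pair in zip(content, content[1:])]

    # content_bigrams(["the", "rotation", "of", "the", "axis"])  ->  ['rotation_axis']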
The analysis also revealed an indication of why syntactic phrases may lead to inferior classification results: Both syntactic parsers consistently fail to find the correct structural analysis of long and complex noun phrases such as an implantable, inflatable dual chamber shape retention tissue expander, which are frequent in patent texts. Phrases like this contain many compounds in an otherwise complex syntactic structure, namely

[an [implantable, inflatable [[dual chamber] [shape retention] [tissue expander]]]],

whereas the right-headed bias of both parsers yields a uniformly right-branching analysis:

[an [implantable, [inflatable [dual [chamber [shape [retention [tissue expander]]]]]]]].
These findings are confirmed when looking at the overlap of the word combinations: Although there is high lexical overlap between the phrases of the different representations (80% overlap of the parts of phrases in Section 4.3.1), the overlap of the word combinations that make up the phrases is much lower: Only 33% of the top 1,000 phrases are common between all three representations.
4.4 Stanford versus AEGIR Triples
The performance with the unigrams + Stanford triples is not significantly different from the combination with AEGIR triples. Because the AEGIR triples are slightly less sparse (see Table 1), we expected that these would have an advantage over Stanford triples. Most of the normalization processes that make the AEGIR triples less sparse concern syntactic variation on the clause level, however. But as was shown in Section 4.3, the most important terms for classification in the patent domain are found in the noun phrase, where Stanford and AEGIR perform similar syntactic analyses. Although Stanford's dependency scheme is more detailed (see Table 6), the noun-phrase internal dependencies in the Stanford parser map practically one-to-one onto AEGIR's set of relators, resulting in very similar dependency triple features for classification. Consequently, there is no normalization gain in using the AEGIR dependency format to describe the internal structure of the noun phrases.
4.5 Comparison with French and German Patent Classification
We found that phrases contribute to improving classification on English patent abstracts. The improvement might be language-dependent, however, because compounds are treated differently in different languages. A compounding language like German might benefit less from using phrases than English. To estimate the generalizability of our findings, we conducted additional experiments in which we compared the impact of adding bigrams to unigrams for both French and German.
Using the same methods described in Sections 3.1 and 3.2, we extracted and processed all French and German abstracts from the CLEF-IP 2010 corpus, resulting in two new data sets that contained 86,464 and 294,482 documents, respectively (Table 7). Both data sets contained the same set of 121 labels and had label distributions similar to the English data set. The sentence-splitting script was updated with the most common French and German abbreviations to minimize incorrect sentence splitting. The resulting sentences were then tagged using the French and German versions of the TreeTagger. From the tagged output, we extracted the lemmas and used these to construct unigrams and bigrams for both languages. We ran the experiments with the LCS using the settings reported in Section 3.3.
Table 7. Classification results for the French and German patent abstracts.

| | | P | R | F1 |
| --- | --- | --- | --- | --- |
| French | unigrams | 70.65% ± 0.68 | 61.40% ± 0.73 | 65.70% ± 0.70 |
| | unigrams + bigrams | 72.31% ± 0.67 | 62.58% ± 0.72 | 67.09% ± 0.69 |
| German | unigrams | 76.44% ± 0.34 | 65.82% ± 0.38 | 70.73% ± 0.37 |
| | unigrams + bigrams | 76.39% ± 0.34 | 65.41% ± 0.38 | 70.47% ± 0.37 |
The results show a much smaller but still significant improvement for using bigrams when classifying French patent abstracts and even a deterioration for German. Due to the difference in size between the English and French data sets it is difficult to draw hard conclusions on which language benefits most from adding bigrams. It is clear, however, that our findings are not generalizable to German (and probably other compounding languages).
5. Conclusion
In this article we have examined the usefulness of statistical and linguistic phrases for patent classification. Similar to Özgür and Güngör's 2010 results for scientific abstracts, we found that adding phrases to unigrams significantly improves classification results for English. Of the three types of phrases examined in this article, bigrams have the most impact, both in the experiment that combined all four text representations, and in combination with unigrams only.
The abundance of compounds in the terminology-rich language of the patent domain results in a relatively high importance for the phrases. The top phrases across the different representations were mostly noun–noun compounds (for example watering device), followed by phrases containing a determiner relation (for example the module) and adjective modifier phrases (for example separate module).
The information in the phrases and unigrams overlaps to a large extent: Most of the phrases consist of words that are important unigram features in the combined profile and that also appear in the corresponding unigram class profile. When examining the H01 class profiles, however, we found that 27% of the selected phrases contain words that were not selected in the unigram profile (see Section 4.2.2).
When comparing the impact of features created from the output of the aboutness-based AEGIR parser with those from the Stanford parser, we found the latter resulted in slightly (but not significantly) better classification results. AEGIR's normalization features are not advantageous (compared with Stanford) in creating noun-phrase internal triples, which are the most informative features for patent classification.
The parsers were not specifically trained for the patent domain and both experienced problems with long, complex noun phrases consisting of sequences of words that can function as adjective/adverb or noun and that are not interrupted by function words that clarify the syntactic structure. The right-headed bias of both syntactic parsers caused problems in analyzing those constructions, yielding erroneous and variable data. As a consequence, parsers may miss potentially relevant noun–noun compounds and noun phrases with adjectival modifiers. Because of the highly idiosyncratic nature of the terminology used in the patent domain, it is not evident whether this problem can be solved by giving a parser access to information about the frequency with which specific noun–noun, adjective–noun, and adjective/adverb–adjective pairs occur in technical texts. Bigrams, on the other hand, are less variable (as seen in Table 1) and therefore yield better classification results. This is all the more important because the dependency relations marked as important for understanding a sentence by the human annotators consist mainly of pairs of adjacent words.
We also performed additional experiments to examine the generalizability of our findings for French and German: As could be expected, compounding languages like German which express complex concepts in “one word” do not gain from using bigrams.
In line with Bekkerman and Allan (2003) we can conclude that with the large quantities of text available today, the role of phrases as features in text classification must be reconsidered. For the automated classification of English patents at least, adding phrases and more specifically bigrams significantly improves classification accuracy.
Notes
1. The IPC is a complex hierarchical classification system comprising sections, classes, subclasses, and groups. For example, the “A42B 1/12” class label, which groups designs for bathing caps, falls under section A “Human necessities,” class 42 “Headwear,” subclass B “Head coverings,” group 1 “Hats; caps; hoods.” The latest edition of the IPC contains eight sections, 129 classes, 639 subclasses, 7,352 groups, and 61,847 subgroups. The IPC covers inventions in all technological fields in which inventions can be patented.
2. By a phrase we mean an index unit consisting of two or more words, generated through either syntactic or statistical methods.
3. Multi-class classification is the problem of classifying instances into more than two classes. Multi-label signifies that documents in this test set are associated with more than one class, and must be assigned a set of labels during classification.
4. The notion of aboutness refers to the conceptual content expressed by a dependency triple. For a more detailed description, see Section 3.2.3.
5. The difference between multiview and monoview classification corresponds to what is called late and early fusion in the pattern recognition literature.
6. The concept “syntactic phrase” can be given several different interpretations, such as noun phrases, verb phrases, predicate structures, dependency triples, and so forth.
7. Precision at rank 3 (P@3) signifies the percentage of correct labels among the first three labels assigned by the classifier to a given document.
8. Head-modifier pairs were derived from the syntactic analysis output of the EP4IR syntactic parser.
9. Information Retrieval Facility, see www.irf.com.
10. The CLEF-IP 2009, CLEF-IP 2010, and CLEF-IP 2011 data sets can be obtained through the IRF. The more recent data sets subsume the older sets.
11. The same data set as will be used in this article. For a more detailed description, see Section 3.1.
12. This test collection is available through the IRF (http://www.ir-facility.org/collection).
13. Note the difference between a patent and a patent document: A patent is not a physical document itself, but a name for a group of patent documents that have the same patent ID number.
14. For our classification experiments we use the codes on the class level in the IPC8 classification.
15. The data split was performed using a perl script that randomly shuffles the documents and puts them into a train set and test set, while ensuring that the class distribution of the examples in the train set approximates that of the whole corpus. It can be downloaded as part of the LCS distribution.
16. The lexicon can be downloaded at http://lexsrv3.nlm.nih.gov/Specialist/Summary/lexicon.html.
17. For retraining a parser, a substantial amount of annotated data (in the form of syntactically annotated dependency trees) is needed. Creating such annotations is a very expensive task and beyond the scope of this article.
18. AEGIR stands for Accurate English Grammar for Information Retrieval. Using the AGFL compiler (found at http://www.agfl.cs.ru.nl/) this grammar can be compiled into an operational parser. The grammar is not freely distributed.
19. The Unified Medical Language System contains a widely-used terminology of the biomedical domain and can be downloaded at http://www.nlm.nih.gov/research/umls/.
20. By “gain” we mean the decrease in number of types for the lemmatized forms compared to the non-lemmatized forms, which will result in higher corresponding token/type ratios.
21. Increasing the cut-off to 100,000 terms resulted in a small increase in accuracy (F1 values) for the combined representations, mostly for the larger classes. Because the patent domain has a large lexical variety, a large amount of low-frequency terms in the tail of the term distribution can have a large impact on the accuracy scores. Because we are more interested in the relative gains between different text representations and the corresponding top terms in the class profiles than in achieving maximum classification scores, we opted to use only 10,000 terms for efficiency reasons.
22. The script used to calculate the baseline can be downloaded at http://lands.let.ru.nl/~dhondt/. We used a weighted randomization that takes the category label distributions and label frequency distributions into account.
23. A class profile is the model built by the LCS classifier for a class during training. It consists of a ranked list of terms that contribute most to distinguishing members of a class from all other classes.
24. H04: P: + 3.09%; R: + 1.83%; H01: P: + 3.61%; R: + 5.14%.
25. The rank of a term is based on the decreasing order of mass assigned to that term in the class profile. (See Section 4.2.2.)
26. Precision is 77.34% for unigrams+bigrams, 75.67% for unigrams+Stanford, 73.47% for unigrams+AEGIR, and 77.38% for unigrams+bigrams+Stanford+AEGIR. The Recall scores are essentially equal for all three, that is, 68.81%, 68.38%, 69.7%, and 70.18%, respectively.
27. “Phrase” was defined as a combination of two words that both occur in the sentence, irrespective of the order in which they occur in the sentence.
28. Head–modifier pairs are syntactic triples that are stripped of their grammatical relations.
29. This result is language-dependent: English has a fairly rigid phrase-internal word order but for a more synthetic language with a more variable word order, like Russian, bigram coverage might suffer from the variation in the surface form.
References
Author notes
Center for Language Studies, PO Box 9103, 6500 HD Nijmegen, the Netherlands. E-mail: [email protected].
Center for Language Studies / Institute for Computing and Information Sciences, PO Box 9103, 6500 HD Nijmegen, the Netherlands. E-mail: [email protected].
Institute for Computing and Information Sciences, PO Box 9010, 6500 HD Nijmegen, the Netherlands. E-mail: [email protected].
Center for Language Studies, PO Box 9103, 6500 HD Nijmegen, the Netherlands. E-mail: [email protected].