Abstract
From the perspective of structural linguistics, we explore paradigmatic and syntagmatic lexical relations for Chinese POS tagging, an important and challenging task for Chinese language processing. Paradigmatic lexical relations are explicitly captured by word clustering on large-scale unlabeled data and are used to design new features to enhance a discriminative tagger. Syntagmatic lexical relations are implicitly captured by syntactic parsing in the constituency formalism, and are utilized via system combination. Experiments on the Penn Chinese Treebank demonstrate the importance of both paradigmatic and syntagmatic relations. Our linguistically motivated, hybrid approaches yield a relative error reduction of 18% in total over state-of-the-art baselines. Although hybrid systems are effective at boosting accuracy, their reliance on computationally expensive parsers makes them inappropriate for many realistic NLP applications. In this article, we are also concerned with improving tagging efficiency at test time. In particular, we explore unlabeled data to transfer the predictive power of hybrid models to simple sequence models. Specifically, hybrid systems are utilized to create large-scale pseudo training data for cheap models. Experimental results illustrate that the re-compiled models not only achieve high accuracy with respect to per-token classification, but also serve well as a front-end to a parser.
1. Introduction
In grammar, a part of speech (POS) is a linguistic category of words, generally defined by the syntactic or morphological behavior of the word in question. Automatically assigning POS tags to words plays an important role in parsing, word sense disambiguation, and many other NLP applications. Many successful tagging algorithms developed for English have been applied to other languages as well. In some cases, such as German, the methods work well without large modifications. But a number of augmentations and changes become necessary when dealing with highly inflected or agglutinative languages, as well as analytic languages such as Chinese, which is the focus of this article. The Chinese language is characterized by the lack of formal devices such as morphological tense and number that often provide important clues for syntactic processing tasks. Although state-of-the-art tagging systems have achieved accuracies above 97% on English, Chinese POS tagging has proven to be more challenging, with reported accuracies of about 93–94% (Ng and Low 2004; Tseng, Jurafsky, and Manning 2005; Huang, Harper, and Wang 2007; Huang, Eidelman, and Harper 2009; Li et al. 2011).
It is generally accepted that Chinese POS tagging often requires more sophisticated language processing techniques that are capable of drawing inferences from more subtle linguistic knowledge. From a linguistic point of view, meaning arises from the differences between linguistic units, including words, phrases, and so on, and these differences are of two kinds: paradigmatic (concerning substitution) and syntagmatic (concerning positioning). The distinction is a key one in structuralist semiotic analysis. Whereas syntagmatic relations are possibilities of combination, paradigmatic relations are functional contrasts—they involve differentiation. Both paradigmatic and syntagmatic lexical relations have a great impact on POS tagging, because the value of a word is determined by the two relations. For example, the Penn Chinese Treebank (CTB) (Xue et al. 2005)-style POS tags capture both paradigmatic and syntagmatic relations among words, given that its annotation criterion is the syntactic distribution of words.
With a linguistic motivation, we examine the impact of paradigmatic and syntagmatic lexical relations on Chinese POS tagging. Our study is motivated by the key language-specific property that Chinese is an analytic language and encodes lexical categorial information in a highly configurational rather than morphological way. This implies that capturing paradigmatic and syntagmatic relations must leverage clues from a wider range of sources than surface strings alone. In contrast, for morphologically rich languages, expressive morphological information can be found in the word strings themselves. We argue that different strategies should therefore be employed for designing tagging models for Chinese and for morphologically rich languages.
We present an error analysis of two state-of-the-art sequential taggers. The first one uses a generative hidden Markov model (HMM) that is enhanced by using latent annotations. This model is also known as symbol-refined HMM (SR-HMM). The second one is a discriminative tagger that uses linear-chain global linear models (LGLM) with rich contextual word features. Both achieve state-of-the-art performance. Our error analysis of both taggers shows that the lack of both paradigmatic and syntagmatic lexical knowledge accounts for a large part of tagging errors.
Our research is concerned with capturing paradigmatic and syntagmatic lexical relations to advance the state-of-the-art of Chinese POS tagging. Chinese, as an analytic language, encodes lexical categorial information in a highly configurational rather than morphological way. This language-specific property implies that capturing paradigmatic and syntagmatic relations must leverage clues from a wider range of sources rather than surface strings. To improve tagging performance, first, we use unsupervised word clustering to explore paradigmatic relations that are encoded in large-scale unlabeled data. Using unsupervised algorithms to acquire rich word representations, such as word clustering and word similarity calculation, is a very practical way to achieve wide-coverage lexical resources. To enhance the discriminative tagger, word clusters are explicitly utilized as new features. We are relying on the ability of discriminative learning to explore informative features that play a central role in boosting tagging performance.
Second, we study the possible impact of syntagmatic relations on POS tagging by comparatively analyzing (syntax-free) sequential tagging models and (syntax-based) parsing models in the constituency formalism. Inspired by the analysis, we use a full parser to implicitly capture syntagmatic relations and propose a simple yet effective stacking model to combine the complementary strengths of sequential taggers and parsers.
We conduct experiments on the CTB and Chinese Gigaword. We implement a discriminative sequential classification model for POS tagging that achieves state-of-the-art accuracy. Experiments show that this model is significantly improved by word cluster features in accuracy across a wide range of conditions. This confirms the importance of the paradigmatic relations. We then present a comparative study of our tagger and a constituency parser and a dependency parser, and show that the combination of heterogeneous models can significantly improve tagging accuracy. Our experiments show that stacking is a very effective method to combine the complementary strengths of heterogeneous models. This demonstrates the importance of the syntagmatic relations. Cluster-based features and the stacking model result in a relative error reduction of 18% in terms of the word classification accuracy.
Although the predictive power of hybrid systems is significantly better than that of individual systems, they are not suitable for large-scale real-world applications that have stringent time requirements. The best-performing model is slow and large, whereas fast and compact models are less accurate, because either they are not expressive enough or they overfit to the limited training data. To improve POS tagging efficiency without loss of accuracy, we explore unlabeled data to transfer the predictive power of complex, inefficient models to simple, efficient models. Specifically, hybrid systems are utilized to create large-scale pseudo training data for cheap sequence models. For the SR-HMM tagger, pseudo training data make it possible to estimate finer-grained latent variables, and for the discriminative tagger, tagging accuracy can be improved by extending the context for feature extraction.
Experiments on the CTB and Gigaword demonstrate that unlabeled data are effective for transferring the predictive power of hybrid models to simple models, including both latent variable generative models and global linear classifiers. On the one hand, the accuracy of word classification is improved to 95.34%, which is equivalent to that of the parser-integrated hybrid model. On the other hand, re-compiled models are adapted based on parsing results, and as a result their ability to capture syntagmatic lexical relations is improved, too. Unlike the purely supervised sequence models, re-compiled models also serve well as a front-end to a parser.
Our study has been partially published in Sun and Uszkoreit (2012) and Sun, Peng, and Wan (2013). For this iteration, we re-implement all models, and therefore experimental results are not exactly the same. We also release our implementation for research purposes. The related resources can be downloaded at www.icst.pku.edu.cn/lcwm/lexer.
2. Motivating Analysis
Many algorithms have been applied to computationally assigning POS labels to English words, including hand-written rules, generative HMM tagging, and discriminative sequence labeling. Such methods have been applied to many other languages as well. In some cases, such as German POS tagging, the methods work well without large modifications. But a number of augmentations and changes become necessary when dealing with Chinese, a language that has little, if any, inflectional morphology. Whereas state-of-the-art tagging systems have achieved accuracies above 97% on English, Chinese POS tagging has proven to be more challenging and obtains accuracies of about 93–94% (Tseng, Jurafsky, and Manning 2005; Huang, Harper, and Wang 2007; Huang, Eidelman, and Harper 2009; Li et al. 2011). In this section, we give a brief introduction to, and a comparative analysis of, several models that have recently been designed to resolve the Chinese POS tagging problem.
2.1 State-of-the-Art Tagging Models
2.1.1 Linear-Chain Global Linear Model (LGLM)
Discriminative learning is also an appropriate solution for Chinese POS tagging, because of its flexibility to include knowledge from multiple linguistic sources. Tseng, Jurafsky, and Manning (2005) introduced a maximum entropy–based model, which includes morphological features for unknown word recognition, and Sun (2011) studied the joint word segmentation and POS tagging problem and developed a fully discriminative method. However, they did not deeply analyze the problem from a linguistic view.
The global linear algorithm we adopt in this article is averaged perceptron (Collins 2002).
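As a rough illustration of how an averaged perceptron can be trained over a linear chain, the following minimal sketch pairs exact first-order Viterbi decoding with the perceptron update. The feature set (one word/tag feature and one tag-bigram feature), the `START` symbol, and all function names are assumptions made for this example; they are not the article's feature set or implementation, and the averaging is done naively for readability.

```python
from collections import defaultdict

START = "<s>"  # hypothetical start-of-sentence tag, not from the article

def local_features(words, i, prev_tag, tag):
    """Toy feature set: one word/tag feature and one tag-bigram feature."""
    return [("word", words[i], tag), ("trans", prev_tag, tag)]

def make_scorer(weights):
    """Return a function that scores position i under the given weights."""
    def score(words, i, prev_tag, tag):
        return sum(weights.get(f, 0.0)
                   for f in local_features(words, i, prev_tag, tag))
    return score

def viterbi(words, tags, score):
    """Exact first-order decoding over a linear chain."""
    if not words:
        return []
    best = [{t: score(words, 0, START, t) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev = max(tags, key=lambda p: best[i - 1][p] + score(words, i, p, t))
            best[i][t] = best[i - 1][prev] + score(words, i, prev, t)
            back[i][t] = prev
    seq = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        seq.append(back[i][seq[-1]])
    return seq[::-1]

def train_averaged_perceptron(data, tags, epochs=5):
    """data: list of (words, gold_tags). Returns averaged feature weights."""
    weights = defaultdict(float)
    totals = defaultdict(float)
    steps = 0
    for _ in range(epochs):
        for words, gold in data:
            pred = viterbi(words, tags, make_scorer(weights))
            if pred != gold:
                for i in range(len(words)):
                    gold_prev = gold[i - 1] if i else START
                    pred_prev = pred[i - 1] if i else START
                    # Reward features of the gold sequence, penalize the prediction.
                    for f in local_features(words, i, gold_prev, gold[i]):
                        weights[f] += 1.0
                    for f in local_features(words, i, pred_prev, pred[i]):
                        weights[f] -= 1.0
            # Naive averaging: accumulate the whole weight vector after each example.
            for f, v in weights.items():
                totals[f] += v
            steps += 1
    return {f: v / steps for f, v in totals.items()}
```

Decoding with the averaged weights is then `viterbi(words, tags, make_scorer(averaged))`. A practical implementation would typically replace the naive accumulation with lazy, timestamp-based averaging.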
2.1.2 Symbol-Refined Hidden Markov Model (SR-HMM)
Generative models with latent annotations (LAs) obtain state-of-the-art performance for a number of NLP tasks. For example, both context-free grammar (CFG) and tree-substitution grammar (TSG) with refined latent variables achieve excellent results for syntactic parsing (Matsuzaki, Miyao, and Tsujii 2005; Shindo et al. 2012). For Chinese POS tagging, Huang, Eidelman, and Harper (2009) described and evaluated a bigram HMM tagger that utilizes latent annotations. The use of latent annotations substantially improves the performance of a simple generative bigram tagger, outperforming a trigram HMM tagger with sophisticated smoothing.
2.1.3 Local Classification
A very simple approach to POS tagging is to formulate it as a local word classification problem. Various features can be drawn from information sources such as word forms and the characters that constitute words. Previous studies on many languages show that local classification is inadequate for capturing the structural information of output labels, and thus does not perform as well as structured models. The local classification algorithm we adopt in this article is a linear SVM. Because it is a local linear model, we denote it as LLM.
2.2 Evaluation
2.2.1 Setting
Penn Chinese Treebank (CTB) (Xue et al. 2005) is a popular data set to evaluate a number of Chinese NLP tasks, including word segmentation (Sun and Xu 2011), POS tagging (Huang, Harper, and Wang 2007; Huang, Eidelman, and Harper 2009), constituency parsing (Wang, Sagae, and Mitamura 2006; Zhang and Clark 2009), and dependency parsing (Zhang and Clark 2008; Huang and Sagae 2010; Li et al. 2011). We use CTB 6.0 as the labeled data for the study. The corpus was collected during different time periods from different sources with a diversity of topics. In order to obtain a representative split of data sets, we conduct experiments following the setting of the CoNLL 2009 shared task. The setting is provided by the principal organizer of the CTB project, and considers many annotation details. This setting is more robust for evaluating Chinese language processing algorithms. Table 1 shows the statistics of our experimental settings.
Table 1. Training, development, and test data on CTB 6.0.
Data set | #Sentences | #Words |
---|---|---|
Training | 22,277 | 609,060 |
Development | 1,763 | 49,620 |
Test | 2,556 | 73,153 |
To deeply analyze the POS tagging problem for Chinese, we implement a linear-chain global linear model. A majority of state-of-the-art English POS taggers are based on LGLMs, for example, structured perceptron (Collins 2002) and conditional random fields (Lafferty, McCallum, and Pereira 2001). We choose structured perceptron (Collins 2002) to estimate parameters.
2.2.2 Features for LLM and LGLM
In our experiments, we use a feature set that draws upon information sources such as word forms and the characters that constitute words. For convenience, we denote the word in focus within a fixed window w−2 w−1 w w+1 w+2, where w is the current token. Our features include:
- Word unigrams: w−2, w−1, w, w+1, w+2
- Word bigrams: w−2_w−1, w−1_w, w_w+1, w+1_w+2
- Morphological features for better handling of unknown words: character n-gram prefixes and suffixes for n up to 3
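The feature templates listed above can be sketched as a small extraction function. This is an illustrative reading of the templates rather than the released implementation: the feature-string format, the `PAD` boundary symbol, and the function name are assumptions.

```python
PAD = "<PAD>"  # hypothetical boundary symbol; the article does not specify one

def extract_features(words, i, max_affix=3):
    """Return string-valued features for the word at position i."""
    def w(k):
        j = i + k
        return words[j] if 0 <= j < len(words) else PAD

    # Word unigrams in a five-word window.
    feats = [f"w[{k:+d}]={w(k)}" for k in range(-2, 3)]
    # Word bigrams over adjacent positions in the same window.
    feats += [f"w[{k:+d}]_w[{k + 1:+d}]={w(k)}_{w(k + 1)}" for k in range(-2, 2)]
    # Character n-gram prefixes and suffixes of the current word, n = 1..max_affix.
    word = words[i]
    for n in range(1, max_affix + 1):
        if len(word) >= n:
            feats.append(f"prefix{n}={word[:n]}")
            feats.append(f"suffix{n}={word[-n:]}")
    return feats
```

For a word of at least three characters this yields 15 features: five unigrams, four bigrams, and six affix features.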
2.2.3 Overall Performance
Table 2 summarizes the performance in terms of per word classification of different supervised models on the development data. We present the results of both first- and second-order LGLMs. There is only a slight gap between the local classification model and various structured models. Although the local classifier achieves comparable results when applied to Chinese data, there is a much more significant gap between the corresponding structured models. Similarly, the gap between the first- and second-order LGLMs is very modest too.
2.3 Error Analysis
2.3.1 Correlating Tagging Accuracy with Word Frequency
Table 3 summarizes the prediction accuracy on the development data with respect to word frequency in the training data. To avoid overestimating the tagging accuracy, these statistics exclude all punctuation, which can be easily recognized. From this table, we can see that words with low frequency, especially out-of-vocabulary (OOV) words, are hard to label. Compared with a generative model, one major advantage of a discriminative model is its ability to utilize flexible features for disambiguation. This is quite important for predicting an unknown word. At the other extreme, when a word is very frequently used, its behavior is complicated and therefore hard to predict. A typical example of such words is the language-specific function word “的.” This analysis suggests that a main way to enhance Chinese POS tagging is to bridge the gap between infrequent and frequent words.
Table 3. Tagging accuracies (%) relative to word frequency.
Freq. | LLM | LGLM1 | LGLM2 | SR-HMM |
---|---|---|---|---|
0 | 78.72 | 79.77 | 80.66 | 77.49 |
1–5 | 87.75 | 87.95 | 88.13 | 87.57 |
6–10 | 90.04 | 91.04 | 91.28 | 90.69 |
11–100 | 94.49 | 94.94 | 94.80 | 94.60 |
101–1000 | 95.68 | 96.08 | 96.12 | 96.23 |
1001– | 91.81 | 93.62 | 93.94 | 93.41 |
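A breakdown of this kind can be computed with a few lines of code. The sketch below is only an illustration of the bookkeeping: the function names are hypothetical, and skipping tokens whose gold tag is `PU` (the CTB punctuation tag) is an assumption about how punctuation is excluded.

```python
from collections import Counter

# Frequency buckets as in the table; None marks an open upper bound.
BUCKETS = [(0, 0), (1, 5), (6, 10), (11, 100), (101, 1000), (1001, None)]

def bucket_of(freq):
    """Map a training-set frequency to its (low, high) bucket."""
    for low, high in BUCKETS:
        if freq >= low and (high is None or freq <= high):
            return (low, high)

def accuracy_by_frequency(train_words, dev_words, gold, pred, skip_tag="PU"):
    """Per-bucket tagging accuracy (%) on the development data."""
    freq = Counter(train_words)
    correct, total = Counter(), Counter()
    for word, g, p in zip(dev_words, gold, pred):
        if g == skip_tag:  # assumption: punctuation is identified by its gold tag
            continue
        b = bucket_of(freq[word])  # unseen words have frequency 0 (OOV)
        total[b] += 1
        correct[b] += int(g == p)
    return {b: 100.0 * correct[b] / total[b] for b in total}
```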
2.3.2 Correlating Tagging Accuracy with Span Length
In this work, we define the maximal projection of a word x as the span of words below x in the dependency tree. The key property is that a word projects its grammatical property to its maximal projection and syntactically governs all words under the span of its maximal projection. Though maximal projection is traditionally defined on deep structure by transformational generative grammarians, we can empirically borrow the idea that a word in a sentence only governs a limited domain. Measuring the area governed by a word is helpful for error analysis. The concept of maximal projection used here is adopted from our early work on semantic role labeling, where modeling this observation can even improve a practical semantic role labeler (Sun, Sui, and Wang 2008).
Table 4 shows the tagging accuracies relative to the length of the spans. The spans are calculated according to the corresponding dependency annotations converted from CTB and provided by the CoNLL shared task. We can see that with the increase of the number of words governed by the token, the difficulty of its POS prediction increases. Especially, higher-order models make better predictions for words governing larger spans. This analysis suggests that syntagmatic lexical relations play a significant role in POS tagging, and sometimes words located far from the current token significantly affect its tagging.
Table 4. Tagging accuracies (%) relative to span length. The length is defined as one plus the number of words that are dominated by the target word.
Len. | LLM | LGLM1 | LGLM2 | SR-HMM |
---|---|---|---|---|
1–2 | 92.77 | 93.51 | 93.55 | 93.37 |
3–4 | 91.97 | 92.94 | 93.13 | 92.50 |
5–6 | 91.21 | 92.29 | 92.51 | 91.62 |
7– | 93.37 | 94.17 | 94.58 | 93.77 |
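The span length used here (one plus the number of words dominated by the target word) can be derived directly from a head-index encoding of a dependency tree. The following sketch assumes 0-based head indices with -1 for the root; this encoding and the function name are choices made for the example, not the CoNLL file format itself.

```python
def span_lengths(heads):
    """heads[i] is the 0-based index of token i's head, or -1 for the root.

    Returns, for every token, one plus the number of tokens it dominates.
    """
    n = len(heads)
    children = [[] for _ in range(n)]
    for i, h in enumerate(heads):
        if h >= 0:
            children[h].append(i)

    sizes = [0] * n  # 0 means "not yet computed"

    def size(i):
        if sizes[i] == 0:
            sizes[i] = 1 + sum(size(c) for c in children[i])
        return sizes[i]

    return [size(i) for i in range(n)]
```

For a four-token sentence in which token 1 is the root, tokens 0 and 3 attach to it, and token 2 attaches to token 3, the lengths are 1, 4, 1, and 2.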
An interesting phenomenon is that the performance decline stops once the length reaches 7. The main reason is that these words usually have clear collocation words nearby, even though they govern a very large area. Typical examples are words that take a clause as their complement, such as “说/say.” It is relatively easy to label such a word, but its complement can be large. In other words, the usage of a word that is complex from one particular view is not necessarily complex from another.
2.3.3 Correlating Tagging Accuracy with POS Type
Table 5 presents F-scores of several POS types, including nouns and function words. The POS types NR, NT, and NN, respectively, represent proper nouns, temporal nouns, and other common nouns. We can clearly see that models that only explore local dependencies are good enough to deal with nouns. Surprisingly, the local classifier that does not directly define features of possible POS tags of other surrounding words performs even better than structured models for proper nouns and other common nouns.
Table 5. Tagging F1 scores relative to POS types.
Type | LLM | LGLM1 | LGLM2 | SR-HMM |
---|---|---|---|---|
NN | 94.51 | 94.77 | 94.82 | 94.32 |
NR | 93.94 | 94.37 | 94.90 | 94.42 |
NT | 97.13 | 97.41 | 97.26 | 97.56 |
DEC | 78.72 | 81.17 | 81.89 | 79.25 |
DEG | 82.35 | 85.59 | 86.61 | 84.38 |
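For reference, per-type F1 scores of this kind combine the precision and recall of each tag. A minimal sketch, assuming token-aligned gold and predicted tag sequences (the function name is hypothetical):

```python
from collections import Counter

def f1_per_tag(gold, pred):
    """Per-tag F1 (%) for aligned gold and predicted tag sequences."""
    tp, n_gold, n_pred = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        n_gold[g] += 1
        n_pred[p] += 1
        if g == p:
            tp[g] += 1
    scores = {}
    for tag in n_gold:
        precision = tp[tag] / n_pred[tag] if n_pred[tag] else 0.0
        recall = tp[tag] / n_gold[tag]
        denom = precision + recall
        scores[tag] = 100.0 * 2 * precision * recall / denom if denom else 0.0
    return scores
```

Because a DEC token mistagged as DEG lowers the recall of DEC and the precision of DEG at the same time, confusion between the two tags depresses both F1 scores.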
The tag DEC denotes a complementizer or a nominalizer, and the tag DEG denotes a genitive marker or an associative marker. These two types only include two words: “的” and “之.” The latter is mainly used in ancient Chinese. About 5.19% of the words appearing in the training data set are tagged DEC/DEG. In addition to its high frequency, “的” carries much functional information that is very important for syntactic processing. The pattern for DEC recognition is clause/verb phrase+DEC+noun phrase, and the pattern for DEG recognition is nominal modifier+DEG+noun phrase. To distinguish sentential/verbal modification phrases from nominal ones, the DEC and DEG words usually need long-range syntactic information for accurate disambiguation. We claim that the prediction performance on these two specific types is a good indicator of how well a tagging model resolves long-distance dependencies. We can see that although these taggers work relatively well on predicting content words, they cannot handle function words satisfactorily. The significant performance gap between content words and function words again suggests that syntagmatic lexical relations play an important role in POS tagging.
3. Capturing Paradigmatic Relations via Word Clustering
To bridge the gap between high- and low-frequency words, we use word clustering to acquire the knowledge about paradigmatic lexical relations from large-scale texts. Our work is also inspired by the successful application of word clustering to named entity recognition (Miller, Guinness, and Zamanian 2004) and dependency parsing (Koo, Carreras, and Collins 2008).
3.1 Word Clustering
Word clustering is a technique for partitioning sets of words into subsets of syntactically or semantically similar words. It is a useful technique to capture paradigmatic or substitutional similarity among words.
3.1.1 Clustering Algorithms
Various clustering techniques have been proposed, some of which, for example, perform automatic word clustering optimizing a maximum-likelihood criterion with iterative clustering algorithms. In this article, we focus on distributional word clustering that is based on the assumption that words that appear in similar contexts (especially surrounding words) tend to have similar syntactic distributions. Note that syntactic rather than morphological distributions are the key evidence to determine the grammatical categories of Chinese words, given that Chinese is an analytic language. Automatic word clustering has been successfully applied to many NLP problems, such as language modeling.
The main problem is that we cannot expect these independently optimized classes to correspond to syntactic structures. In the feature induction framework, this problem is partially resolved by exploring the ability of discriminative learning to automatically identify the correspondence between the two types of “word classes.” In the literature, contexts have been defined as subjective and objective relations involving the word, as the documents containing the word, or as search engine snippets for the word as a query. We derive new features for POS tagging by applying two distributional clustering methods, Brown clustering and MKCLS, both of which take surrounding words into account as contexts.

One downside of both Brown and MKCLS clustering is that they are based solely on bigram statistics, and do not consider word usage in a wider context. We choose to work with these two algorithms considering their prior success in other NLP applications. However, we expect that our approach can function with other clustering algorithms.
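To make the distributional assumption concrete, the sketch below collects left- and right-neighbor counts for each word and compares two words by the cosine of their context vectors. It is not an implementation of Brown clustering or MKCLS; it only illustrates the bigram context statistics on which such algorithms operate. The boundary symbols and function names are assumptions.

```python
import math
from collections import Counter, defaultdict

def context_vectors(sentences):
    """Count immediate left and right neighbors for every word."""
    vecs = defaultdict(Counter)
    for sent in sentences:
        padded = ["<s>"] + list(sent) + ["</s>"]  # hypothetical boundary symbols
        for i in range(1, len(padded) - 1):
            vecs[padded[i]][("L", padded[i - 1])] += 1
            vecs[padded[i]][("R", padded[i + 1])] += 1
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

In a toy corpus where “苹果” and “香蕉” both follow “吃”, their context vectors overlap, whereas a word such as “我” that occupies a different position shares no context with them.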
3.1.2 Data
Chinese Gigaword is a comprehensive archive of newswire text data that has been acquired over several years by the Linguistic Data Consortium (LDC). The large-scale unlabeled data we use in our experiments come from the Chinese Gigaword (LDC2005T14). We choose the Mandarin news text, that is, Xinhua newswire. These data cover all news published by Xinhua News Agency (the largest news agency in China) from 1991 to 2004, which contains over 473 million characters.
3.1.3 Pre-processing: Word Segmentation
Different from English and other Western languages, Chinese is written without explicit word delimiters such as space characters. To find the basic language units (i.e., words), segmentation is a necessary pre-processing step for word clustering. Our previous research showed that character-based segmentation models trained on labeled data are reasonably accurate (Sun 2010). In this work, we use a supervised segmenter introduced in Sun and Xu (2011) to process raw texts.
3.2 Improving Tagging with Cluster Features
Our discriminative sequential tagger is easily extended with arbitrary features and is therefore suitable for exploring additional features derived from other sources. We propose using word clusters as substitutes for word forms to assist the POS tagger. We rely on the ability of the discriminative learning method to explore informative features, which play a central role in boosting tagging performance. Writing wc for the cluster of the word w, we add five clustering-based features:
- Cluster unigrams: wc−1, wc, wc+1
- Cluster bigrams: wc−1_wc, wc_wc+1
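The five cluster features can be sketched in the same style as ordinary word features, with each word replaced by its cluster identifier. The dictionary `word2cluster`, the `<UNK>` and `<PAD>` symbols, and the feature-string format are assumptions for illustration; how words missing from the clustering lexicon are handled is not specified in the text.

```python
def cluster_features(words, i, word2cluster, unk="<UNK>", pad="<PAD>"):
    """Return the three cluster unigrams and two cluster bigrams for position i."""
    def wc(k):
        j = i + k
        if not 0 <= j < len(words):
            return pad
        # Assumption: words absent from the clustering lexicon share one symbol.
        return word2cluster.get(words[j], unk)

    feats = [f"wc[{k:+d}]={wc(k)}" for k in (-1, 0, 1)]
    feats.append(f"wc[-1]_wc[+0]={wc(-1)}_{wc(0)}")
    feats.append(f"wc[+0]_wc[+1]={wc(0)}_{wc(1)}")
    return feats
```

Because an unseen word and a frequent word may share a cluster identifier, these features can fire on OOV words even when the corresponding word-form features have never been observed in training.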
3.3 Evaluation
Table 6 summarizes the tagging results on the development data with different feature configurations. In this table, the symbol “+” in the Features column means that the current configuration contains both the baseline features and the new cluster-based features; the number is the total number of clusters; and the number in the #Sent column indicates how many millions of raw sentences are used to cluster words. From this table, we can clearly see the impact of word clustering features on POS tagging. The new features lead to substantial improvements over the strong supervised baseline. In particular, the word clustering information considerably narrows the gap between the local classifier and the structured prediction models. Moreover, these increases are consistent regardless of the clustering algorithm: both algorithms contribute equivalently to the overall performance. A natural strategy for extending the current experiments is to include both clustering results together; however, we find no further improvement. For each clustering algorithm, varying the total number of clusters makes little difference. When only a small amount of unlabeled data is added, semi-supervised learning yields only minor improvements. Once a comparable amount of unlabeled data is used, further increasing the unlabeled data for clustering does not change the tagging performance much.
3.4 Learning Curves
We do additional experiments to evaluate the effect of the derived features as the amount of labeled training data is varied. We use the LGLM1 model and the clustering results with “MKCLS+11.96M” setting for these experiments. Table 7 summarizes the accuracies of the systems when trained on smaller portions of the labeled data. We can see that the new features obtain consistent gains regardless of the size of the training set. The error is reduced significantly on all data sets. In other words, the word cluster features can significantly reduce the amount of labeled data required by the learning algorithm. The relative reduction is greatest when smaller amounts of the labeled data are used, and the effect lessens as more labeled data are added. This result gives a rough impression of the amount by which derived features reduce the need for supervised data, given a desired level of accuracy.
Table 7. Tagging accuracies (%) relative to sizes of training data. Size = number of sentences in the labeled training corpus. Bold identifies best performance at the given size.
Size | Supervised | +c100 | +c500 | +c1000 |
---|---|---|---|---|
100 | 66.94 | 76.78 | 71.91 | 68.59 |
500 | 77.89 | 84.19 | 82.17 | 81.14 |
1000 | 83.70 | 87.68 | 87.74 | 86.49 |
5000 | 90.57 | 91.84 | 91.93 | 91.95 |
10000 | 93.17 | 94.00 | 93.91 | 94.02 |
15000 | 93.76 | 94.42 | 94.47 | 94.39 |
20000 | 94.11 | 94.76 | 94.58 | 94.73 |
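The relative error reduction behind these comparisons is the drop in error rate divided by the baseline error rate. The short sketch below applies it to two rows of the table (the supervised and +c100 columns at 100 and 20,000 sentences); the function name is hypothetical.

```python
def relative_error_reduction(base_acc, new_acc):
    """Relative error reduction (%) when accuracy (%) rises from base_acc to new_acc."""
    base_error = 100.0 - base_acc
    return 100.0 * (new_acc - base_acc) / base_error

# Supervised vs. +c100 accuracies copied from the table above.
small = relative_error_reduction(66.94, 76.78)   # 100 training sentences
large = relative_error_reduction(94.11, 94.76)   # 20,000 training sentences
```

With these inputs the reduction is roughly 30% at 100 sentences and roughly 11% at 20,000 sentences, which matches the observation that the relative gain shrinks as more labeled data are added.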
3.5 Analysis
Word clustering derives paradigmatic relational information from unlabeled data by grouping words into different sets. As a result, the contribution of word clustering to POS tagging is two-fold. On the one hand, word clustering captures and abstracts context information. This new linguistic knowledge is thus helpful to better correlate a word in a certain context to its POS tag. On the other hand, the clustering of the OOV words to some extent fights the sparse data problem by correlating an OOV word with in-vocabulary (IV) words through their classes. To evaluate the two contributions of the word clustering, we limit entries of the clustering lexicon to only contain IV words, that is, words appearing in the training corpus. Using this constrained lexicon, we train new first-order LGLMs with “+MKCLS+11.96M” clustering and report its prediction power in Table 8. The gap between the baseline and +IV models can be viewed as the contribution of the first effect, and the gap between the +IV and +All models can be viewed as the second contribution. This result indicates that the improved predictive power partially comes from the new interpretation of a POS tag through clustering, and mainly comes from its memory of OOV words that appear in the unlabeled data.
Table 8. Tagging accuracies (%) with IV clustering.
Clusters | +c100 | +c500 | +c1000 |
---|---|---|---|
IV | 94.37 (↑0.07) | 94.41 (↑0.11) | 94.40 (↑0.10) |
All | 94.83 (↑0.46) | 94.87 (↑0.46) | 94.79 (↑0.39) |
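The two contributions can be separated with simple arithmetic on the table. In the sketch below, the baseline accuracy of 94.30 is not stated in the table itself; it is inferred by subtracting each reported IV gain from the corresponding IV accuracy, which gives the same value in all three columns. The names are illustrative.

```python
# (IV-only accuracy, all-words accuracy) per cluster setting, copied from the table.
TABLE8 = {
    "+c100": (94.37, 94.83),
    "+c500": (94.41, 94.87),
    "+c1000": (94.40, 94.79),
}
BASELINE = 94.30  # inferred: IV accuracy minus the reported IV gain

def decompose(baseline, iv_acc, all_acc):
    """Split the total gain into the IV-only part and the additional OOV part."""
    context_gain = round(iv_acc - baseline, 2)  # effect of abstracting contexts
    oov_gain = round(all_acc - iv_acc, 2)       # effect of covering OOV words
    return context_gain, oov_gain

gains = {name: decompose(BASELINE, iv, full) for name, (iv, full) in TABLE8.items()}
```

In every setting the OOV part is several times larger than the IV-only part, in line with the conclusion that most of the improvement comes from OOV words seen in the unlabeled data.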
Table 9 shows the recall of OOV words on the development data set. Only the word types appearing more than 10 times are reported. For more information about the definition of the POS tags, refer to the guideline provided by the CTB project. We give a brief illustration of the POS tags in Appendix A. The results are evaluated using the first-order LGLM tagger. The recall of almost all OOV words is improved with any kind of clustering results, especially that of proper nouns (NR) and common verbs (VV). Another interesting fact is that almost all of them are content words. This table is also helpful for understanding the impact of the clustering information on the prediction of OOV words.
Table 9. The tagging recall (%) of OOV words.
Type | #Words | Baseline | +c100 | +c500 | +c1000 |
---|---|---|---|---|---|
AD | 21 | 42.86 | 47.62 (↑) | 52.38 (↑) | 52.38 (-) |
CD | 237 | 98.73 | 98.31 (↓) | 99.16 (↑) | 98.73 (↑) |
JJ | 86 | 26.74 | 37.21 (↑) | 31.40 (↑) | 23.26 (↓) |
NN | 1012 | 85.47 | 87.06 (↑) | 88.44 (↑) | 86.86 (↑) |
NR | 863 | 81.23 | 88.30 (↑) | 85.86 (↑) | 89.92 (↑) |
NT | 21 | 57.14 | 57.14 (-) | 61.90 (↑) | 66.67 (↑) |
VA | 15 | 40.00 | 73.33 (↑) | 80.00 (↑) | 73.33 (↑) |
VV | 402 | 69.15 | 72.14 (↑) | 72.89 (↑) | 76.37 (↑) |
4. Capturing Syntagmatic Relations via Parsing
To capture syntagmatic relations among words, a straightforward idea is to use higher-order Markov models. However, the empirical evaluation on the CTB data indicates that the second-order model does not benefit much, especially when word clustering features are added. This result suggests that a linear-chain structure is too weak to capture complex syntagmatic lexical relations. Unlike lexical analysis, syntactic analysis, especially full and deep analysis, reflects the syntagmatic relations of words and phrases in sentences. We present a series of empirical studies of the tagging results of the two syntax-free sequential taggers and a state-of-the-art syntax-based parser, aiming at illuminating more precisely the impact of information about phrase structures as well as dependency structures on POS tagging. The analysis is helpful for understanding the role of syntagmatic lexical relations in POS prediction.
4.1 CFG-Based Parsing
POS tags can be taken as pre-terminals of a constituency parse tree, so a constituency parser can also provide POS information. The majority of the state-of-the-art constituency parsers are based on generative probabilistic CFG (PCFG) learning, with lexicalized (Charniak 2000; Collins 2003) or latent annotation (Matsuzaki, Miyao, and Tsujii 2005; Petrov et al. 2006) refinements. Compared with complex lexicalized parsers, the symbol-refined PCFG (SR-PCFG) parsers rely on an automatic procedure to learn refined grammars and are more robust when parsing the many non-English languages that are not well studied. For Chinese, an SR-PCFG parser achieves state-of-the-art performance and outperforms many other types of parsers (Zhang and Clark 2009). In our work, the Berkeley parser,5 an open source implementation of the SR-PCFG model, is used for experiments.
4.2 Comparing Tagging and Parsing
From a linguistic view, we can distinguish syntax-free and syntax-based models. In a syntax-based model, POS tagging is integrated into parsing, and thus (to some extent) is capable of capturing a considerable amount of long-range syntactic information. From a machine learning view, we can distinguish generative and discriminative models. Compared with generative models, discriminative models define expressive features to classify words. Note that the two generative models use latent variables to refine the output spaces, which significantly boosts the accuracy and increases the robustness of simple generative models.
Table 10 shows their overall and detailed performance with respect to representative types. In the following, we present a comparative analysis.
Tagging F1 scores relative to word classes.
Type . | LLM . | LGLM1 . | LGLM2 . | SR-HMM . | Parser . |
---|---|---|---|---|---|
NN | 94.51 | 94.77 | 94.82 | 94.32 | 93.46 |
NR | 93.94 | 94.37 | 94.90 | 94.42 | 89.76 |
NT | 97.13 | 97.41 | 97.26 | 97.56 | 96.80 |
CD | 97.26 | 97.57 | 97.63 | 97.57 | 95.50 |
VA | 79.34 | 83.25 | 84.49 | 80.57 | 81.47 |
VC | 97.10 | 97.20 | 97.00 | 96.90 | 96.01 |
AD | 93.47 | 94.53 | 94.59 | 94.81 | 94.13 |
JJ | 82.19 | 83.80 | 83.18 | 82.54 | 81.38 |
CC | 90.52 | 91.99 | 91.98 | 92.91 | 94.00 |
P | 93.51 | 94.52 | 94.35 | 95.10 | 96.19 |
DEC | 78.72 | 81.17 | 81.89 | 79.25 | 85.69 |
DEG | 82.35 | 85.59 | 86.61 | 84.38 | 88.94 |
DER | 75.86 | 77.42 | 75.00 | 83.33 | 78.05 |
DEV | 57.73 | 74.38 | 74.14 | 76.81 | 84.89 |
Overall | 93.61% | 94.30% | 94.42% | 94.08% | 93.69% |
4.2.1 Content Words vs. Function Words
Table 10 gives a detailed comparison regarding different word types. For each type of word, we report the accuracy of both solvers and compare the difference. The majority of the words that are better labeled by the tagger are content words, including nouns (NN, NR, NT), numbers (CD), predicates (VA, VC), adverbs (AD), nominal modifiers (JJ), and so on. It is worth noting that both discriminative and generative sequential taggers consistently outperform the parser. In contrast, most of the words that are better predicted by the parser are function words, including most particles (DEC, DEG, DER, DEV, AS, MSP), prepositions (P), and coordinating conjunctions (CC).
4.2.2 Open Classes vs. Closed Classes
POS can be divided into two broad supercategories: closed class types and open class types. Open classes accept the addition of new morphemes (words) through such processes as compounding, derivation, inflection, coining, and borrowing. Closed classes, on the other hand, are those that have relatively fixed membership. For example, nouns and verbs are open classes because new nouns and verbs are continually coined or borrowed from other languages, whereas DEC/DEG are two closed classes because only the function word “” is assigned to them. The discriminative model can conveniently include many features, especially features related to word formation, which are important for predicting words of open classes. Table 11 summarizes the tagging accuracies relative to IV and OOV words. These statistics exclude all punctuation marks, which can be trivially recognized. On the whole, the Berkeley parser processes IV words slightly better than our tagger, but processes OOV words significantly worse. The numbers in this table clearly show that the main weakness of the Berkeley parser is its predictive power on OOV words.
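The IV/OOV breakdown described above can be sketched as follows. This is a minimal illustration, not the evaluation script used for Table 11: the function name `iv_oov_accuracy`, the `skip_tags` parameter, and the choice to identify punctuation by the gold tag `PU` are assumptions made for the example.

```python
from typing import Dict, Iterable, Optional, Sequence, Set


def iv_oov_accuracy(
    train_vocab: Set[str],
    words: Sequence[str],
    gold: Sequence[str],
    pred: Sequence[str],
    skip_tags: Iterable[str] = ("PU",),  # assumption: punctuation is found via its gold tag
) -> Dict[str, Optional[float]]:
    """Return tagging accuracy (%) separately for IV and OOV words."""
    skip = set(skip_tags)
    total = {"IV": 0, "OOV": 0}
    correct = {"IV": 0, "OOV": 0}
    for word, g, p in zip(words, gold, pred):
        if g in skip:  # punctuation is excluded from the statistics
            continue
        # A word is in-vocabulary (IV) if it occurred in the training data.
        key = "IV" if word in train_vocab else "OOV"
        total[key] += 1
        correct[key] += int(g == p)
    # None signals that no word of that kind was seen.
    return {k: (100.0 * correct[k] / total[k] if total[k] else None) for k in total}
```

Called with the training vocabulary and the aligned word, gold-tag, and predicted-tag sequences of a test set, the function returns one accuracy per group, which makes a weakness on OOV words directly visible.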
4.2.3 Local Disambiguation vs. Global Disambiguation
Closed class words are generally function words that tend to occur frequently and often have structuring uses in grammar. These words have little lexical meaning or have ambiguous meaning, but instead serve to express grammatical relationships with other words within a sentence. They signal the structural relationships that words have to one another and are the glue that holds sentences together. Thus, they serve as important elements to the structures of sentences. The disambiguation of these words normally requires more syntactic clues, which are very hard and inappropriate for a sequential tagger to capture. Based on global grammatical inference of the whole sentence, the full parser is relatively good at dealing with structure-related ambiguities.
We conclude that a discriminative sequential tagging model can better capture local syntactic and morphological information, and the full parser can better capture global syntactic structural information. The discriminative tagging models are limited by the Markov assumption and are inadequate to correctly label structure-related words.
4.3 Impact on Parsing
The weak ability for non-local disambiguation also restricts the use of a sequence POS tagging model as a front-end module for parsing. To evaluate the impact, we use the Berkeley parser to parse sentences based on the POS tags provided by sequence models. Table 12 shows the parsing performance. Labeled bracketing precision, recall, and F-score (LP, LR, and LF) are listed. Note that the overall tagging performance of the Berkeley parser is significantly worse than that of the sequence models. However, better POS tagging does not lead to better parsing. Our experiments suggest that sequence models propagate too many errors to the parser. Moreover, the parser is very sensitive to prediction errors in some specific categories. The numbers presented in the bottom block of Table 12 give a rough illustration. These results are obtained by providing the parser with a mixed POS tagging analysis: the tags of “” predicted by the Berkeley parser are combined with the tags of all other words predicted by the sequence models. We can see that although the overall parsing quality is still worse than that of the Berkeley parser, it is better than that obtained with the sequence models alone. The performance change demonstrates the importance of the prediction of these two particular words. Our linguistic analysis can also better explain the poor performance of Chinese CCG parsing when applying the C&C parser (Tse and Curran 2012). We think the failure is mainly due to over-reliance on sequence models in both POS tagging and supertagging.
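As a quick check on how LP, LR, and LF relate, the sketch below assumes that LF is the balanced harmonic mean of LP and LR, as is standard for labeled bracketing evaluation; the text does not state the formula explicitly, and the function name `f_score` is illustrative.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# LP and LR reported for the Berkeley parser using its own tags (Table 12).
berkeley_lf = round(f_score(82.44, 80.31), 2)
print(berkeley_lf)
```

Under this assumption, the LP and LR values of the Berkeley(ALL) row reproduce its reported LF of 81.36.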
Parsing accuracies (%) on the development data.
Devel. . | LP . | LR . | LF . |
---|---|---|---|
Berkeley(ALL) | 82.44 | 80.31 | 81.36 |
LLM(ALL) | 79.17 | 78.46 | 78.82 (↓) |
LGLM1(ALL) | 80.28 | 79.52 | 79.90 (↓) |
LGLM2(ALL) | 79.59 | 80.58 | 80.08 (↓) |
SR-HMM(ALL) | 80.59 | 79.35 | 79.96 (↓) |
LLM(non-De)+Berkeley(De) | 81.12 | 79.18 | 79.64 (↓) |
LGLM1(non-De)+Berkeley(De) | 80.82 | 79.94 | 80.38 (↓) |
LGLM2(non-De)+Berkeley(De) | 81.14 | 80.07 | 80.60 (↓) |
SR-HMM(non-De)+Berkeley(De) | 81.32 | 80.01 | 80.66 (↓) |
4.4 Enhancing Tagging via Stacking
We study a simple way of integrating multiple heterogeneous models in order to exploit their complementary strengths and thereby improve tagging accuracy beyond what is possible with either model in isolation. The method integrates the heterogeneous models by allowing the outputs of SR-HMM and the parser to define features for the LLM/LGLM. Similar to our work on combining a sequence model and a parser, Rush et al. (2010) proposed a principled decoding technique based on dual decomposition to take advantage of heterogeneous models. There are two differences between their model and ours. First, the base models to be combined are separately trained in their solution. In other words, one key difference is whether to allow integration of base models at learning time. Second, the application of the decomposition technique depends on the solvability of sub-problems. This technique, therefore, is not as flexible as stacking.
4.4.1 Stacked Learning
Stacked generalization is a meta-learning algorithm that was first proposed in Wolpert (1992) and Breiman (1996). Stacked learning has been applied as a system ensemble method in several NLP tasks, such as joint word segmentation and POS tagging (Sun 2011), and dependency parsing (Nivre and McDonald 2008). The idea is to include two “levels” of predictors. The first level includes one or more predictors $g_1, \ldots, g_K : \mathbb{R}^d \to \mathbb{R}$; each receives input $x \in \mathbb{R}^d$ and outputs a prediction $g_k(x)$. The second level consists of a single function $h : \mathbb{R}^{d+K} \to \mathbb{R}$ that takes as input $\langle x, g_1(x), \ldots, g_K(x) \rangle$ and outputs a final prediction $\hat{y} = h(x, g_1(x), \ldots, g_K(x))$. The predictor, then, combines an ensemble (the $g_k$'s) with a meta-predictor ($h$).
Training is done as follows. The training data $S = \{(x_t, y_t) : t \in [1, T]\}$ are split into $L$ equal-sized disjoint subsets $S^1, \ldots, S^L$. Then functions $g^1, \ldots, g^L$ (where $g^l = \langle g^l_1, \ldots, g^l_K \rangle$) are separately trained on $S - S^l$, and are used to construct the augmented data set $\hat{S} = \{(\langle x_t, \hat{y}^1_t, \ldots, \hat{y}^K_t \rangle, y_t) : \hat{y}^k_t = g^l_k(x_t) \text{ and } x_t \in S^l\}$. Finally, each $g_k$ is trained on the original data set and the second-level predictor $h$ is trained on $\hat{S}$. The intent of the cross-validation scheme is that $\hat{y}^k_t$ is similar to the prediction produced by a predictor which is learned on a sample that does not include $x_t$.
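The cross-validation scheme can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the implementation used in the experiments: folds are formed by striding over the data (so they are only approximately equal in size), and `train_lookup_tagger` is a toy stand-in for a real level-0 learner such as the SR-HMM or the parser.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Example = Tuple[Sequence[str], Sequence[str]]  # (words, gold tags)
Tagger = Callable[[Sequence[str]], List[str]]  # words -> predicted tags
Trainer = Callable[[List[Example]], Tagger]    # training set -> tagger


def train_lookup_tagger(examples: List[Example], default: str = "NN") -> Tagger:
    """Toy level-0 learner (hypothetical): remember the last tag seen per word."""
    lexicon: Dict[str, str] = {}
    for words, tags in examples:
        lexicon.update(zip(words, tags))
    return lambda words: [lexicon.get(w, default) for w in words]


def build_augmented_data(
    data: List[Example], level0_trainers: List[Trainer], num_folds: int
) -> List[Tuple[Sequence[str], List[List[str]], Sequence[str]]]:
    """Attach to every sentence the level-0 predictions of models that never saw it."""
    augmented = []
    for fold in range(num_folds):
        held_out = [ex for i, ex in enumerate(data) if i % num_folds == fold]
        rest = [ex for i, ex in enumerate(data) if i % num_folds != fold]
        # Train every level-0 model on the data outside the current fold.
        models = [train(rest) for train in level0_trainers]
        for words, gold in held_out:
            predictions = [model(words) for model in models]
            augmented.append((words, predictions, gold))
    return augmented
```

The second-level predictor is then trained on the returned triples, while the level-0 models used at test time are retrained on the full training set.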
This framework has also been explored as a solution for learning long-range features for dependency parsing (Torres Martins et al. 2008). In machine learning research, stacked learning has been applied to structured prediction. For example, Cohen and Carvalho (2005) described a sequential learning scheme called “stacked sequential learning.” In that meta-learning algorithm, an arbitrary base learner is augmented so as to make it aware of the labels of nearby examples. In this work, stacked learning is used to acquire extended training data for sub-word tagging.
4.4.2 Applying Stacking to POS Tagging
We use the LLMs or LGLMs (as h) for the level-1 processing, and other models (as gk) for the level-0 processing. The nature of discriminative learning makes it very easy for LLMs/LGLMs to integrate the outputs of other models as new features. L is set to 5 (i.e., five-fold cross-validation) to generate augmented training data for estimating h. We rely on the ability of discriminative learning to exploit informative features, which play a central role in boosting the tagging accuracy. For the output labels produced by each auxiliary model, five new label uni/bigram features are added: w−1, w, w+1, w−1_w, w_w+1. This choice is tuned on the development data.
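The five label features can be illustrated as follows. This is a sketch under assumptions: the feature string format, the `<PAD>` symbol for positions outside the sentence, and the function name `stacked_label_features` are all invented for the example; only the five templates themselves come from the description above.

```python
from typing import List, Sequence


def stacked_label_features(aux_tags: Sequence[str], i: int, model: str) -> List[str]:
    """Label uni/bigram features at position i from one auxiliary model's output."""

    def tag(j: int) -> str:
        # Positions outside the sentence get a padding symbol (an assumption).
        return aux_tags[j] if 0 <= j < len(aux_tags) else "<PAD>"

    prev_tag, cur_tag, next_tag = tag(i - 1), tag(i), tag(i + 1)
    return [
        f"{model}:t[-1]={prev_tag}",
        f"{model}:t[0]={cur_tag}",
        f"{model}:t[+1]={next_tag}",
        f"{model}:t[-1]_t[0]={prev_tag}_{cur_tag}",
        f"{model}:t[0]_t[+1]={cur_tag}_{next_tag}",
    ]
```

Prefixing each feature with the name of the auxiliary model keeps the features derived from the SR-HMM and from the parser distinct, so the level-1 model can weight the two sources independently.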
4.4.3 Evaluation
Table 13 summarizes the tagging accuracy of different stacking models. From this table, we can clearly see that the new features derived from the outputs of other models lead to substantial improvements over the baseline LLM/LGLM. The output structures provided by the SR-PCFG model are the most effective in improving the LLM/LGLM baseline systems. Among the different stacking models, the syntax-free hybrid one (i.e., stacking LLM/LGLM with SR-HMM) does not need any treebank for training. For situations in which parsers are not available, this is a good solution. Moreover, the decoding algorithms for linear-chain Markov models are very fast. Therefore the syntax-free hybrid system is more appealing for many NLP applications.
Tagging accuracies (%) of different stacking models on the development data.
. | LLM . | LGLM1 . | LGLM2 . |
---|---|---|---|
+SR-HMM | 94.59 | 94.80 | 94.83 |
+SR-PCFG | 94.83 | 95.06 | 95.04 |
+Word clustering+SR-HMM | 95.03 | 95.11 | 95.12 |
+Word clustering+SR-PCFG | 95.40 | 95.45 | 95.54 |
Table 14 shows the F1 scores of the DEC/DEG prediction obtained by different stacking models. Compared with Table 10, we can see that the hybrid sequence model is still not good at handling long-distance ambiguities. As a result, it still does not serve the parser well, though it achieves higher overall precision. On the other hand, the syntax-based hybrid model can refine the POS tags returned by the same parser, and therefore improve the final parsing results. In other words, by parsing twice, we can obtain better phrase-structure trees.
F1 score of the DEC/DEG prediction and parsing performance of different stacking models on the development data.
. | DEC . | DEG . | LP . | LR . | LF . |
---|---|---|---|---|---|
LGLM1(SR-HMM) | 82.72 | 86.99 | 81.13% | 80.12% | 80.63 (↓) |
LGLM1(SR-PCFG) | 87.05 | 90.06 | 82.66% | 81.20% | 81.92 (↑) |
4.5 Combining Both
We have introduced two separate improvements for Chinese POS tagging that capture different types of lexical relations. We therefore expect further improvement from combining both enhancements, since their contributions to the task are different. We still use the stacking model to integrate the discriminative tagger and the Berkeley parser. The only difference between the current experiment and the previous one is that the discriminative models are trained with the help of word clustering features. The last line of Table 13 also shows the performance of the new hybrid models on the development data set. We can see that the improvements that come from the two methods, namely, capturing syntagmatic and paradigmatic relations, do not overlap much, and their combination yields a better result.
5. Reducing Hybrid Models to Sequence Models
We have shown that higher accuracy can be achieved by applying learning techniques to capture deep lexical relations. In particular, syntagmatic lexical relations have been shown to play an essential role in Chinese POS tagging. To capture such relations, we utilize hybrid models that obtain such information from a syntactic parser. However, it is inappropriate to use computationally expensive parsers to improve POS tagging for many realistic NLP applications, mainly because of efficiency considerations. In this section, we investigate the feasibility of capturing some longer-distance dependencies in a sequence model.
5.1 The Idea
We explore unlabeled data to transfer the predictive power of hybrid models to sequence models. The main idea behind this is to use a fast model to approximate the function learned by a slower, larger, but better-performing ensemble model. Unlike the true function that is unknown, the function learned by a high-performing model is available and can be used to label large amounts of pseudo data. A fast and expressive model trained on large-scale pseudo data will not overfit and will approximate the function learned by the high performing model well. This allows a slow, complex model such as a massive ensemble to be compressed into a fast sequence model such as a first-order LGLM with very little loss in performance.
This idea to use unlabeled data to transfer the predictive power of one model to another has been investigated in many areas, for example, from high accuracy neural networks to more interpretable decision trees (Craven 1996), from high accuracy ensembles to faster and more compact neural networks (Bucila, Caruana, and Niculescu-Mizil 2006), from structured prediction models to local classification models (Liang, Daumé, and Klein 2008), or from complicated parsing models to simpler ones (Petrov et al. 2010).
5.2 Applying Structured Compilation to POS Tagging
We conduct experiments to explore the feasibility of reducing hybrid tagging models to an SR-HMM or LGLM for Chinese POS tagging. The large-scale unlabeled data we use in our experiments come from the Chinese Gigaword. We choose the Mandarin news text (i.e., Xinhua newswire). We tag Gigaword sentences by applying a successful model, namely, the stacked second-order LGLMs with the Berkeley parser. According to our evaluation, the automatically annotated texts obtained in this step are of relatively high quality. Note that this step is somewhat time-consuming, given that a chart parser is used. We view the annotated results as pseudo training data, which are imperfect but still of high quality. Such pseudo training data can be very large in scale, both theoretically and practically. Together with gold-standard training data, large-scale pseudo training data can be used to train SR-HMMs and LGLMs. We expect that the SR-HMM tagger can be improved by exploring better latent variables, and that the discriminative taggers can be improved by using features in a larger context.
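The overall recipe (tag unlabeled text with the slow hybrid model, then train a cheap model on gold plus pseudo data) can be sketched as below. This is a schematic illustration only: `compile_structure` and `train_student` are hypothetical names, and the student here is a toy most-frequent-tag lexicon standing in for an SR-HMM or LGLM.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

Example = Tuple[List[str], List[str]]          # (words, tags)
Tagger = Callable[[Sequence[str]], List[str]]  # words -> predicted tags


def train_student(examples: List[Example], default: str = "NN") -> Tagger:
    """Toy student (hypothetical): tag each word with its most frequent tag."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for words, tags in examples:
        for word, tag in zip(words, tags):
            counts[word][tag] += 1
    lexicon = {word: c.most_common(1)[0][0] for word, c in counts.items()}
    return lambda words: [lexicon.get(w, default) for w in words]


def compile_structure(
    teacher: Tagger,
    gold: List[Example],
    unlabeled: List[List[str]],
    trainer: Callable[[List[Example]], Tagger],
) -> Tagger:
    """Label raw sentences with the teacher, then train the student on gold + pseudo data."""
    pseudo = [(list(words), teacher(words)) for words in unlabeled]
    return trainer(gold + pseudo)
```

The expensive teacher is run only once, offline, over the unlabeled corpus; at test time only the cheap student is needed.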
5.3 Beam Decoding for LGLM
For a number of NLP tasks, including tagging and parsing, the generic beam-search technique has been shown to be very powerful for building efficient systems with comparable accuracy (Zhang and Clark 2011). In our model re-compilation case, training a second-order LGLM on a very large data set is quite time-consuming. Rather than the Viterbi algorithm, we therefore use the beam search algorithm for decoding. Beyond simple beam decoding that essentially implements the greedy search strategy, Huang and Sagae (2010) discuss how the state-merging strategy that is used by dynamic programming methods can be applied to enhance a beam decoder. Considering that the total number of possible tags is much larger than in conventional tagging, we implement a beam-search algorithm with state merging for our discriminative tagger.
In a second-order model, the basic factor contains three consecutive tags, but only the last two influence future decoding. That means that all partial tag sequences with the same last two tags can be merged together. Specifically, at each decoding step, our decoder first generates all new partial tag sequences by labeling the next word, then the top-b sequences with different last two tags are collected for future prediction while others are thrown away. With the state-merging strategy, our beam decoder can perform dynamic programming too. Note that when the beam width is large enough, our decoding algorithm actually searches the whole space and is exactly a Viterbi decoder.
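The decoding procedure described above can be sketched as follows. This is a minimal illustration, not the decoder used in the experiments: the scoring interface `score(words, i, t2, t1, t)`, the start symbol `<s>`, and the function name `beam_decode` are assumptions made for the example.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# score(words, i, tag_two_back, previous_tag, tag) -> local score of a second-order factor
ScoreFn = Callable[[Sequence[str], int, str, str, str], float]
State = Tuple[str, str]  # the last two tags of a partial sequence


def beam_decode(
    words: Sequence[str], tags: Sequence[str], score: ScoreFn, beam_width: int
) -> List[str]:
    """Beam search in which partial sequences sharing their last two tags are merged."""
    start = "<s>"  # hypothetical start-of-sentence symbol
    beam: Dict[State, Tuple[float, List[str]]] = {(start, start): (0.0, [])}
    for i in range(len(words)):
        candidates: Dict[State, Tuple[float, List[str]]] = {}
        for (t2, t1), (total, seq) in beam.items():
            for t in tags:
                new_total = total + score(words, i, t2, t1, t)
                state = (t1, t)
                # State merging: keep only the best sequence per (t1, t) pair.
                if state not in candidates or new_total > candidates[state][0]:
                    candidates[state] = (new_total, seq + [t])
        # Keep the top-b merged states and discard the rest.
        ranked = sorted(candidates.items(), key=lambda kv: kv[1][0], reverse=True)
        beam = dict(ranked[:beam_width])
    return max(beam.values(), key=lambda entry: entry[0])[1]
```

Because at most one hypothesis survives per pair of final tags, setting `beam_width` to at least the square of the tag set size keeps every merged state, and the search then coincides with exact dynamic programming.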
5.4 Multi-View Learning with Unlabeled Data
The key for the success of hybrid tagging models is the existence of a large diversity among learners. Zhou (2009) argued that when there are many labeled training examples, unlabeled instances are still helpful for hybrid models because they can help to increase the diversity among the base learners. The author also briefly introduced a preliminary theoretical study. In this work, we also combine the re-trained models to see if we can benefit more. The final combination is very simple: We utilize voting as the strategy for final combination. In the tagging phase, the re-trained LGLM and SR-HMM systems with different settings output multiple tagging results, in which each word is assigned one POS label. The final tagging is the voting result of these labels.
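Per-token voting can be sketched as follows. The text does not say how ties are resolved, so the rule used here (prefer the label of the earliest voter among the tied labels) is an assumption, as is the function name `vote`.

```python
from collections import Counter
from typing import List, Sequence


def vote(taggings: Sequence[Sequence[str]]) -> List[str]:
    """Combine several taggings of one sentence by per-token majority vote."""
    if len({len(tagging) for tagging in taggings}) > 1:
        raise ValueError("all voters must tag the same number of words")
    result = []
    for labels in zip(*taggings):
        counts = Counter(labels)
        best = max(counts.values())
        # Tie-breaking (an assumption): the earliest voter's label wins.
        result.append(next(label for label in labels if counts[label] == best))
    return result
```

With three voters, a tie can only occur when all three disagree, in which case this rule simply falls back to the first voter, so the ordering of the voters determines which model is trusted most.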
5.5 Evaluation
5.5.1 Reducing Hybrid Models to SR-HMMs
With the increase of (pseudo) training data, an SR-HMM may learn better latent variables to subcategorize POS tags, which could significantly improve a purely supervised SR-HMM. In our experiments, SR-HMM models are trained with six, seven, and eight iterations of split, merge, and smooth. Table 15 shows the performance of the re-trained SR-HMMs. The first column is the number of pseudo sentences, and the second column lists the number of words. The pseudo sentences are selected from the Xinhua news section of the Chinese Gigaword. We can clearly see that the idea of leveraging unlabeled data to transfer the predictive ability of the hybrid model works. Self-training can also slightly improve an SR-HMM (Huang, Eidelman, and Harper 2009). Our auxiliary experiments show that self-training is not as effective as our structure compilation method.
Tagging accuracies (%) of re-compiled SR-HMM models on the development data. “I-x” denotes the number (x) of split-merge-smooth iterations for training. Bold identifies best performance results.
With the increase of training iterations, finer-grained latent variables are estimated and they can enhance tagging. Note that the training procedure on the purely supervised setting obtains the best tagging results at iteration 6. More training data, even if it is not perfect, can improve the generative learning process. The table also presents the performance with respect to DEC/DEG disambiguation. The results suggest that finer-grained latent variables lead to better long-range disambiguation.
5.5.2 Reducing Hybrid Models to LGLMs
To increase the expressive power of a discriminative classification model, we extend the feature templates. This strategy is proposed by Liang, Daumé, and Klein (2008). In our experiments, we increase the window size of word uni-/bigram features to approximate longer distance dependencies. For window size 3, we will add w−3, w3, w−3w−2, and w2w3 as new features; for size 4, we will add w−4, w−3, w3, w4, w−4w−3, w−3w−2, w2w3, and w3w4. Using features derived from a longer window is harmful when only limited labeled data are available. That is why we only use these features in the structure compilation setting. Table 16 shows the performance of the re-compiled first- and second-order LGLMs. The “+MKCLS+11.96M” algorithm is used to provide word clustering information, and the number of total clusters is 500. Similar to the generative model, the discriminative LGLM tagger can be improved too. The second-order model performs slightly better than the first-order one. Considering the decoding time is equivalent because of the fixed beam width, the second-order model is a better choice for application.
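The window extension can be made concrete with a small sketch. It assumes that a window of size n contains every word unigram at offsets −n to n and every adjacent word bigram inside that span, which is consistent with the features listed above as additions for sizes 3 and 4; the feature string format, the `<PAD>` symbol, and the function name `window_features` are illustrative.

```python
from typing import List, Sequence


def window_features(words: Sequence[str], i: int, window: int) -> List[str]:
    """Word unigram and adjacent-bigram features in a symmetric window around position i."""

    def word(j: int) -> str:
        # Positions outside the sentence get a padding symbol (an assumption).
        return words[j] if 0 <= j < len(words) else "<PAD>"

    feats = [f"w[{k}]={word(i + k)}" for k in range(-window, window + 1)]
    feats += [
        f"w[{k}]w[{k + 1}]={word(i + k)}_{word(i + k + 1)}"
        for k in range(-window, window)
    ]
    return feats
```

Going from window size 2 to 3 adds exactly four templates under this scheme (two unigrams and two bigrams), and going from 2 to 4 adds eight, matching the lists given above.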
Tagging accuracies (%) of re-compiled LGLM1 and LGLM2 models on the development data. The beam size is set to 4. “win=x” denotes the window size (x) of word uni-/bigrams for feature extraction. Bold identifies best performance results.
In these experiments, we set the beam width for decoding to 4. Our auxiliary experiments show that “beam search with state merging” is quite effective, even with a very small beam size. We vary the beam width and present the results in Table 17.
Tagging accuracies (%) relative to beam width on the development data. The LGLM2 model is applied.
#Sent . | Beam . | win=2 . | win=3 . | win=4 . |
---|---|---|---|---|
500K | 8 | 95.19 | 95.31 | 95.26 |
500K | 16 | 95.18 | 95.31 | 95.31 |
500K | 32 | 95.22 | 95.25 | 95.34 |
500K | 64 | 95.20 | 95.30 | 95.27 |
Compared with the generative model, the re-compiled discriminative model is more effective and more efficient. Although the time complexity for the SR-HMM is linear with respect to the number of words contained in a sentence, the practical running time is influenced by the number of latent variables. Even if we expect further accuracy improvements via adding more data and using more split-merge-smooth iterations to get more effective latent variables, such a setting will significantly affect the tagging efficiency. On the other hand, the beam decoder for the discriminative model achieves equivalent tagging accuracies to the Viterbi decoder. As a result, the efficiency of both training and testing can be guaranteed. Another advantage of the discriminative tagger is its relatively good prediction power of the longer-distance dependencies. The best re-compiled LGLM2 obtains better DEC/DEG prediction than the Berkeley parser.
5.5.3 Voting
Table 18 shows the final voting results of the SR-HMM and LGLM. We use three base models for combination, which is the minimum for performing voting. In other words, the final tagging is the voting result of these three labels. The re-trained models are still diverse and complementary, so voting can further improve the sequence models. The result of the best hybrid sequence model is equivalent to that of the best stacking models.
Tagging accuracies (%) of the voting models on the development data. Bold identifies best performance results.
Voter 1 . | Voter 2 . | Voter 3 . | Acc. . |
---|---|---|---|
SR-HMM, I-8, 1000K | SR-HMM, I-7, 1000K | LGLM, win=4, 1000K | 95.17 |
SR-HMM, I-8, 1000K | LGLM, win=4, 1000K | LGLM, win=3, 1000K | 95.45 |
SR-HMM, I-8, 1000K | LGLM, win=4, 1000K | LGLM, win=4, 500K | 95.54 |
5.5.4 Improved Parsing
There are two ways for sequence models to encode long-range information. On one hand, the models can be built upon high-order linear structures (e.g., Ye et al. 2009). One of the main challenges of this solution is its high computational complexity. On the other hand, sequence models can incorporate features extracted from a larger context (e.g., by extending the window size). This solution cannot work well if only a limited amount of annotated data is available. The key idea underlying structure compilation is to appropriately utilize automatically annotated data to estimate weights for more contextual features. Because features extracted from a larger context provide important clues for detecting longer-distance relationships, a re-compiled sequence model can approximate the behavior of a parser to some extent.
Purely supervised sequence models are not good at predicting function words, and accordingly are not good enough to be used as front-end modules to parsers. The re-compiled models can mimic some behaviors of parsers, and therefore are suitable for parsing. In particular, we have seen that the predictive power for function word disambiguation is enhanced significantly. Our evaluation shows that, with this significant improvement, POS tagging stops harming syntactic parsing. Results in Table 19 6 indicate that the parsing accuracy of the Berkeley parser can be improved simply by providing the parser with the re-trained sequential tagging results. Additionally, successfully separating tagging from parsing can improve the efficiency of syntactic processing.
Accuracies (%) of parsing based on re-compiled tagging. Column “SC” denotes whether structure compilation is applied.
. | SC . | LP . | LR . | LF . |
---|---|---|---|---|
Berkeley | - - | 82.44 | 80.31 | 81.36 |
SR-HMM | NO | 80.59 | 79.35 | 79.96 (↓) |
LGLM2 | NO | 79.59 | 80.58 | 80.08 (↓) |
SR-HMM | YES | 82.86 | 80.60 | 81.22 (↓) |
LGLM2 | YES | 82.50 | 81.39 | 81.94 (↑) |
Voting | YES | 82.57 | 81.47 | 82.01 (↑) |
5.5.5 Final Results
Table 20 shows the performance of different systems evaluated on the test data. Our final sequence model achieves the state-of-the-art performance, which is obtained by combining a state-of-the-art parser as well as sequence models.
5.5.6 Comparison with Other Taggers
We compare our final sequence labeling–based tagger to other representative taggers. Though most research papers report experiments on CTB, they usually define different training/development/test sets. Nevertheless, the reported numbers still reflect the accuracy level of existing systems and our tagger. The first three taggers for comparison are based on the joint POS tagging and dependency parsing architecture, which is able to leverage rich syntactic information to capture syntagmatic relations. They also use global linear models for disambiguation, given that such discriminative learning methods achieve state-of-the-art results for both tagging and parsing. The major difference between these three taggers is the corresponding parsing approach: They apply transition-based, graph-based, and easy-first methods, respectively. Table 21 presents the results. We can see that our re-compiled tagger achieves significantly better results, though it utilizes a simpler technique (i.e., sequence labeling) and does not explicitly use syntactic information.
Comparison with other taggers. Tagging accuracies are all evaluated on CTB, but different training and test data sets are used.
System . | Architecture . | Learning . | Acc. (%) . |
---|---|---|---|
Ours | Sequential Tagging | Linear | 95.34 |
Hatori et al. 2011 | Transition-based Joint Tagging & Parsing | Linear | 94.01 |
Li et al. 2011 | Graph-based Joint Tagging & Parsing | Linear | 93.08 |
Ma et al. 2012 | Easy-first Joint Tagging & Parsing | Linear | 94.27 |
Very recently, neural networks have been widely applied to various NLP tasks, including word segmentation (Chen et al. 2015; Ma and Hinrichs 2015), syntactic parsing (Chen and Manning 2014; Weiss et al. 2015), and machine translation (Devlin et al. 2014). We also compare our tagger with a neural network–based tagger. Alberti et al. (2015) introduced a neural network–based joint tagging and parsing model that obtains state-of-the-art results on multiple languages. Table 22 shows the results. Because their experiments used the data from the CoNLL 2009 shared task, their results are directly comparable to ours. We can see that our final tagger is significantly better than this recently developed neural network–based system.
Comparison with other taggers. Tagging accuracies are obtained on the test data of CoNLL 2009 shared task.
System . | Architecture . | Learning . | Acc. (%) . |
---|---|---|---|
Ours | Sequential Tagging | Linear | 95.34 |
Alberti et al. 2015 | Transition-based Joint Tagging & Parsing | Neural | 94.62 |
6. Related Work
Many successful tagging algorithms designed for English have been applied to many other languages as well. In some cases, the methods work well without large modifications, such as for German POS tagging. But a number of augmentations and changes became necessary when dealing with highly inflected or agglutinative languages, as well as analytic languages, of which Chinese is the focus of this article.
Both discriminative and generative models have been explored for accurate Chinese POS tagging (Ng and Low 2004; Tseng, Jurafsky, and Manning 2005; Huang, Harper, and Wang 2007; Huang, Eidelman, and Harper 2009). Ng and Low (2004) and Tseng, Jurafsky, and Manning (2005) introduced maximum entropy–based models that include morphological features for unknown word recognition. Huang, Harper, and Wang (2007) and Huang, Eidelman, and Harper (2009) mainly focused on generative HMMs. To enhance a trigram HMM, Huang, Harper, and Wang (2007) proposed a re-ranking procedure that includes both morphological and syntactic structure features, which are difficult for a generative model to capture. In contrast to this discriminative re-ranking strategy, Huang, Eidelman, and Harper (2009) proposed a model that incorporates latent variables to improve a bigram HMM.
Recently, researchers have developed several models that integrate tagging into parsing (Hatori et al. 2011; Li et al. 2011; Bohnet and Nivre 2012; Ma et al. 2012; Alberti et al. 2015). On the one hand, the joint decoding architecture allows tagging to use rich syntactic features to improve accuracy; on the other hand, it decreases decoding efficiency. Unlike the joint tagging and parsing approach, our method does not explicitly use syntactic features in the tagging phase. Only a simple sequence labeler with beam search is applied, and our tagger is therefore much more efficient.
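To make the efficiency argument concrete, the following is a minimal sketch of left-to-right sequence labeling with beam search. It is an illustration under stated assumptions rather than the tagger used in this article: the function name `beam_search_tag`, the `score` callback, and the toy weights in `toy_score` are all hypothetical, and a real tagger would score each decision with a learned linear model over rich features.

```python
from typing import Callable, List, Sequence, Tuple


def beam_search_tag(
    words: Sequence[str],
    tags: Sequence[str],
    score: Callable[[Sequence[str], int, str, str], float],
    beam_size: int = 4,
) -> List[str]:
    """Tag `words` left to right, keeping the `beam_size` best partial sequences."""
    beam: List[Tuple[float, List[str]]] = [(0.0, [])]
    for i in range(len(words)):
        candidates: List[Tuple[float, List[str]]] = []
        for total, history in beam:
            prev = history[-1] if history else "<s>"  # sentence-start marker
            for tag in tags:
                step = score(words, i, prev, tag)
                candidates.append((total + step, history + [tag]))
        # Keep only the highest-scoring partial tag sequences.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:beam_size]
    return beam[0][1]


def toy_score(words: Sequence[str], i: int, prev_tag: str, tag: str) -> float:
    """Hypothetical hand-set weights standing in for a learned linear model."""
    lexical = {("我", "PN"): 2.0, ("爱", "VV"): 2.0, ("北京", "NR"): 2.0}
    transition = {("PN", "VV"): 1.0, ("VV", "NR"): 1.0}
    return lexical.get((words[i], tag), 0.0) + transition.get((prev_tag, tag), 0.0)


result = beam_search_tag(["我", "爱", "北京"], ["PN", "VV", "NR"], toy_score)
print(result)  # ['PN', 'VV', 'NR']
```

With a fixed beam size and tag set, the number of scored candidates grows linearly with sentence length, which is the main reason such a labeler is cheaper than joint tagging and parsing.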
Our work also borrows some ideas from investigations in Chinese word segmentation. Notably, the idea of harvesting string knowledge from large-scale raw texts to define new features for disambiguation was also successfully applied in our earlier work on semi-supervised segmentation (Sun and Xu 2011). Recently, neural network models have been widely applied to induce various kinds of linguistic knowledge in an unsupervised fashion. Such models have also been applied to word segmentation (Zheng, Chen, and Xu 2013; Chen et al. 2015; Ma and Hinrichs 2015). As an alternative way to exploit unlabeled data, neural network models could also be applied in our solution.
7. Conclusion
Chinese POS tagging has proven to be much more challenging than English POS tagging because of language-specific properties. We adopt the view of structuralist linguistics and study the impact of paradigmatic and syntagmatic lexical relations on Chinese POS tagging. First, we harvest word partition information from large-scale raw texts to capture paradigmatic relations and use this knowledge to enhance a supervised tagger via feature engineering. Second, we comparatively analyze syntax-free and syntax-based models and use a stacking model to integrate a sequential tagger and a chart parser, capturing syntagmatic relations that have a great impact on non-local disambiguation. Both enhancements significantly improve on the state of the art in Chinese POS tagging. The final model yields an error reduction of 18% over a state-of-the-art baseline. To improve tagging efficiency at test time, we explore unlabeled data to transfer the predictive power of hybrid models to simple sequence models or even local classification models. Hybrid systems are used to create large-scale pseudo training data for cheap models. By applying complex machine learning techniques, we are able to build good sequential POS taggers. Another advantage of our system is that it serves very well as a front-end to a parser: more accurate POS tagging yields more accurate phrase-structure parsing.
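The re-compilation idea summarized above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the system used in the experiments: `expensive_tagger` is a hypothetical stand-in for the hybrid model, `recompile` and `train_cheap_tagger` are names introduced here, and the most-frequent-tag lookup is a deliberately trivial substitute for the cheap sequence model.

```python
from collections import Counter, defaultdict


def train_cheap_tagger(tagged_sentences):
    """Train a most-frequent-tag-per-word model (stand-in for a cheap tagger)."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def recompile(gold, unlabeled, expensive_tagger):
    """Label raw sentences with the expensive model, then train the cheap one."""
    pseudo = [list(zip(sent, expensive_tagger(sent))) for sent in unlabeled]
    return train_cheap_tagger(gold + pseudo)


def expensive_tagger(sentence):
    # Hypothetical stand-in for the hybrid tagger; a fixed lookup for the demo.
    lookup = {"我": "PN", "爱": "VV", "上海": "NR"}
    return [lookup.get(word, "NN") for word in sentence]


gold = [[("我", "PN"), ("爱", "VV"), ("北京", "NR")]]  # small labeled set
unlabeled = [["我", "爱", "上海"]]                      # raw, untagged text

cheap_model = recompile(gold, unlabeled, expensive_tagger)
print(cheap_model["上海"])  # NR, learned only from the pseudo-labeled data
```

In this toy run the cheap model tags a word it never saw in the gold data, which is the effect the pseudo training data is meant to achieve at scale.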
Appendix A. POS Tags Used in This Article
The CTB uses syntactic distribution as the main criterion for distinguishing lexical categories. In Table A.1, we present a brief introduction to the POS tags mentioned in Tables 9 and 10. For more details, refer to the original annotation guidelines.
Acknowledgments
This work was supported by the National Natural Science Foundation of China under grants no. 61300064 and 61331011, and the National High-Tech R&D Program under grant no. 2015AA015403. We are very grateful to the anonymous reviewers for their insightful and constructive comments and suggestions.
Notes
Some relevant information is copied from Table 12.
References
Author notes
The authors are with the Institute of Computer Science and Technology, the MOE Key Laboratory of Computational Linguistics, Peking University, Beijing 100871, China. E-mail: [email protected], [email protected]. Xiaojun Wan is the corresponding author.