Abstract
Finding the right representations for words is critical for building accurate NLP systems when domain-specific labeled data for the task is scarce. This article investigates novel techniques for extracting features from n-gram models, Hidden Markov Models, and other statistical language models, including a novel Partial Lattice Markov Random Field model. Experiments on part-of-speech tagging and information extraction, among other tasks, indicate that features taken from statistical language models, in combination with more traditional features, outperform traditional representations alone, and that graphical model representations outperform n-gram models, especially on sparse and polysemous words.
1. Introduction
NLP systems often rely on hand-crafted, carefully engineered sets of features to achieve strong performance. Thus, a part-of-speech (POS) tagger would traditionally use a feature like “the previous token is the” to help classify a given token as a noun or adjective. For supervised NLP tasks with sufficient domain-specific training data, these traditional features yield state-of-the-art results. However, NLP systems are increasingly being applied to the Web, scientific domains, and personal communications like e-mails and tweets, among many other kinds of linguistic communication. These texts have very different characteristics from traditional training corpora in NLP. Evidence from POS tagging (Blitzer, McDonald, and Pereira 2006; Huang and Yates 2009), parsing (Gildea 2001; Sekine 1997; McClosky 2010), and semantic role labeling (SRL) (Pradhan, Ward, and Martin 2007), among other NLP tasks (Daumé III and Marcu 2006; Chelba and Acero 2004; Downey, Broadhead, and Etzioni 2007; Chan and Ng 2006; Blitzer, Dredze, and Pereira 2007), shows that the accuracy of supervised NLP systems degrades significantly when they are tested on domains different from those used for training. Collecting labeled training data for each new target domain is typically prohibitively expensive. In this article, we investigate representations that can be applied to weakly supervised learning, that is, learning when domain-specific labeled training data are scarce.
A growing body of theoretical and empirical evidence suggests that traditional, manually crafted features for a variety of NLP tasks limit systems' performance in this weakly supervised setting for two reasons. First, feature sparsity prevents systems from generalizing accurately, because many words and features are not observed in training. Because word frequencies are Zipf-distributed, there is often little relevant training data for a substantial fraction of parameters (Bikel 2004b), especially in new domains (Huang and Yates 2009). For example, word-type features form the backbone of most POS-tagging systems, but types like “gene” and “pathway” show up frequently in biomedical literature and rarely in newswire text. Thus, a classifier trained on newswire data and tested on biomedical data will have seen few training examples involving the features “gene” and “pathway” (Blitzer, McDonald, and Pereira 2006; Ben-David et al. 2010).
Further, because words are polysemous, word-type features prevent systems from generalizing to situations in which words have different meanings. For instance, the word type “signaling” appears primarily as a present participle (VBG) in Wall Street Journal (WSJ) text, as in, “Interest rates rose, signaling that …” (Marcus, Marcinkiewicz, and Santorini 1993). In biomedical text, however, “signaling” appears primarily in the phrase “signaling pathway,” where it is considered a noun (NN) (PennBioIE 2005); this phrase never appears in the WSJ portion of the Penn Treebank (Huang and Yates 2010).
Our response to the sparsity and polysemy challenges with traditional NLP representations is to seek new representations that allow systems to generalize to previously unseen examples. That is, we seek representations that permit classifiers to have close to the same accuracy on examples from other domains as they do on the domain of the training data. Our approach depends on the well-known distributional hypothesis, which states that a word's meaning is identified with the contexts in which it appears (Harris 1954; Hindle 1990). Our goal is to develop probabilistic statistical language models that describe the contexts of individual words accurately. We then construct representations, or mappings from word tokens and types to real-valued vectors, from statistical language models. Because statistical language models are designed to model words' contexts, the features they produce can be used to combat problems with polysemy. And by careful design of the statistical language models, we can limit the number of features that they produce, controlling how sparse those features are in training data.
Our specific contributions are as follows:
1. We show how to generate representations from a variety of language models, including n-gram models, Brown clusters, and Hidden Markov Models (HMMs). We also introduce a Partial-Lattice Markov Random Field (PL-MRF), which is a tractable variation of a Factorial Hidden Markov Model (Ghahramani and Jordan 1997) for language modeling, and we show how to produce representations from it.
2. We quantify the performance of these representations in experiments on POS tagging in a domain adaptation setting and on weakly supervised information extraction (IE). We show that the graphical models outperform n-gram representations, even when the n-gram models leverage larger corpora for training. The PL-MRF representation achieves a state-of-the-art 93.8% accuracy on a biomedical POS tagging task, a 5.5 percentage point absolute improvement over more traditional POS tagging representations, a 4.8 percentage point improvement over a tagger using an n-gram representation, and a 0.7 percentage point improvement over a tagger with an n-gram representation trained on several orders of magnitude more data. The HMM representation improves over the n-gram model by 7 percentage points on our IE task.
3. We analyze how sparsity, polysemy, and differences between domains affect the performance of a classifier using different representations. Results indicate that statistical language model representations, and especially graphical model representations, provide the best features for sparse and polysemous words.
The next section describes background material and related work on representation learning for NLP. Section 3 presents novel representations based on statistical language models. Sections 4 and 5 discuss evaluations of the representations, first on sequence-labeling tasks in a domain adaptation setting, and second on a weakly supervised set-expansion task. Section 6 concludes and outlines directions for future work.
2. Background and Previous Work on Representation Learning
2.1 Terminology and Notation
In a traditional machine learning task, the goal is to make predictions on test data using a hypothesis that is optimized on labeled training data. In order to do so, practitioners predefine a set of features and try to estimate classifier parameters from the observed features in the training data. We call these feature sets representations of the data.
Formally, let X be an instance space for a learning problem. Let Y be the space of possible labels for an instance, and let f : X → Y be the target function to be learned. A representation is a function R : X → F, for some suitable feature space F (such as ℝd). We refer to dimensions of F as features, and for an instance x we refer to values for particular dimensions of R(x) as features of x. Given a set of training examples, a learning machine's task is to select a hypothesis h from the hypothesis space H, a subset of the functions from F to Y. Errors by the hypothesis are measured using a loss function that measures the cost of the mismatch between the target function f(x) and the hypothesis h(R(x)).
As an example, the instance set X for POS tagging in English is the set of all English sentences, and Y is the space of POS sequences containing labels like NN (for noun) and VBG (for present participle). The target function f is the mapping between sentences and their correct POS labels. A traditional representation in NLP converts sentences into sequences of vectors, one for each word position. Each vector contains values for features like “+1 if the word at this position ends with -tion, and 0 otherwise.” A typical loss function would count the number of words that are tagged differently by f(x) and h(R(x)).
2.2 Representation-Learning Problem Formulation
The representation-learning problem formulation in Equation (2) can in fact be reduced to the general learning formulation in Equation (1) by setting the fixed representation R to be the identity function, and setting the hypothesis space to be the set of composed functions h ∘ R from the representation-learning task. We introduce the new formulation primarily as a way of changing the perspective on the learning task: most NLP systems consider a fixed, manually crafted transformation of the original data to some new space, and investigate hypothesis classes over that space. In the new formulation, systems learn the transformation to the feature space, and then apply traditional classification or regression algorithms.
2.3 Theory on Domain Adaptation
We refer to a distribution D over the instance space X as a domain. For example, the newswire domain is a distribution over sentences that gives high probability to sentences about governments and current events; the biomedical literature domain gives high probability to sentences about proteins and regulatory pathways. In domain adaptation, a system observes a set of training examples (R(x), f(x)), where instances x are drawn from a source domain S, to learn a hypothesis for classifying examples drawn from a separate target domain T. We assume that large quantities of unlabeled data are available for the source domain and target domain, and call these samples US and UT, respectively. For any domain D, let R(D) represent the induced distribution over the feature space given by applying R to instances drawn from D.
Crucially, the distance between domains depends on the features in the representation. The more that features appear with different frequencies in different domains, the worse this bound becomes. In fact, one lower bound for the d1 distance is the accuracy of the best classifier for predicting whether an unlabeled instance y = R(x) belongs to domain S or T (Ben-David et al. 2010). Thus, if R provides one set of common features for examples from S, and another set of common features for examples from T, the domain of an instance becomes easy to predict, meaning the distance between the domains grows, and the bound on our classifier's performance grows worse.
In light of Ben-David et al.'s theoretical findings, traditional representations in NLP are inadequate for domain adaptation because they contribute to the d1 distance between domains. Although many previous studies have shown that lexical features allow learning systems to achieve impressively low error rates during training, they also make texts from different domains look very dissimilar. For instance, a feature based on the word “bank” or “CEO” may be common in a domain of newswire text, but scarce or nonexistent in, say, biomedical literature. Ben-David et al.'s theory predicts greater variance in the error rate of the target domain classifier as the distance grows.
At the same time, traditional representations contribute to data sparsity, a lack of sufficient training data for the relevant parameters of the system. In traditional supervised NLP systems, there are parameters for each word type in the data, or perhaps even combinations of word types. Because vocabularies can be extremely large, this leads to an explosion in the number of parameters. As a consequence, for many of their parameters, supervised NLP systems have zero or only a handful of relevant labeled examples (Bikel 2004a, 2004b). No matter how sophisticated the learning technique, it is difficult to estimate parameters without relevant data. Because vocabularies differ across domains, domain adaptation greatly exacerbates this issue of data sparsity.
2.4 Problem Formulation for the Domain Adaptation Setting
Note that there is an underlying tension between the two terms of the objective function: The best representation for the source domain would naturally include domain-specific features, and allow a hypothesis to learn domain-specific patterns. We are aiming, however, for the best general classifier, which happens to be trained on training data from one domain (or a few domains). The domain-specific features contribute to distance between domains, and to classifier errors on data taken from domains not seen in training. By optimizing for this combined objective function, we allow the optimization method to trade off between features that are best for classifying source-domain data and features that allow generalization to new domains.
Unlike the representation-learning problem formulation in Equation (2), Equation (5) does not reduce to the standard machine-learning problem (Equation (1)). In a sense, the d1 term acts as a regularizer on the choice of representation R, which in turn also affects the choice of hypothesis h. Representation learning for domain adaptation is a fundamentally novel learning task.
2.5 Tractable Representation Learning: Statistical Language Models as Representations
For most hypothesis classes and any interesting space of representations, Equations (2) and (5) are completely intractable to optimize exactly. Even given a fixed representation, it is intractable to compute the best hypothesis for many hypothesis classes. And the d1 metric is intractable to compute from samples of a distribution, although Ben-David et al. (2007, 2010) propose some tractable bounds. We view these problem formulations as high-level goals rather than as computable objectives.
As a tractable alternative, in this work we investigate the use of statistical language models as a way to represent the meanings of words. This approach depends on the well-known distributional hypothesis, which states that a word's meaning is identified with the contexts in which it appears (Harris 1954; Hindle 1990). From this hypothesis, we can formulate the following testable prediction, which we call the statistical language model representation hypothesis, or LMRH:
To the extent that a model accurately describes a word's possible contexts, parameters of that model are highly informative descriptors of the word's meaning, and are therefore useful as features in NLP tasks like POS tagging, chunking, NER, and information extraction.
The LMRH says, essentially, that for NLP tasks, we can decouple the task of optimizing a representation from the task of optimizing a hypothesis. To learn a representation, we can train a statistical language model on unlabeled text, and then use parameters or latent states from the statistical language model to create a representation function. Optimizing a hypothesis then follows the standard learning framework, using the representation from the statistical language model.
The LMRH is similar to the manifold and cluster assumptions behind other semi-supervised approaches to machine learning, such as Alternating Structure Optimization (ASO) (Ando and Zhang 2005) and Structural Correspondence Learning (SCL) (Blitzer, McDonald, and Pereira 2006). All three of these techniques use predictors built on unlabeled data as a way to harness the manifold and cluster assumptions. However, the LMRH is distinct from at least ASO and SCL in important ways. Both ASO and SCL create multiple “synthetic” or “pivot” prediction tasks using unlabeled data, and find transformations of the input feature space that perform well on these tasks. The LMRH, on the other hand, is more specific — it asserts that for language problems, if we optimize word representations on a single task (the language modeling task), this will lead to strong performance on weakly supervised tasks. In reported experiments on NLP tasks, both ASO and SCL use certain synthetic predictors that are essentially language modeling tasks, such as the task of predicting whether the next token is of word type w. To the extent that these techniques' performance relies on language-modeling tasks as their “synthetic predictors,” they can be viewed as evidence in support of the LMRH.
One significant consequence of the LMRH is that it allows us to leverage well-developed techniques and models from statistical language modeling. Section 3 presents a series of statistical language models that we investigate for learning representations for NLP.
2.6 Previous Work
There is a long tradition of NLP research on representations, mostly falling into one of four categories: 1) vector space models of meaning based on document-level lexical co-occurrence statistics (Salton and McGill 1983; Sahlgren 2006; Turney and Pantel 2010); 2) dimensionality reduction techniques for vector space models (Deerwester et al. 1990; Ritter and Kohonen 1989; Honkela 1997; Kaski 1998; Sahlgren 2001; Sahlgren 2005; Blei, Ng, and Jordan 2003; Väyrynen and Honkela 2004; Väyrynen and Honkela 2005; Väyrynen, Honkela, and Lindqvist 2007); 3) using clusters that are induced from distributional similarity (Brown et al. 1992; Pereira, Tishby, and Lee 1993; Martin, Liermann, and Ney 1998) as non-sparse features (Miller, Guinness, and Zamanian 2004; Ratinov and Roth 2009; Lin and Wu 2009; Candito and Crabbe 2009; Koo, Carreras, and Collins 2008; Suzuki et al. 2009; Zhao et al. 2009); and, recently, 4) neural network statistical language models (Bengio 2008; Bengio et al. 2003; Morin and Bengio 2005; Mnih, Yuecheng, and Hinton 2009; Mnih and Hinton 2007, 2009) as representations (Weston, Ratle, and Collobert 2008; Collobert and Weston 2008; Bengio et al. 2009). Our work is a form of distributional clustering for representations, but where previous work has used bigram and trigram statistics to form clusters, we build sophisticated models that attempt to capture the context of a word, and hence its similarity to other words, more precisely. Our experiments show that the new graphical models provide representations that outperform those from previous work on several tasks.
Neural network statistical language models have recently achieved state-of-the-art perplexity results (Mnih and Hinton 2009), and representations based on them have improved in-domain chunking, NER, and SRL (Weston, Ratle, and Collobert 2008; Turian, Bergstra, and Bengio 2009; Turian, Ratinov, and Bengio 2010). As far as we are aware, Turian, Ratinov, and Bengio (2010) is the only other work to test a learned representation on a domain adaptation task, and they show improvement on out-of-domain NER with their neural net representations. Though promising, the neural network models are computationally expensive to train, and these statistical language models work only on fixed-length histories (n-grams) rather than full observation sequences. Turian, Ratinov, and Bengio's (2010) tests also show that Brown clusters perform as well or better than neural net models on all of their chunking and NER tests. We concentrate on probabilistic graphical models with discrete latent states instead. We show that HMM-based and other representations significantly outperform the more commonly used Brown clustering (Brown et al. 1992) as a representation for domain adaptation settings of sequence-labeling tasks.
Most previous work on domain adaptation has focused on the case where some labeled data are available in both the source and target domains (Chan and Ng 2006; Daumé III and Marcu 2006; Blitzer, Dredze, and Pereira 2007; Daumé III 2007; Jiang and Zhai 2007a, 2007b; Dredze and Crammer 2008; Finkel and Manning 2009; Dredze, Kulesza, and Crammer 2010). Learning bounds for this domain-adaptation setting are known (Blitzer et al. 2007; Mansour, Mohri, and Rostamizadeh 2009). Approaches to this problem setting have focused on appropriately weighting examples from the source and target domains so that the learning algorithm can balance the greater relevance of the target-domain data against the larger source-domain data set. In some cases, researchers combine this approach with semi-supervised learning to include unlabeled examples from the target domain as well (Daumé III, Kumar, and Saha 2010). These techniques do not handle open-domain corpora like the Web well: they require expert input to acquire labels for each new single-domain corpus, and it is difficult to come up with a representative set of labeled training data for each domain. Our technique requires only unlabeled data from each new domain, which is significantly easier and cheaper to acquire. Where target-domain labeled data are available, however, these techniques can in principle be combined with ours to improve performance, although this has not yet been demonstrated empirically.
A few researchers have considered the more general case of domain adaptation without labeled data in the target domain. Perhaps the best known is Blitzer, McDonald, and Pereira's (2006) Structural Correspondence Learning (SCL). SCL uses “pivot” words common to both source and target domains, and trains linear classifiers to predict these pivot words from their context. After an SVD reduction of the weight vectors for these linear classifiers, SCL projects the original features through these weight vectors to obtain new features that are added to the original feature space. Like SCL, our language modeling techniques attempt to predict words from their context, and then use the output of these predictions as new features. Unlike SCL, we attempt to predict all words from their context, and we rely on traditional probabilistic methods for language modeling. Our best learned representations, which involve significantly different techniques from SCL, especially latent-variable probabilistic models, significantly outperform SCL in POS tagging experiments.
Other approaches to domain adaptation without labeled data from the target domain include Satpal and Sarawagi (2007), who show that by changing the optimization function during conditional random field (CRF) training, they can learn classifiers that port well to new domains. Their technique selects feature subsets that minimize the distance between training text and unlabeled test text, but unlike our techniques, theirs cannot learn representations with features that do not appear in the original feature set. In contrast, we learn hidden features through statistical language models. McClosky, Charniak, and Johnson (2010) use classifiers from multiple source domains and features that describe how much a target document diverges from each source domain to determine an optimal weighting of the source-domain classifiers for parsing the target text. However, it is unclear if this “source-combination” technique works well on domains that are not mixtures of the various source domains. Dai et al. (2007) use KL-divergence between domains to directly modify the parameters of their naive Bayes model for a text classification task trained purely on the source domain. These last two techniques are not representation learning, and are complementary to our techniques.
Our representation-learning approach to domain adaptation is an instance of semi-supervised learning. Of the vast number of semi-supervised approaches to sequence labeling in NLP, the most relevant ones here include Suzuki and Isozaki's (2008) combination of HMMs and CRFs, which uses over a billion words of unlabeled text to achieve the current best performance on in-domain chunking, and semi-supervised approaches to improving in-domain SRL with large quantities of unlabeled text (Weston, Ratle, and Collobert 2008; Deschacht and Moens 2009; Fürstenau and Lapata 2009). Ando and Zhang's (2005) semi-supervised sequence labeling technique has been tested on a domain adaptation task for POS tagging (Blitzer, McDonald, and Pereira 2006); our representation-learning approaches outperform it. Unlike most semi-supervised techniques, we concentrate on a particularly simple task decomposition: unsupervised learning of new representations, followed by standard supervised learning. Besides being simple, this decomposition yields task-independent representations: we can learn a representation once and then apply it to any task.
One of the best-performing representations that we consider for domain adaptation is based on the HMM (Rabiner 1989). HMMs have of course also been used for supervised, semi-supervised, and unsupervised POS tagging on a single domain (Banko and Moore 2004; Goldwater and Griffiths 2007). Recent efforts on improving unsupervised POS tagging have focused on incorporating prior knowledge into the POS induction model (Graça et al. 2009; Toutanova and Johnson 2007), or on new training techniques like contrastive estimation (Smith and Eisner 2005) for alternative sequence models. Despite the fact that completely connected, standard HMMs perform poorly at the POS induction task (Johnson 2007), we show that they still provide very useful features for a supervised POS tagger. Experiments in information extraction have previously also shown that HMMs provide informative features for this quite different, semantic processing task (Downey, Schoenmackers, and Etzioni 2007; Ahuja and Downey 2010).
This article extends our previous work on learning representations for domain adaptation (Huang and Yates 2009, 2010) by investigating new language representations—the naive Bayes representation and PL-MRF representation (Huang et al. 2011)—by analyzing results in terms of polysemy, sparsity, and domain divergence; by testing on new data sets including a Chinese POS tagging task; and by providing an empirical comparison with Brown clusters as representations.
3. Learning Representations of Distributional Similarity
In this section, we will introduce several representation learning models.
3.1 Traditional POS-Tagging Representations
As an example of our terminology, we begin by describing a representation used in traditional POS taggers (this representation will later form a baseline for our POS tagging experiments). The instance set X is the set of English sentences, and Y is the set of POS tag sequences. A traditional representation Trad-R maps a sentence to a sequence of boolean-valued vectors, one vector per word xi in the sentence. Dimensions of each vector include indicators for the word type of xi and various orthographic features. Table 1 presents the full list of features in Trad-R. Because our IE task classifies word types rather than tokens, this baseline is not appropriate for that task. In the remainder of this section, we describe how we can learn representations R by using a variety of statistical language models, for use in both our IE and POS tagging tasks. All representations for POS tagging inherit the features from Trad-R; the representations for IE do not.
Table 1

| Representation | Features |
|---|---|
| Trad-R | ∀w: 1[xi = w]; ∀s ∈ Suffixes: 1[xi ends with s]; 1[xi contains a digit] |
| n-gram-R | ∀w′, w″: P(w′ w w″) / P(w) |
| Lsa-R | ∀w, j: {v′left(w)}j; ∀w, j: {v′right(w)}j |
| NB-R | (context-dependent cluster features; see Section 3.3) |
| Hmm-Token-R | y*i, the Viterbi-decoded state for token xi (see Section 3.4) |
| Hmm-Type-R | ∀k: P(y = k ∣ x = w) |
| I-Hmm-Token-R | ∀j: y*i,j, the decoded state of HMM layer j (see Section 3.4) |
| I-Hmm-Type-R | ∀j, k: P(y·,j = k ∣ x = w) |
| Brown-Token-R | ∀j ∈ {−2, −1, 0, 1, 2}, ∀p ∈ {4, 6, 10, 20}: prefix(yi+j, p) |
| Brown-Type-R | ∀p: prefix(y, p) |
| Lattice-Token-R | ∀j, k: 1[y*i,j = k] (see Section 3.5) |
| Lattice-Type-R | ∀k: P(y = k ∣ x = w) |
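To make the Trad-R feature functions concrete, the following is a minimal sketch (not the code used in the experiments); the particular suffix inventory here is a placeholder assumption.

```python
# Minimal sketch of Trad-R-style features for one token (illustrative only;
# the suffix list is a placeholder, not the inventory used in the article).
SUFFIXES = ["-tion", "-ing", "-ed", "-ly", "-s"]

def trad_r_features(sentence, i):
    """Return boolean features for the token at position i of a tokenized sentence."""
    word = sentence[i]
    feats = {
        "word=" + word.lower(): 1,                         # indicator for the word type
        "contains_digit": int(any(c.isdigit() for c in word)),
    }
    for s in SUFFIXES:
        feats["suffix=" + s] = int(word.lower().endswith(s.lstrip("-")))
    return feats

if __name__ == "__main__":
    sent = "Interest rates rose , signaling that inflation may follow .".split()
    print(trad_r_features(sent, 4))   # features for "signaling"
```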
3.2 n-gram Representations
n-gram representations, which we call n-gram-R, model a word type w in terms of the n-gram contexts in which w appears in a corpus. Specifically, for word w we generate a vector whose entries are the values P(w′ w w″)/P(w), the conditional probabilities of observing the word w′ immediately to the left and w″ immediately to the right of w. Each dimension of this vector corresponds to one combination of left and right words. The experimental section describes the particular corpora and statistical language modeling methods used for estimating probabilities. Note that these features depend only on the word type w, and so for every token xi = w, n-gram-R provides the same set of features regardless of local context.
One drawback of n-gram-R is that it does not handle sparsity well—the features are as sparsely observed as the lexical features in Trad-R, except that n-gram-R features can be obtained from larger corpora. As an alternative, we apply latent semantic analysis (LSA) (Deerwester et al. 1990) to compute a reduced-rank representation. For word w, let vright(w) represent the right context vector of w, which in each dimension contains the value of P(ww″) / P(w) for some word w″, as observed in the n-gram model. Similarly, let vleft(w) be the left context vector of w. We apply LSA to the set of right context vectors and the set of left context vectors separately,1 to find reduced-rank versions v′right(w) and v′left(w), where each dimension represents a combination of several context word types. We then use each component of v′right(w) and v′left(w) as features. After experimenting with different choices for the number of dimensions to reduce our vectors to, we choose a value of 10 dimensions as the one that maximizes the performance of our supervised sequence labelers on held-out data. We call this model Lsa-R.
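As a rough illustration of how Lsa-R can be built (a sketch under simplifying assumptions, not the exact pipeline used in the experiments), the code below estimates left- and right-context conditional-probability vectors from bigram counts and reduces each set of vectors to 10 dimensions with a truncated SVD; smoothing, vocabulary pruning, and the much larger corpora used in practice are omitted.

```python
import numpy as np
from collections import Counter, defaultdict

def context_vectors(sentences):
    """Estimate P(w w'')/P(w) (right context) and P(w' w)/P(w) (left context) from bigram counts."""
    unigrams, right, left = Counter(), defaultdict(Counter), defaultdict(Counter)
    for sent in sentences:
        for i, w in enumerate(sent):
            unigrams[w] += 1
            if i + 1 < len(sent):
                right[w][sent[i + 1]] += 1
                left[sent[i + 1]][w] += 1
    vocab = sorted(unigrams)
    idx = {w: j for j, w in enumerate(vocab)}
    def to_matrix(ctx):
        m = np.zeros((len(vocab), len(vocab)))
        for w, contexts in ctx.items():
            for c, n in contexts.items():
                m[idx[w], idx[c]] = n / unigrams[w]   # conditional probability estimate
        return m
    return vocab, to_matrix(right), to_matrix(left)

def lsa_reduce(matrix, k=10):
    """Reduced-rank word vectors via truncated SVD (rows correspond to words)."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    k = min(k, len(s))
    return u[:, :k] * s[:k]

if __name__ == "__main__":
    corpus = [["the", "gene", "regulates", "the", "pathway"],
              ["the", "bank", "raised", "rates"]]
    vocab, right_m, left_m = context_vectors(corpus)
    v_right, v_left = lsa_reduce(right_m), lsa_reduce(left_m)
    print(vocab[0], np.concatenate([v_left[0], v_right[0]]))
```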
3.3 A Context-Dependent Representation Using Naive Bayes
The naive Bayes statistical language model factors a sentence into its overlapping trigrams (xi−1, xi, xi+1), each generated independently by a latent cluster variable yi that ranges over a small set S of states. Because this factorization does not take into account the fact that the trigrams overlap, the resulting statistical language model is mass-deficient. Worse still, it throws away information from the dependencies among trigrams that might help make better clustering decisions. Nevertheless, this model closely mirrors many of the clustering algorithms used in previous approaches to representation learning for sequence labeling (Ushioda 1996; Miller, Guinness, and Zamanian 2004; Koo, Carreras, and Collins 2008; Lin and Wu 2009; Ratinov and Roth 2009), and therefore serves as an important benchmark.
For a reasonable choice of S (i.e., |S| ≪ |V|), each feature should be observed often in a sufficiently large training data set. Therefore, compared with n-gram-R, NB-R produces far fewer features. On the other hand, its features for xi depend not just on the contexts in which xi has appeared in the statistical language model's training data, but also on xi−1 and xi+1 in the current sentence. Furthermore, because these features take values in a far more restricted range, they are less prone to data sparsity and to variation across domains than real-valued features.
3.4 Context-Dependent, Structured Representations: The Hidden Markov Model
In previous work, we have implemented several representations based on hidden Markov models (Rabiner 1989), which we used for both sequential labeling (like POS tagging [Huang et al. 2011] and NP chunking [Huang and Yates 2009]) and IE (Downey, Schoenmackers, and Etzioni 2007). Figure 2 shows a graphical model of an HMM. An HMM is a generative probabilistic model that generates each word xi in the corpus conditioned on a latent variable yi. Each yi in the model takes on integral values from 1 to K, and each one is generated by the latent variable for the preceding word, yi−1. The joint distribution for a corpus x = (x1, …, xN) and a sequence of states y = (y1, …, yN) is given by P(x, y) = ∏i P(xi ∣ yi) P(yi ∣ yi−1). Using expectation-maximization (EM) (Dempster, Laird, and Rubin 1977), it is possible to estimate the distributions for P(xi|yi) and P(yi|yi−1) from unlabeled data.
We construct two different representations from HMMs, one for sequence-labeling tasks and one for IE. For sequence labeling, we use the Viterbi algorithm to produce the optimal setting y* of the latent states for a given sentence x, that is, y* = argmaxy P(x, y). We use the value of y*i as a new feature for xi that represents a cluster of distributionally similar words. For IE, we require features for word types w, rather than tokens xi. Applying Bayes' rule to the HMM parameters, we compute a distribution P(Y|x = w), where Y is a single latent node, x is a single token, and w is its word type. We then use each of the K values of P(Y = k|x = w), where k ranges from 1 to K, as features. This set of features represents a “soft clustering” of w into K different clusters. We refer to these representations as Hmm-Token-R and Hmm-Type-R, respectively.
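Both HMM-derived feature sets can be sketched in a few lines, assuming an HMM that has already been trained with EM and is given by its initial, transition, and emission probabilities. The Viterbi routine yields the Hmm-Token-R feature for each token; the type routine applies Bayes' rule, using the HMM's stationary state distribution as a stand-in for P(Y = k), which is an assumption, since the article does not spell out how that marginal is computed.

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely state sequence y* for a sequence of observation indices obs.
    pi: K initial probs, A: KxK transition probs, B: KxV emission probs."""
    K, N = len(pi), len(obs)
    delta = np.zeros((N, K))
    back = np.zeros((N, K), dtype=int)
    delta[0] = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(1, N):
        scores = delta[t - 1][:, None] + np.log(A) + np.log(B[:, obs[t]])[None, :]
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0)
    y = [int(delta[-1].argmax())]
    for t in range(N - 1, 0, -1):
        y.append(int(back[t][y[-1]]))
    return y[::-1]          # Hmm-Token-R feature: the decoded state of each token

def type_features(w_index, pi, A, B, n_iter=200):
    """Hmm-Type-R sketch: P(Y = k | x = w) via Bayes' rule, using an approximate
    stationary state distribution as P(Y = k) (an assumption for illustration)."""
    p_state = pi.copy()
    for _ in range(n_iter):
        p_state = p_state @ A
    joint = p_state * B[:, w_index]     # P(Y = k) P(x = w | Y = k)
    return joint / joint.sum()

if __name__ == "__main__":
    # Tiny hand-set HMM over a 3-word vocabulary, purely for illustration.
    pi = np.array([0.6, 0.4])
    A = np.array([[0.7, 0.3], [0.4, 0.6]])
    B = np.array([[0.5, 0.4, 0.1], [0.1, 0.2, 0.7]])
    print(viterbi([0, 1, 2], pi, A, B))
    print(type_features(2, pi, A, B))
```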
We also compare against a multi-layer variation of the HMM from our previous work (Huang and Yates 2010). This model trains an ensemble of M independent HMM models on the same corpus, initializing each one randomly. We can then use the Viterbi-optimal decoded latent state of each independent HMM model as a separate feature for a token, or the posterior distribution for P(Y|x = w) from each HMM as a separate set of features for each word type. We refer to this statistical language model as an I-HMM, and the representations as I-Hmm-Token-R and I-Hmm-Type-R, respectively.
Finally, we compare against Brown clusters (Brown et al. 1992) as learned features. Although not traditionally described as such, Brown clustering involves constructing an HMM model in which each word type is restricted to having exactly one latent state that may generate it. Brown et al. describe a greedy agglomerative clustering algorithm for training this model on unlabeled text. Following Turian, Ratinov, and Bengio (2010), we use Percy Liang's implementation of this algorithm for our comparison, and we test runs with 100, 320, 1,000 and 3,200 clusters. We use features from these clusters identical to Turian et al.'s.2 Turian et al. have shown that Brown clusters match or exceed the performance of neural network-based statistical language models in domain adaptation experiments for named-entity recognition, as well as in-domain experiments for NER and chunking.
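Brown-Token-R features can be read off a trained clustering. The sketch below assumes a word-to-bit-string mapping stored as "bitstring<TAB>word<TAB>count" lines, which is an assumption about the clustering tool's output format; the offsets and prefix lengths follow Table 1.

```python
OFFSETS = (-2, -1, 0, 1, 2)
PREFIX_LENGTHS = (4, 6, 10, 20)

def load_clusters(path):
    """Load a word -> bit-string map; assumes lines of 'bits<TAB>word<TAB>count'."""
    clusters = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            bits, word, _count = line.rstrip("\n").split("\t")
            clusters[word] = bits
    return clusters

def brown_token_features(sentence, i, clusters):
    """Prefix features of the cluster bit-strings of tokens at offsets -2..2."""
    feats = {}
    for j in OFFSETS:
        if 0 <= i + j < len(sentence):
            bits = clusters.get(sentence[i + j])
            if bits is None:
                continue
            for p in PREFIX_LENGTHS:
                feats["brown[%+d]_prefix%d=%s" % (j, p, bits[:p])] = 1
    return feats

if __name__ == "__main__":
    toy = {"signaling": "110100101", "pathway": "1101110001"}   # hypothetical bit strings
    print(brown_token_features(["the", "signaling", "pathway"], 1, toy))
```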
Because HMM-based representations offer a small number of discrete states as features, they have a much greater potential to combat sparsity than do n-gram models. Furthermore, for token-based representations, these models can potentially handle polysemy better than n-gram statistical language models by providing different features in different contexts.
3.5 A Novel Lattice Statistical Language Model Representation
Our final statistical language model is a novel latent-variable statistical language model, called a Partial Lattice MRF (PL-MRF), with rich latent structure, shown in Figure 3. The model contains a lattice of M × N latent states, where N is the number of words in a sentence and M is the number of layers in the model. The dotted and solid lines in the figure together form a complete lattice of edges between these nodes; the PL-MRF uses only the solid edges. Formally, let N be the length of the sentence and let c denote its central position; let i denote a position in the sentence, and let j denote a layer in the lattice. If i < c and j is odd, or if j is even and i > c, we delete the edges between yi,j and yi,j+1 from the complete lattice. The same set of nodes remains, but the partial lattice contains fewer edges and paths between the nodes. A central “trunk” at i = c connects all layers of the lattice, and branches from this trunk connect either to the branches in the layer above or the layer below (but not both).
The result is a model that retains most of the edges of the complete lattice but, unlike the complete lattice, supports tractable inference. As M, N → ∞, five out of every six edges from the complete lattice appear in the PL-MRF. However, the PL-MRF makes the branches conditionally independent from one another, except through the trunk. For instance, in Figure 3, layers 1 and 2 of the left branch are disconnected from each other because the edges (y1,1, y1,2) and (y2,1, y2,2) are removed; similarly, layers 2 and 3 of the right branch are disconnected because the edges (y4,2, y4,3) and (y5,2, y5,3) are removed, so these layers communicate only through the trunk and the observed nodes. As a result, this model has a low tree-width of 2 (excluding the observed nodes), and a variety of efficient dynamic programming and message-passing algorithms for training and inference can be readily applied (Bodlaender 1988). Our inference algorithm passes information from the branches inwards to the trunk, and then upward along the trunk, in time O(K^4 MN). In contrast, a fully connected lattice model has tree-width min(M, N), making inference and learning intractable (Sutton, McCallum, and Rohanimanesh 2007), partly because of the difficulty in enumerating and summing over the exponentially many configurations y for a given x.
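The edge-deletion rule can be made concrete with a short sketch that enumerates the edges retained by the partial lattice. The placement of the trunk at the central column and the 0-based parity convention are assumptions made for illustration; counting observation, horizontal, and vertical edges together, the retained fraction approaches five out of six as M and N grow.

```python
def plmrf_edges(M, N):
    """Enumerate edges retained by the partial lattice: layers j = 0..M-1,
    positions i = 0..N-1, trunk assumed at the central column c."""
    c = N // 2
    edges = []
    # observation edges: every latent node is connected to its word
    edges += [(("y", i, j), ("x", i)) for i in range(N) for j in range(M)]
    # horizontal edges within each layer
    edges += [(("y", i, j), ("y", i + 1, j)) for j in range(M) for i in range(N - 1)]
    # vertical edges between adjacent layers; the deletion rule from the text,
    # stated here with 0-based layer indices (parity flips with the indexing convention)
    for j in range(M - 1):
        for i in range(N):
            if (i < c and j % 2 == 1) or (i > c and j % 2 == 0):
                continue                      # this vertical edge is deleted
            edges.append((("y", i, j), ("y", i, j + 1)))
    return edges

if __name__ == "__main__":
    M, N = 20, 30
    kept = len(plmrf_edges(M, N))
    full = M * N + M * (N - 1) + (M - 1) * N   # observation + horizontal + all vertical edges
    print(kept, full, kept / full)             # ratio approaches 5/6 as M, N grow
```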
We can justify the choice of this model from a linguistic perspective as a way to capture the multi-dimensional nature of words. Linguists have long argued that words have many different features in a high dimensional space: They can be separately described by part of speech, gender, number, case, person, tense, voice, aspect, mass vs. count, and a host of semantic categories (agency, animate vs. inanimate, physical vs. abstract, etc.), to name a few (Sag, Wasow, and Bender 2003). In the PL-MRF, each layer of nodes is intended to represent some latent dimension of words.
As with our HMM models, we create two representations from PL-MRFs, one for tokens and one for types. For tokens, we decode the model to compute y*, the matrix of optimal latent state values for sentence x. For each layer j and each possible latent state value k, we add a boolean feature for token xi that is true iff y*i,j = k. For word types, we compute distributions over the latent state space. Let y be a column vector of latent variables for word type w. For a PL-MRF model with M layers of binary variables, there are 2^M possible values for y. Our type representation computes a probability distribution over these 2^M possible values, and uses each probability as a feature for w.3 We refer to these two representations as Lattice-Token-R and Lattice-Type-R, respectively.
For tractability, we modify the training procedure to train the PL-MRF one layer at a time. Let θi represent the set of parameters relating to features of layer i, and let θ¬i represent all other parameters. We fix θ¬0 = 0, and optimize θ0 using contrastive estimation. After convergence, we fix θ¬1, and optimize θ1, and so on. For training each layer, we use a convergence threshold of 10^−6 on the objective function in Equation (7), and each layer typically converges in under 100 iterations.
4. Domain Adaptation with Learned Representations
We evaluate the representations described earlier on POS tagging and NP chunking tasks in a domain adaptation setting.
4.1 A Rich Problem Setting for Representation Learning
Existing supervised NLP systems are domain-dependent: There is a substantial drop in their performance when tested on data from a new domain. Domain adaptation is the task of overcoming this domain dependence. The aim is to build an accurate system for a target domain by training on labeled examples from a separate source domain. This problem is sometimes also called transfer learning (Raina et al. 2007).
Two of the challenges for NLP representations, sparsity and polysemy, are exacerbated by domain adaptation. New domains come with new words and phrases that appear rarely (or even not at all) in the training domain, thus increasing problems with data sparsity. And even for words that do appear commonly in both domains, the contexts around the words will change from the training domain to the target domain. As a result, domain adaptation adds to the challenge of handling polysemous words, whose meaning depends on context.
In short, domain adaptation is a challenging setting for testing NLP representations. We now present several experiments testing our representations against state-of-the-art POS taggers in a variety of domain adaptation settings, showing that the learned representations surpass the previous state-of-the-art, without requiring any labeled data from the target domain.
4.2 Experimental Set-up
For domain adaptation, we test our representations on two sequence labeling tasks: POS tagging and chunking. To incorporate a learned representation into our models, we follow a general procedure, although the details vary by experiment and are given in the following sections. First, we collect a set of unannotated text from both the training domain and the test domain. Second, we learn representations from the unannotated text. We then automatically annotate both the training and test data with features from the learned representation. Finally, we train a supervised linear-chain CRF model on the annotated training set and apply it to the test set.
We use an open source CRF software package designed by Sunita Sarawagi to train and apply our CRF models.4 As is standard, we use two kinds of feature functions: transition and observation. Transition feature functions indicate, for each pair of labels l and l′, whether zi = l and zi − 1 = l′. Boolean observation feature functions indicate, for each label l and each feature f provided by a representation, whether zi = l and xi has feature f. For each label l and each real-valued feature f in representation R, real-valued observation feature functions have value f (x) if zi = l, and are zero otherwise.
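The following sketch shows how per-token features from several representations could be merged into the feature dictionaries that most CRF toolkits accept; it is only illustrative of the setup described above (the experiments themselves used Sarawagi's Java package), and the toy representations in the usage example are hypothetical.

```python
def crf_observation_features(sentence, representations):
    """Combine per-token features from one or more representations into the
    dictionaries consumed by a linear-chain CRF toolkit. Boolean features are
    emitted as indicator keys; real-valued features keep their value, so the
    toolkit can form label-conditioned observation functions from them."""
    per_token = []
    for i in range(len(sentence)):
        feats = {}
        for rep_name, rep_fn in representations.items():
            for name, value in rep_fn(sentence, i).items():
                feats["%s:%s" % (rep_name, name)] = value
        per_token.append(feats)
    return per_token

if __name__ == "__main__":
    # Two toy representations: a word-identity feature and a stand-in for a decoded HMM state.
    reps = {
        "trad": lambda s, i: {"word=" + s[i].lower(): 1},
        "hmm": lambda s, i: {"state=%d" % (len(s[i]) % 3): 1},
    }
    for fdict in crf_observation_features(["Interest", "rates", "rose"], reps):
        print(fdict)
```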
4.3 Domain Adaptation for POS Tagging
Our first experiment tests the performance of all the representations we introduced earlier on an English POS tagging task, trained on newswire text, to tag biomedical research literature. We follow Blitzer et al.'s experimental set-up. The labeled data consists of the WSJ portion of the Penn Treebank (Marcus, Marcinkiewicz, and Santorini 1993) as source domain data, and 561 labeled sentences (9,576 tokens) from the biomedical research literature database MEDLINE as target domain data (PennBioIE 2005). Fully 23% of the tokens in the labeled test text are never seen in the WSJ training data. The unlabeled data consists of the WSJ text plus 71,306 additional sentences of MEDLINE text (Blitzer, McDonald, and Pereira 2006). As a preprocessing step, we replace hapax legomena (defined as words that appear once in our unlabeled training data) with the special symbol *UNKNOWN*, and do the same for words in the labeled test sets that never appeared in any of our unlabeled training text.
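The vocabulary preprocessing step can be sketched as follows (a minimal version, with tokenization and file handling left out).

```python
from collections import Counter

UNK = "*UNKNOWN*"

def build_vocabulary(unlabeled_sentences):
    """Keep only word types seen more than once in the unlabeled data."""
    counts = Counter(w for sent in unlabeled_sentences for w in sent)
    return {w for w, c in counts.items() if c > 1}

def replace_rare(sentences, vocab):
    """Map hapax legomena (and unseen test words) to the *UNKNOWN* symbol."""
    return [[w if w in vocab else UNK for w in sent] for sent in sentences]

if __name__ == "__main__":
    unlabeled = [["the", "signaling", "pathway"], ["the", "kinase"]]
    vocab = build_vocabulary(unlabeled)
    print(replace_rare([["the", "kinase", "cascade"]], vocab))
    # -> [['the', '*UNKNOWN*', '*UNKNOWN*']]
```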
For representations, we tested Trad-R, n-gram-R, Lsa-R, NB-R, Hmm-Token-R, I-Hmm-Token-R (between 2 and 8 layers), and Lattice-Token-R (8, 12, 16, and 20 layers). Each latent node in the I-HMMs had 80 possible values, creating 80^8 ≈ 10^15 possible configurations of the eight-layer I-HMM for a single word. Each node in our PL-MRF is binary, creating a much smaller number (2^20 ≈ 10^6) of possible configurations for each word in a 20-layer representation. To give the n-gram model the largest training data set available, we trained it on the Web 1T 5-gram corpus (Brants and Franz 2006). We included the top 500 most common n-grams for each word type, and then used mutual information on the training data to select the top 10,000 most relevant n-gram features for all word types, in order to keep the number of features manageable. We incorporated n-gram features as binary values indicating whether xi appeared with the n-gram or not. For comparison, we also report the performance of Brown clusters (100, 320, 1,000, and 3,200 possible clusters), following Turian, Ratinov, and Bengio (2010). Finally, we compare against Blitzer, McDonald, and Pereira's (2006) SCL technique, described in Section 2.6, and against the standard semi-supervised learning algorithm ASO (Ando and Zhang 2005), whose results on this task were previously reported by Blitzer, McDonald, and Pereira (2006).
Table 2 shows the results for the best variation of each kind of model—20 layers for the PL-MRF, 7 layers for the I-HMM, and 3,200 clusters for the Brown clustering. All statistical language model representations outperform the Trad-R baseline.
Table 2

| Model | Error, all words (%) | Error, OOV words (%) |
|---|---|---|
| Trad-R | 11.7 | 32.7 |
| n-gram-R | 11.7 | 32.2 |
| Lsa-R | 11.6 | 31.1 |
| NB-R | 11.6 | 30.7 |
| ASO | 11.6 | 29.1 |
| SCL | 11.1 | 28 |
| Brown-Token-R | 10.0 | 25.2 |
| Hmm-Token-R | 9.5 | 24.8 |
| Web1T-n-gram-R | 6.9 | 24.4 |
| I-Hmm-Token-R | 6.7 | 24 |
| Lattice-Token-R | 6.2 | 21.3 |
| SCL+500bio | 3.9 | – |
In nearly all cases, learned representations significantly outperformed Trad-R. The best representation, the 20-layer Lattice-Token-R, reduces error by 47% (35% on out-of-vocabulary (OOV) words) relative to the baseline Trad-R, and by 44% (24% on OOV words) relative to the benchmark SCL system. For comparison, this model achieved a 96.8% in-domain accuracy on Sections 22–24 of the Penn Treebank, about 0.5 percentage point shy of a state-of-the-art in-domain system with more sophisticated supervised learning (Shen, Satta, and Joshi 2007). The Brown-Token-R representation, which Turian, Ratinov, and Bengio (2010) demonstrated performs as well as or better than a variety of neural network statistical language models, achieved accuracies between those of the SCL system and the Hmm-Token-R. The Web1T-n-gram-R, I-Hmm-Token-R, and Lattice-Token-R all performed quite close to one another, but the I-Hmm-Token-R and Lattice-Token-R were trained on many orders of magnitude less text. The Lsa-R and NB-R outperformed the Trad-R baseline but not the SCL system. The n-gram-R, which was trained on the same text as the other representations except the Web1T-n-gram-R, performed far worse than the Web1T-n-gram-R.
The amount of unlabeled training data has a significant impact on the performance of these representations. This is apparent in the difference between Web1T-n-gram-R and n-gram-R, but it is also true for our other representations. Figure 4 shows the accuracy of a representative subset of our taggers on words not seen in labeled training data, as we vary the amount of unlabeled training data available to the language models. Performance grows steadily for all representations we measured, and none of the learning curves appears to have peaked. Furthermore, the margin between the more complex graphical models and the simpler n-gram models grows with increasing amounts of training data.
4.3.1 Sparsity and Polysemy
We expected that statistical language model representations would perform well in part because they provide meaningful features for sparse and polysemous words. For sparse tokens, these trends are already evident in the results in Table 2: Models that provide a constrained number of features, like HMM-based models, tend to outperform models that provide huge numbers of features (each of which, on average, is only sparsely observed in training data), like Trad-R.
As for polysemy, HMM models significantly outperform naive Bayes models and the n-gram-R. The n-gram-R's features do not depend on a token type's context at all, and the NB-R's features depend only on the tokens immediately to the right and left of the current word. In contrast, the HMM takes into account all tokens in the surrounding sentence (although the strength of the dependence on more distant words decreases rapidly). Thus the performance of the HMM compared with n-gram-R and NB-R, as well as the performance of the Lattice-Token-R compared with the Web1T-n-gram-R, suggests that representations that are sensitive to the context of a word produce better features.
To test these effects more rigorously, we selected 109 polysemous word types from our test data, along with 296 non-polysemous word types. The set of polysemous word types was selected by filtering for words in our labeled data that had at least two POS tags that began with distinct letters (e.g., VBZ and NNS). An initial set of non-polysemous word types was selected by filtering for types that appeared with just one POS tag. We then manually inspected these initial selections to remove obvious cases of word types that were in fact polysemous within a single part-of-speech, such as “bank.” We further define sparse word types as those that appear five times or fewer in all of our unlabeled data, and we define non-sparse word types as those that appear at least 50 times in our unlabeled data. Table 3 shows our POS tagging results on the tokens of our labeled biomedical data with word types matching these four categories.
Table 3 (tagging accuracy, %; rows in parentheses give the difference from the Web1T-n-gram-R row)

| Model | polysemous | not polysemous | sparse | not sparse |
|---|---|---|---|---|
| tokens | 159 | 4,321 | 463 | 12,194 |
| Trad-R | 59.5 | 78.5 | 52.5 | 89.6 |
| Web1T-n-gram-R | 68.2 | 85.3 | 61.8 | 94.0 |
| NB-R | 64.5 | 88.7 | 57.8 | 89.4 |
| (− Web1T-n-gram-R) | (−3.7) | (+3.4) | (−4.0) | (−4.6) |
| Hmm-Token-R | 67.9 | 83.4 | 60.2 | 91.6 |
| (− Web1T-n-gram-R) | (−0.3) | (−1.9) | (−1.6) | (−2.4) |
| I-Hmm-Token-R | 75.6 | 85.2 | 62.9 | 94.5 |
| (− Web1T-n-gram-R) | (+7.4) | (−0.1) | (+1.1) | (+0.5) |
| Lattice-Token-R | 70.5 | 86.9 | 65.2 | 94.6 |
| (− Web1T-n-gram-R) | (+2.3) | (+1.6) | (+3.4) | (+0.6) |
As expected, all of our statistical language models outperform the baseline by a larger margin on polysemous words than on non-polysemous words. The margin between graphical model representations and the Web1T-n-gram-R model also increases on polysemous words, except for the NB-R. The Web1T-n-gram-R uses none of the local context to decide which features to provide, and the NB-R uses only the immediate left and right context, so both models ignore most of the context. In contrast, the remaining graphical models use Viterbi decoding to take into account all tokens in the surrounding sentence, which helps to explain their relative improvement over Web1T-n-gram-R on polysemous words.
The same behavior is evident for sparse words, as compared with non-sparse words: All of the statistical language model representations outperform the baseline by a larger margin on sparse words than on non-sparse words, and all of the graphical models perform better relative to the Web1T-n-gram-R on sparse words than on non-sparse words. By reducing the feature space from millions of possible n-gram features to L categorical features, these models ensure that each of their features will be observed often in a reasonably sized training data set. Thus representations based on graphical models help address two key issues in building representations for POS tagging.
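The selection of word types used in this analysis can be sketched as follows, assuming token-level (word, tag) pairs from the labeled data and raw tokens from the unlabeled data; the manual inspection step that removes words like "bank" is of course not reproduced.

```python
from collections import defaultdict, Counter

def split_word_types(labeled_tokens, unlabeled_tokens):
    """labeled_tokens: iterable of (word, POS tag) pairs from the labeled data.
    unlabeled_tokens: iterable of words from the unlabeled data."""
    tags_by_word = defaultdict(set)
    for word, tag in labeled_tokens:
        tags_by_word[word].add(tag)
    unlabeled_counts = Counter(unlabeled_tokens)

    # Polysemous: at least two tags beginning with distinct letters (e.g., VBZ vs. NNS).
    polysemous = {w for w, tags in tags_by_word.items()
                  if len({t[0] for t in tags}) >= 2}
    # Candidate non-polysemous: exactly one observed tag (manual inspection follows).
    non_polysemous = {w for w, tags in tags_by_word.items() if len(tags) == 1}
    # Sparse: at most 5 occurrences in unlabeled data; non-sparse: at least 50.
    sparse = {w for w in tags_by_word if unlabeled_counts[w] <= 5}
    non_sparse = {w for w in tags_by_word if unlabeled_counts[w] >= 50}
    return polysemous, non_polysemous, sparse, non_sparse

if __name__ == "__main__":
    labeled = [("signaling", "VBG"), ("signaling", "NN"), ("pathway", "NN")]
    unlabeled = ["signaling"] * 60 + ["pathway"] * 3
    print(split_word_types(labeled, unlabeled))
```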
4.3.2 Domain Divergence
Besides sparsity and polysemy, Ben-David et al.'s (2007, 2010) theoretical analysis of domain adaptation shows that the distance between two domains under a representation R of the data is crucial for a good representation. We test their predictions using learned representations.
Intuitively, we aim to measure the distance between two domains by measuring whether features appear more commonly in one domain than in the other. For instance, the biomedical domain is far from the newswire domain under the Trad-R representation because word-based features like protein, gene, and pathway appear far more commonly in the biomedical domain than the newswire domain. Likewise, bank and president appear far more commonly in newswire text. Since the d1 distance is related to the optimal classifier for distinguishing two domains, it makes sense to measure the distance by comparing the frequencies of these features: a classifier can easily use the occurrence of words like bank and protein to accurately predict whether a given sentence belongs to the newswire or biomedical domain.
More formally, let S and T be two domains, and let f be a feature5 in representation R—that is, a dimension of the image space of R. Let V be the set of possible values that f can take on. Let US be an unlabeled sample drawn from S, and likewise for UT. We first compute the relative frequencies of the different values of f in R(US) and R(UT), and then compute dJS between these empirical distributions. Let pf represent the empirical distribution over V estimated from observations of feature f in R(US), and let qf represent the same distribution estimated from R(UT).
Definition 1

For a feature f of representation R, the divergence between domains S and T with respect to f is the Jensen–Shannon divergence between the empirical distributions pf and qf: dJS(pf, qf) = ½ KL(pf ∥ m) + ½ KL(qf ∥ m), where m = ½(pf + qf) and KL denotes the Kullback–Leibler divergence.
For a multidimensional representation, we compute the full domain divergence as a weighted sum over the domain divergences for its features. Because individual features may vary in their relevance to a sequence-labeling task, we use weights to indicate their importance to the overall distance between the domains. We set the weight wf for feature f proportional to the norm of the CRF parameters related to f in the trained POS tagger. That is, let θ be the CRF parameters for our trained POS tagger, and let θf = {θl,v : l is a label for zi and v is a value of f}. We set wf ∝ ∥θf∥.
Definition 2

The JS domain divergence between domains S and T under representation R is the weighted sum of the per-feature divergences, Σf wf dJS(pf, qf).
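A minimal sketch of this weighted JS domain divergence follows, assuming that per-feature value distributions for each domain and per-feature weights (e.g., normalized CRF parameter norms) have already been computed; the toy inputs are hypothetical.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions over the same support."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def js_domain_divergence(feature_dists, weights):
    """feature_dists: {feature f: (p_f, q_f)} empirical value distributions in S and T.
    weights: {feature f: w_f}, e.g. proportional to the norm of the CRF parameters for f."""
    total_w = sum(weights.values())
    return sum(weights[f] * js_divergence(p, q)
               for f, (p, q) in feature_dists.items()) / total_w

if __name__ == "__main__":
    dists = {"word=gene":    ([0.001, 0.999], [0.02, 0.98]),   # (value present, absent) in S vs. T
             "suffix=-tion": ([0.05, 0.95],   [0.04, 0.96])}
    wts = {"word=gene": 2.0, "suffix=-tion": 1.0}
    print(js_domain_divergence(dists, wts))
```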
Blitzer (2008) uses a different notion of domain divergence to approximate the d1 divergence, which we also experimented with. He trains a CRF classifier on examples labeled with a tag indicating which domain the example was drawn from. We refer to this type of classifier as a domain classifier. Note that these should not be confused with our CRFs used for POS tagging, which take as input examples that are labeled with POS sequences. For the domain classifier, we tag every token from the WSJ domain as 0, and every token from the biomedical domain as 1. Blitzer then uses the accuracy of his domain classifier on a held-out test set as his measure of domain divergence. A high accuracy for the domain classifier indicates that the representation makes the two domains easy to separate, and thus high accuracy signifies a high domain divergence. To measure domain divergence using a domain classifier, we trained our representations on all of the unlabeled data for this task, as before. We then used 500 randomly sampled sentences from the WSJ domain and 500 randomly sampled biomedical sentences, labeled with 0 for the WSJ data and 1 for the biomedical data. We measured the error rate of our domain-classifier CRF as the average error rate across folds when performing three-fold cross-validation on these 1,000 sentences.
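The domain-classifier measure can be approximated with a short sketch in which a logistic regression over representation features stands in for the CRF domain classifier used in the article; the feature dictionaries in the example are hypothetical.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def domain_classifier_accuracy(source_feature_dicts, target_feature_dicts, folds=3):
    """Higher held-out accuracy at telling the domains apart means a larger divergence."""
    X_dicts = source_feature_dicts + target_feature_dicts
    y = [0] * len(source_feature_dicts) + [1] * len(target_feature_dicts)
    X = DictVectorizer().fit_transform(X_dicts)
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, y, cv=folds).mean()

if __name__ == "__main__":
    src = [{"word=bank": 1}, {"word=president": 1}, {"word=rates": 1}] * 10
    tgt = [{"word=gene": 1}, {"word=pathway": 1}, {"word=protein": 1}] * 10
    print(domain_classifier_accuracy(src, tgt))   # near 1.0: the domains are easy to separate
```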
Figure 5 plots the accuracies and JS domain divergences for our POS taggers. Figure 6 shows the difference between target-domain error and source-domain error as a function of JS domain divergence. Figures 7 and 8 show the same information, except that the x axis plots the accuracy of a domain classifier as the way of measuring domain divergence. These results give empirical support to Ben-David et al.'s (2007, 2010) theoretical analysis: Smaller domain divergence—whether measured by JS domain divergence or by the accuracy of a domain classifier—correlates strongly with better target-domain accuracy. Furthermore, smaller domain divergence correlates strongly with a smaller difference in the accuracy of the taggers on the source and target domains.
Although both the JS domain divergence and the domain classifier provide only approximations of the d1 metric for domain divergence, they agree very strongly: In both cases, the Lattice-Token-R representations had the lowest domain divergence, followed by the I-Hmm-Token-R representations, followed by Trad-R, with n-gram-R somewhere between Lattice-Token-R and I-Hmm-Token-R. The main difference between the two metrics appears to be that the JS domain divergence gives a greater domain divergence to the eight-layer Lattice-Token-R model and the n-gram-R, placing them past the four- through eight-layer I-Hmm-Token-R representations. The domain classifier places these models closer to the other Lattice-Token-R representations, just past the seven-layer I-Hmm-Token-R representation.
The domain divergences of all models, using both techniques for measuring divergence, remain significantly far from zero, even under the best representation. As a result, there is ample room to experiment with even less-divergent representations of the two domains, to see if they might yield ever-increasing target-domain accuracies. Note that this is not simply a matter of adding more layers to the layered models. The I-Hmm-Token-R model performed best with seven layers, and the eight-layer representation had about the same accuracy and domain divergence as the five-layer model. This may be explained by the fact that the I-HMM layers are trained independently, and so additional layers may be duplicating other ones, and causing the supervised classifier to overfit. But it also shows that our current methodology has no built-in technique for constraining the domain divergence in our representations—the decrease in domain divergence from our more sophisticated representations is a coincidental byproduct of our training methodology, but there is no guarantee that our current mechanisms will continue to decrease domain divergence simply by increasing the number of layers. An important consideration for future research is to devise explicit learning mechanisms that guide representations towards smaller domain divergences.
4.4 Domain Adaptation for Noun-Phrase Chunking and Chinese POS Tagging
We test the generality of our representations by using them for other tasks, domains, and languages. Here, we report on further sequence-labeling tasks in a domain adaptation setting: noun phrase chunking for adaptation from news text to biochemistry journals, and POS tagging in Mandarin for a variety of domains. In the next section, we describe the use of our representations in a weakly supervised information extraction task.
For chunking, the source-domain labeled training data consists of the CoNLL 2000 shared task data (Sections 15–18 of the WSJ portion of the Penn Treebank, labeled with chunk tags) (Tjong Kim Sang and Buchholz 2000); this labeled training set contains 8,936 sentences and 211,726 tokens. For test data, we used biochemistry journal data from the Open American National Corpus6 (OANC). One of the authors manually labeled 198 randomly selected sentences (5,361 tokens) from the OANC biochemistry text with noun-phrase chunk information.7 We focus on noun phrase chunks because they are relatively easy to annotate manually, yet contain a large variety of open-class words that vary from domain to domain. Twenty-three percent of chunks in the test set begin with an OOV word (especially adjective-noun constructions like “aqueous formation” and “angular recess”), and 29% begin with a word seen at most twice in training data; we refer to these as OOV chunks and rare chunks, respectively. For our unlabeled data, we use 15,000 sentences (358,000 tokens; Sections 13–19) of the Penn Treebank and 45,000 sentences (1,083,000 tokens) from the OANC's biochemistry section. We tested Trad-R (augmented with features for automatically generated POS tags), Lsa-R, n-gram-R, NB-R, Hmm-Token-R, I-Hmm-Token-R (seven layers, the best setting for POS tagging), and Lattice-Token-R (20 layers) representations.
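A brief sketch of how such OOV-chunk and rare-chunk statistics can be computed, under the assumption that `train_tokens` holds the word tokens of the labeled training data and `test_chunks` holds the test-set noun-phrase chunks as token lists (both names are hypothetical):

```python
# Hedged sketch of the OOV-chunk and rare-chunk statistics quoted above.
from collections import Counter

def chunk_sparsity_stats(train_tokens, test_chunks):
    freq = Counter(train_tokens)
    oov = sum(1 for chunk in test_chunks if freq[chunk[0]] == 0)   # first word unseen
    rare = sum(1 for chunk in test_chunks if freq[chunk[0]] <= 2)  # seen at most twice
    n = len(test_chunks)
    return oov / n, rare / n   # e.g., roughly 0.23 and 0.29 for this test set
```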
Figure 9 shows our NP chunking results for this domain adaptation task. The performance improvements for the HMM-based chunkers are impressive: Lattice-Token-R reduces error by 57% relative to Trad-R and comes close to state-of-the-art results for chunking on newswire text. The results suggest that this representation allows the CRF to generalize almost as well to out-of-domain text as to in-domain text. Improvements are greatest on OOV and rare chunks, where Lattice-Token-R improves over Trad-R by 0.17 and 0.09 F1 (absolute), respectively. Improvements for the single-layer Hmm-Token-R were smaller but still significant: a 36% relative reduction in error overall, and 32% for OOV chunks.
The improved performance of our HMM-based chunker led us to ask how well it could perform without some of its other features. We removed all tag and orthographic features, as well as all features for word types that appear fewer than 20 times in training. The resulting chunker still achieves 0.91 F1 on OANC data and 0.93 F1 on WSJ data (Section 20), outperforming the Trad-R system in both cases, and it has only 20% as many features as the baseline chunker, which greatly reduces training time. The representation features are thus more valuable to the chunker than the features from automatically produced tags and the features for all but the most common words.
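A minimal sketch of this pruning step, assuming features are named strings with hypothetical prefixes such as `tag=`, `ortho=`, and `word=`, and that `feature_counts` maps each feature to its training-set frequency:

```python
# Sketch of the feature-pruning experiment: drop tag and orthographic
# features and any word-type feature observed fewer than 20 times in
# training, keeping representation features and frequent word types.
def prune_features(feature_counts, min_word_count=20):
    kept = set()
    for feat, count in feature_counts.items():
        if feat.startswith(("tag=", "ortho=")):
            continue                                  # remove tag/orthographic features
        if feat.startswith("word=") and count < min_word_count:
            continue                                  # remove rare word-type features
        kept.add(feat)
    return kept
```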
For Chinese POS tagging, we use text from the UCLA Corpus of Written Chinese (Tao and Xiao 2007), which is part of the Lancaster Corpus of Mandarin Chinese (LCMC). The UCLA Corpus consists of 11,192 sentences of word-segmented and POS-tagged text in 13 genres (see Table 4). We use gold-standard word segmentation labels for training and testing. The LCMC tagset consists of 50 Chinese POS tags. On average, each genre contains 5,284 word tokens, for a total of 68,695 tokens across all genres. We use the ‘news’ genre as our source domain, from which we draw training and development data. For test data, we randomly select 20% of every other genre. For our unlabeled data, we use all of the ‘news’ text plus the remaining 80% of the texts from the other genres. As before, we replace hapax legomena in the unlabeled data with the special symbol *UNKNOWN*, and do the same for word types in the labeled test sets that never appear in our unlabeled training texts. We compare against a state-of-the-art Chinese POS tagger for in-domain text, the CRF-based Stanford tagger (Tseng, Jurafsky, and Manning 2005). We obtained the code for this tagger8 and retrained it on our training data.
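A minimal sketch of this preprocessing step, assuming sentences are lists of word strings (function and variable names are illustrative, not taken from our code):

```python
# Hedged sketch: replace hapax legomena in the unlabeled data with *UNKNOWN*,
# and map test-set word types never seen in the unlabeled text to the same symbol.
from collections import Counter

def replace_rare(unlabeled_sents, test_sents, unk="*UNKNOWN*"):
    counts = Counter(w for sent in unlabeled_sents for w in sent)
    unlabeled = [[w if counts[w] > 1 else unk for w in sent] for sent in unlabeled_sents]
    test = [[w if counts[w] > 0 else unk for w in sent] for sent in test_sents]
    return unlabeled, test
```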
Table 4. Chinese POS tagging accuracy (%) by domain.

Domain | Stanford | Trad | NGr | LSA | NB | HMM | I-H | LAT |
---|---|---|---|---|---|---|---|---|
lore | 88.4 | 84.0 | 84.2 | 85.3 | 85.3 | 89.7 | 89.9 | 90.1* |
religion | 83.5 | 79.1 | 79.4 | 79.8 | 80.0 | 85.2 | 85.6 | 85.9* |
humour | 89.0 | 84.2 | 84.5 | 86.2 | 86.8 | 89.6 | 89.6 | 89.9* |
general-fic | 87.5 | 84.5 | 85.0 | 85.3 | 85.7 | 89.4 | 89.7 | 89.9* |
essay | 88.4 | 83.2 | 83.7 | 84.0 | 84.3 | 89.0 | 89.1 | 90.1* |
mystery | 87.4 | 82.4 | 83.4 | 84.3 | 85.3 | 90.1 | 91.1 | 91.3** |
romance | 87.5 | 84.2 | 84.5 | 85.3 | 86.1 | 89.0 | 89.5 | 89.8** |
science-fic | 88.6 | 82.1 | 82.5 | 83.0 | 83.0 | 87.0 | 88.3 | 88.6 |
skills | 82.7 | 77.3 | 77.7 | 78.2 | 78.4 | 84.9 | 85.0 | 85.1** |
science | 86.0 | 82.0 | 82.3 | 82.4 | 82.4 | 87.8 | 87.8 | 87.9* |
adventure-fic | 82.1 | 74.3 | 75.2 | 76.1 | 77.8 | 81.7 | 82.0 | 82.2 |
report | 91.7 | 84.2 | 85.1 | 85.3 | 86.1 | 91.9 | 91.9 | 91.9 |
news | 98.8** | 96.9 | 92.3 | 93.4 | 94.3 | 94.2 | 97.0 | 97.1 |
all but news | 87.0 | 81.2 | 82.0 | 82.8 | 83.6 | 88.1 | 88.4 | 88.8** |
all domains | 88.7 | 83.2 | 83.6 | 84.4 | 85.5 | 89.5 | 89.7 | 90.0** |
The Chinese POS tagging results are shown in Table 4. Lattice-Token-R outperforms the state-of-the-art Stanford tagger on all target domains; over all out-of-domain tests, it provides a relative reduction in error of 13.8% compared with the Stanford tagger. The best performance is on the ‘mystery’ domain, where the Lattice-Token-R model reaches 91.3% accuracy, an improvement of 3.9 percentage points over the Stanford tagger. Its performance on the in-domain ‘news’ test set is significantly worse (by 1.7 percentage points) than that of the Stanford tagger, suggesting that the Stanford tagger relies on domain-dependent features that are helpful for tagging news but not for tagging in general. Lattice-Token-R's accuracy is still significantly worse on out-of-domain text than on in-domain text, but the gap between the two (8.3 percentage points) is smaller than the corresponding gap for the Stanford tagger (11.8 percentage points). We believe that the lower out-of-domain performance of our Chinese POS tagger, compared with our English POS tagger and our chunker, is at least partly due to the far smaller amount of unlabeled text available for this task.
5. Information Extraction Experiments
In this section, we evaluate our learned representations on their ability to capture semantic, rather than syntactic, information. Specifically, we investigate a set-expansion task in which we are given a corpus and a few “seed” noun phrases from a semantic category (e.g., Superheroes), and our goal is to identify other examples of the category in the corpus. This is a different type of weakly supervised task from the earlier domain adaptation tasks because we are given only a handful of positive examples of a category, rather than a large sample of positively and negatively labeled training examples from a separate domain.
Existing set-expansion techniques utilize the distributional hypothesis: Candidate noun phrases for a given semantic class are ranked based on how similar their contextual distributions are to those of the seeds. Here, we measure how performance on the set-expansion task varies when we employ different representations for the contextual distributions.
5.1 Methods
The set-expansion task we address is formalized as follows. Given a corpus, a set of seeds from some semantic category C, and a separate set of candidate phrases P, output a ranking of the phrases in P in decreasing order of likelihood of membership in the semantic category C.
For any given representation R, the set-expansion algorithm we investigate is straightforward: We rank candidate phrases in increasing order of the distance between their feature vectors and those of the seeds. The particular distance metrics utilized are detailed subsequently.
Because set expansion is performed at the level of word types rather than tokens, it requires type-based representations. We compare Hmm-Type-R, n-gram-R, Lattice-Type-R, and Brown-Type-R in this experiment. We used a 25-state HMM, and the Lattice-Type-R as described in the previous section. Following previous set-expansion experiments with n-grams (Ahuja and Downey 2010), we use a trigram model with Kneser-Ney smoothing for n-gram-R.
The distances between the candidate phrases and the seeds for the Hmm-Type-R, n-gram-R, and Lattice-Type-R representations are calculated by first creating a prototypical “seed feature vector” equal to the mean of the seeds' feature vectors in the given representation. We then rank candidate phrases in order of increasing distance between their feature vector and the seed feature vector. As the distance between vectors (in this case, probability distributions), we compute the average of five standard measures: KL divergence, JS divergence, and cosine, Euclidean, and L1 distance. In experiments, we found this simple averaging difficult to improve upon; in fact, tuning a weighted average of the distance measures for each representation did not significantly improve results on held-out data.
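A hedged sketch of this ranking step, assuming each phrase is represented by a probability distribution over latent states; cosine distance is taken here as one minus cosine similarity, and all names are illustrative:

```python
# Sketch: rank candidates by average of five distances to the mean seed vector.
import numpy as np

def _kl(p, q, eps=1e-12):
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def _js(p, q):
    m = 0.5 * (p + q)
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)

def avg_distance(p, q):
    cos = 1.0 - float(np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q)))
    euc = float(np.linalg.norm(p - q))
    l1 = float(np.abs(p - q).sum())
    return (_kl(p, q) + _js(p, q) + cos + euc + l1) / 5.0

def rank_candidates(seed_vectors, candidate_vectors):
    """candidate_vectors: dict phrase -> distribution; returns phrases sorted
    by increasing average distance to the mean seed vector."""
    seed_mean = np.mean(np.stack(seed_vectors), axis=0)
    return sorted(candidate_vectors,
                  key=lambda p: avg_distance(candidate_vectors[p], seed_mean))
```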
For Brown clusters, we use prefixes of all possible lengths as features. We define the similarity between two Brown representation feature vectors to be the number of features they share, which is equivalent to the length of the longest common prefix of the two original Brown cluster labels. The candidate phrases are then ranked in decreasing order of the sum of their similarity scores to the seeds. We experimented with normalizing the similarity scores by the longer of the two vector lengths, but found that this slightly decreased performance, so we use unnormalized (integer) similarity scores for Brown clusters in our experiments.
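A minimal sketch of this similarity, assuming Brown cluster labels are bit strings (names are illustrative):

```python
# Sketch: Brown-cluster similarity = length of longest common prefix of the
# two cluster bit strings; candidates ranked by total similarity to the seeds.
import os

def brown_similarity(code_a, code_b):
    return len(os.path.commonprefix([code_a, code_b]))

def rank_by_brown(seed_codes, candidate_codes):
    """candidate_codes: dict phrase -> Brown bit string; higher total = better."""
    return sorted(candidate_codes,
                  key=lambda p: sum(brown_similarity(candidate_codes[p], s)
                                    for s in seed_codes),
                  reverse=True)
```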
5.2 Data Sets
We utilized a set of approximately 100,000 sentences of Web text, joining multi-word named entities in the corpus into single tokens using the Lex algorithm (Downey, Broadhead, and Etzioni 2007). This process enables each named entity (the focus of the set-expansion experiments) to be treated as a single token, with a single representation vector for comparison. We developed all word type representations using this corpus.
To obtain examples of multiple semantic categories, we utilized selected Wikipedia “listOf” pages from Pantel et al. (2009) and augmented these with our own manually defined categories, such that each list contained at least ten distinct examples occurring in our corpus. In all, we had 432 examples across 16 distinct categories such as Countries, Greek Islands, and Police TV Dramas.
5.3 Results
For each semantic category, we tested five different random selections of five seed examples, treating the unselected members of the category as positive examples, and all other candidate phrases as negative examples. We evaluate using the area under the precision-recall curve (AUC) metric.
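A hedged sketch of this evaluation protocol; our exact evaluation code is not shown, so this uses scikit-learn's average_precision_score as a standard estimate of area under the precision-recall curve, and `members` and `rank_fn` are hypothetical stand-ins.

```python
# Sketch: for each category, sample five seeds, treat remaining members as
# positives and all other ranked phrases as negatives, and average AUC-PR.
import random
from sklearn.metrics import average_precision_score

def evaluate_category(members, rank_fn, n_seeds=5, trials=5, seed=0):
    """members: set of category phrases; rank_fn(seeds) returns all candidate
    phrases ranked best-first. Assumes at least one positive per trial."""
    rng = random.Random(seed)
    aucs = []
    for _ in range(trials):
        seeds = rng.sample(sorted(members), n_seeds)
        positives = members - set(seeds)
        candidates = [p for p in rank_fn(seeds) if p not in seeds]
        y_true = [1 if p in positives else 0 for p in candidates]
        y_score = [-i for i, _ in enumerate(candidates)]   # higher = earlier rank
        aucs.append(average_precision_score(y_true, y_score))
    return sum(aucs) / len(aucs)
```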
The results are shown in Table 5. All representations improve over a random baseline (the average AUC over five random orderings for each category), and the graphical models outperform the n-gram representation. I-Hmm-Type-R and Brown clustering with 1,000 clusters perform best, with Hmm-Type-R performing nearly as well. Brown clustering gives somewhat lower results for other numbers of clusters.
Table 5. Set expansion results: area under the precision-recall curve (AUC) by representation.

Model | AUC |
---|---|
I-Hmm-Type-R | 0.18 |
Hmm-Type-R | 0.17 |
Brown-Type-R-3200 | 0.16 |
Brown-Type-R-1000 | 0.18 |
Brown-Type-R-320 | 0.15 |
Brown-Type-R-100 | 0.13 |
Lattice-Type-R | 0.11 |
n-gram-R baseline | 0.10 |
Random baseline | 0.10 |
As with POS tagging, we expect that language model representations improve performance on the IE task by providing informative features for sparse word types. Because the IE task classifies word types rather than tokens, however, we expect the representations to provide less benefit for polysemous word types. To test these hypotheses, we measured how IE performance changed in sparse and polysemous settings. We identified polysemous categories as those for which fewer than 90% of the category members had the category as a clear dominant sense (estimated manually); other categories were considered non-polysemous. Categories whose members had a median number of occurrences in the corpus below 30 were deemed sparse, and the others non-sparse. IE performance on these subsets of the data is shown in Table 6. Both graphical-model representations outperform the n-gram representation by a larger margin on sparse words, as expected. For polysemy, the picture is mixed: Lattice-Type-R outperforms n-Gram-R on polysemous categories, whereas Hmm-Type-R's performance advantage over n-Gram-R decreases.
One surprise on the IE task is that Lattice-Type-R performs significantly worse than Hmm-Type-R, whereas the reverse is true for POS tagging. We suspect the difference is due to the distinction between classifying types and classifying tokens. Because of their more complex structure, PL-MRFs tend to depend more on transition parameters than HMMs do. Furthermore, our decision to train the PL-MRFs using contrastive estimation with a neighborhood that swaps consecutive pairs of words also tends to emphasize transition parameters. As a result, we believe the posterior distribution over latent states given a word type is more informative in our HMM model than in the PL-MRF model. We measured the entropy of these distributions for the two models and found that H(P_PL-MRF(y | x = w)) = 9.95 bits, compared with H(P_HMM(y | x = w)) = 2.74 bits, which supports the hypothesis that the drop in the PL-MRF's performance on IE is due to its dependence on transition parameters. Further experiments are warranted to investigate this issue.
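A minimal sketch of this diagnostic, assuming `posteriors` maps each word type to its posterior distribution P(y | x = w) under a given model (names are illustrative):

```python
# Sketch: mean entropy, in bits, of a model's posterior over latent states
# for each word type, averaged over word types.
import numpy as np

def mean_posterior_entropy(posteriors, eps=1e-12):
    ents = []
    for dist in posteriors.values():
        p = np.asarray(dist, dtype=float)
        p = p / p.sum()
        ents.append(-np.sum(p * np.log2(p + eps)))
    return float(np.mean(ents))
```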
Table 6. AUC on polysemous/non-polysemous and sparse/non-sparse subsets; rows labeled “− n-Gram-R” give each representation's difference from n-Gram-R.

 | polysemous | not-polysemous | sparse | not-sparse |
---|---|---|---|---|
types | 222 | 210 | 266 | 166 |
categs. | 12 | 4 | 13 | 3 |
n-Gram-R | 0.07 | 0.17 | 0.06 | 0.25 |
Lattice-Type-R | 0.09 | 0.15 | 0.10 | 0.19 |
− n-Gram-R | +0.02 | −0.02 | +0.04 | −0.06 |
HMM-Type-R | 0.14 | 0.26 | 0.15 | 0.32 |
− n-Gram-R | +0.07 | +0.09 | +0.09 | +0.07 |
5.4 Testing the Language Model Representation Hypothesis in IE
The language model representation hypothesis (Section 2) suggests that all else being equal, more accurate language models will provide features that lead to better performance on NLP tasks. Here, we test this hypothesis on the set expansion IE task.
Figures 10 and 11 show how the performance of Hmm-Type-R varies with the language modeling accuracy of the underlying HMM, measured as perplexity on held-out text. Here, we use set expansion data sets from previous work (Ahuja and Downey 2010). The first two, denoted Unary (361 examples) and Binary (265 examples), are composed of extractions from the TextRunner information extraction system (Banko et al. 2007). The third, Wikipedia (2,264 examples), is a sample of Wikipedia concept names. We evaluate several trained HMMs with numbers of latent states K ranging from 5 to 1,600 (to help illustrate how IE and LM performance vary even when model capacity is fixed, we include three distinct models with K = 100 states trained separately over the full corpus). We used a distributed implementation of HMM training and corpus-partitioning techniques (Yang, Yates, and Downey 2013) to enable training our larger-capacity HMM models on large data sets.
The results provide support for the language model representation hypothesis, showing that IE performance does tend to improve as language model perplexity decreases. On the smaller Unary and Binary sets (Figure 10), although IE accuracy does decrease for the lowest-perplexity models, overall language model perplexity exhibits a negative correlation with IE area under the precision-recall curve (the Pearson correlation coefficient is −0.18 for Unary, and −0.28 for Binary). For Wikipedia (Figure 11), the trend is more consistent, with IE performance increasing monotonically as perplexity decreases for models trained on the full training corpus (the Pearson correlation coefficient is −0.90).
Figure 11 also illustrates how LM and IE performance change as the amount of training text varies. In general, increasing the training corpus size increases IE performance and decreases perplexity. Over all data points in the figure, IE performance correlates most strongly with model perplexity (−0.68 Pearson correlation, −0.88 Spearman correlation), followed by corpus size (0.66, 0.71) and model capacity (−0.05, 0.38). The small negative Pearson correlation between model capacity and IE performance is primarily due to the model with 1,600 states trained on 4% of the corpus; this model has a large parameter space and sparse training data, and thus suffers from overfitting in terms of both model perplexity and IE performance. If we ignore this overfit model, the Pearson correlation between model capacity and IE performance for the remaining models in the figure is 0.24.
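A small sketch of how such correlations can be computed with SciPy; the input lists of per-model perplexities and AUC scores are placeholders:

```python
# Sketch: Pearson and Spearman correlation between held-out perplexity and
# IE area under the precision-recall curve across trained models.
from scipy.stats import pearsonr, spearmanr

def perplexity_auc_correlation(perplexities, aucs):
    pearson, _ = pearsonr(perplexities, aucs)
    spearman, _ = spearmanr(perplexities, aucs)
    return pearson, spearman   # expected negative: lower perplexity, higher AUC
```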
Our results show that IE based on distributional similarity tends to improve as the quality of the latent variable model used to measure distributional similarity improves. A similar trend was exhibited in our previous work (Ahuja and Downey 2010); here, we extend the previous results to models with more latent states and a larger, more reliable test set (Wikipedia). The results suggest that scaling up the training of latent variable models to utilize larger training corpora and more latent states may be a promising direction for improving IE capabilities.
6. Conclusion and Future Work
Our study of representation learning demonstrates that by using statistical language models to aggregate information across many unannotated examples, it is possible to find accurate distributional representations that provide highly informative features to weakly supervised sequence labelers and named-entity classifiers. For both domain adaptation and weakly supervised set expansion, our results indicate that graphical models outperform n-gram models as representations, in part because of their greater ability to handle sparsity and polysemy. Our IE task provides important evidence in support of the Language Model Representation Hypothesis, showing that the AUC of the IE system correlates more strongly with language model perplexity than with the size of the training data or the capacity of the language model. Finally, our sequence labeling experiments provide empirical evidence in support of theoretical work on domain adaptation, showing that target-domain tagging accuracy is highly correlated with two different measures of domain divergence.
Representation learning remains a promising area for finding further improvements in various NLP tasks. The representations we have described are trained in an unsupervised fashion, so a natural extension is to investigate supervised or semi-supervised representation-learning techniques. As mentioned previously, our current techniques have no built-in methods for enforcing that they provide similar features in different domains; devising a mechanism that enforces this could allow for less domain-divergent and potentially more accurate representations. We have considered sequence labeling, but another promising direction is to apply these techniques to more complex structured prediction tasks, like parsing or relation extraction. Our current approach to sequence labeling requires retraining of a CRF for every new domain; incremental retraining techniques for new domains would speed up the process. Finally, models that combine our representation learning approach with instance weighting and other forms of supervised domain adaptation may take better advantage of labeled data in target domains, when it is available.
Acknowledgments
This material is based on work supported by the National Science Foundation under grant no. IIS-1065397.
Notes
Compare with Dhillon, Foster, and Ungar (2011), who use canonical correlation analysis to find a simultaneous reduction of the left and right context vectors, a significantly more complex undertaking.
Percy Liang's implementation is available at http://metaoptimize.com/projects/wordreprs/.
This representation is only feasible for small numbers of layers, and in our experiments that require type representations, we used M = 10. For larger values of M, other representations are possible. We also experimented with a more compact representation containing only M features: for each layer l, we included P(y_l = 0 | w) as a feature. We used the less compact representation in our experiments because results were better.
Available from http://sourceforge.net/projects/crf/.
For simplicity, the definition we provide here works only for discrete features, although it is possible to extend this definition to continuous-valued features.
Available from http://www.anc.org/OANC/.
The labeled data for this experiment are available from the first author's Web site.
Available at http://nlp.stanford.edu/software/tagger.shtml.
Author notes
1805 N. Broad St., Wachman Hall 324, Philadelphia, PA 19122, USA. E-mail: {fei.huang,yuhong,yates}@temple.edu.
2133 Sheridan Road, Evanston, IL, 60208. E-mail: [email protected].
2133 Sheridan Road, Evanston, IL, 60208. E-mail: [email protected].
2133 Sheridan Road, Evanston, IL, 60208. E-mail: [email protected].