Abstract
Fine-grained opinion analysis methods often make use of linguistic features but typically do not take the interaction between opinions into account. This article describes a set of experiments that demonstrate that relational features, mainly derived from dependency-syntactic and semantic role structures, can significantly improve the performance of automatic systems for a number of fine-grained opinion analysis tasks: marking up opinion expressions, finding opinion holders, and determining the polarities of opinion expressions. These features make it possible to model the way opinions expressed in natural-language discourse interact in a sentence over arbitrary distances. The use of relations requires us to consider multiple opinions simultaneously, which makes the search for the optimal analysis intractable. However, a reranker can be used as a sufficiently accurate and efficient approximation.
A number of feature sets and machine learning approaches for the rerankers are evaluated. For the task of opinion expression extraction, the best model shows a 10-point absolute improvement in soft recall on the MPQA corpus over a conventional sequence labeler based on local contextual features, while precision decreases only slightly. Significant improvements are also seen for the extended tasks where holders and polarities are considered: 10 and 7 points in recall, respectively. In addition, the systems outperform previously published results for unlabeled (6 F-measure points) and polarity-labeled (10–15 points) opinion expression extraction. Finally, as an extrinsic evaluation, the extracted MPQA-style opinion expressions are used in practical opinion mining tasks. In all scenarios considered, the machine learning features derived from the opinion expressions lead to statistically significant improvements.
1. Introduction
Automatic methods for the analysis of opinions (textual expressions of emotions, beliefs, and evaluations) have attracted considerable attention in the natural language processing community in recent years (Pang and Lee 2008). Apart from their interest from a linguistic and psychological point of view, the technologies emerging from this research have obvious practical uses, either as stand-alone applications or supporting other tools such as information retrieval or question answering systems.
The research community initially focused on high-level tasks such as retrieving documents or passages expressing opinion, or classifying the polarity of a given text, and these coarse-grained problem formulations naturally led to the application of methods derived from standard retrieval or text categorization techniques. The models underlying these approaches have used very simple feature representations such as purely lexical (Pang, Lee, and Vaithyanathan 2002; Yu and Hatzivassiloglou 2003) or low-level grammatical features such as part-of-speech tags and functional words (Wiebe, Bruce, and O'Hara 1999). This is in line with the general consensus in the information retrieval community that very little can be gained by complex linguistic processing for tasks such as text categorization and search (Moschitti and Basili 2004). There are a few exceptions, such as Karlgren et al. (2010), who showed that construction features added to a bag-of-words representation resulted in improved performance on a number of coarse-grained opinion analysis tasks. Similarly, Greene and Resnik (2009) argued that a speaker's attitude can be predicted from syntactic features such as the selection of a transitive or intransitive verb frame.
In contrast to the early work, recent years have seen a shift towards more detailed problem formulations where the task is not only to find a piece of opinionated text, but also to extract a structured representation of the opinion. For instance, we may determine the person holding the opinion (the holder) and towards which entity or fact it is directed (the topic), whether it is positive or negative (the polarity), and the strength of the opinion (the intensity). The increasing complexity of representation leads us from retrieval and categorization deep into natural language processing territory; the methods used here have been inspired by information extraction and semantic role labeling, combinatorial optimization, and structured machine learning. For such tasks, deeper representations of linguistic structure have seen more use than in the coarse-grained case. Syntactic and shallow-semantic relations have repeatedly proven useful for subtasks of opinion analysis that are relational in nature, above all for determining the holder or topic of a given opinion, in which case there is considerable similarity to tasks such as semantic role labeling (Kim and Hovy 2006; Ruppenhofer, Somasundaran, and Wiebe 2008).
There has been no systematic research, however, on the role played by linguistic structure in the relations between opinions expressed in text, despite the fact that the opinion expressions in a sentence are not independent but organized rhetorically to achieve a communicative effect intended by the speaker. We therefore expect that the interplay between opinion expressions can be exploited to derive information useful for the analysis of opinions expressed in text. In this article, we start from this intuition and propose several novel features derived from the interdependencies between opinion expressions on the syntactic and shallow-semantic levels.
Based on these features, we devised structured prediction models for (1) extraction of opinion expressions, (2) joint expression extraction and holder extraction, and (3) joint expression extraction and polarity labeling. The models were trained using a number of discriminative machine learning methods. Because the interdependency features required us to consider more than one opinion expression at a time, the inference steps carried out at training and prediction time could not rely on commonly used opinion expression mark-up methods based on Viterbi search, but we show that an approximate search method using reranking suffices for this purpose: In a first step a base system using local features and efficient search generates a small set of hypotheses, and in a second step a classifier using the complex features selects the final output from the hypothesis set. This approach allows us to make use of arbitrary features extracted from the complete set of opinion expressions in a sentence, without having to impose any restriction on the expressivity of the features. An additional advantage is that it is fairly easy to implement as long as the underlying system is able to generate k-best output.
The interaction-based reranking systems were evaluated on a test set extracted from the MPQA corpus, and compared to strong baselines consisting of stand-alone systems for opinion expression mark-up, opinion holder extraction, and polarity classification. Our evaluations showed that (1) the best opinion expression mark-up system we evaluated achieved a 10-point absolute improvement in soft recall, and a 5-point improvement in F-measure, over the baseline sequence labeler. Our system outperformed previously described opinion expression mark-up tools by six points in overlap F-measure. (2) The recall was boosted by almost 10 points for the holder extraction task (over three points in F-measure) by modeling the interaction of opinion expressions with respect to holders. (3) For the extraction of polarity-labeled expressions, we saw an improvement of four F-measure points. Our result for opinion extraction and polarity labeling is especially striking when compared with the best previously published end-to-end system for this task: 10–15 points in F-measure improvement. In addition to the performance evaluations, we studied the impact of features on the subtasks, and the effect of the choice of the machine learning method for training the reranker.
As a final extrinsic evaluation of the system, we evaluated the usefulness of its output in a number of applications. Although there have been several publications detailing the extraction of MPQA-style opinion expressions, as far as we are aware there has been no attempt to use them in an application. In contrast, we show that the opinion expressions as defined by the MPQA corpus may be used to derive machine learning features that are useful in two practical opinion mining tasks; the addition of these features leads to statistically significant improvements in all scenarios we evaluated. First, we develop a system for the extraction of evaluations of product attributes from product reviews (Hu and Liu 2004a; Hu and Liu 2004b; Popescu and Etzioni 2005; Titov and McDonald 2008), and we show that the features derived from opinion expressions lead to significant improvement. Secondly, we show that fine-grained opinion structural information can even be used to build features that improve a coarse-grained sentiment task: document polarity classification of reviews (Pang, Lee, and Vaithyanathan 2002; Pang and Lee 2004).
After the present introduction, Section 2 gives a linguistic motivation and an overview of the related work; Section 3 describes the MPQA opinion corpus and its underlying representation; Section 4 illustrates the baseline systems: a sequence labeler for the extraction of opinion expressions and classifiers for opinion holder extraction and polarity labeling; Section 5 reports on the main contribution: the interaction-based models and the reranking approach used to apply them; Section 6 gives an overview of the interaction features; finally, Section 7 presents the experimental results and Section 8 derives the conclusions.
2. Motivation and Related Work
Intuitively, interdependency features could be useful in the process of locating and disambiguating expressions of opinion. These expressions tend to occur in patterns, and the presence of one opinionated piece of text may influence our interpretation of another as opinionated or not. Consider, for instance, the word said in sentences (a) and (b) in Example (1), where the presence or absence of emotionally loaded words in the complement clause provides evidence for the interpretation as a subjective opinion or an objective speech event. (In the example, opinionated expressions are marked S for subjective and the non-opinionated speech event is marked O for objective.)
Example (1)
- (a)
“We will identify the [culprits]s of these clashes and [punish]s them,” he [said]s.
- (b)
On Monday, 80 Libyan soldiers disembarked from an Antonov transport plane carrying military equipment, an African diplomat [said]o.
Moreover, opinions expressed in a sentence are interdependent when it comes to the resolution of their holders—the person or entity having the attitude expressed in the opinion expression. Clearly, the structure of the sentence is also influential for this task because certain linguistic constructions provide evidence for opinion holder correlation. In the most obvious case, shown in the two sentences in Example (2), pejorative words share the opinion holder with the communication and categorization verbs dominating them. (Here, opinions are marked S and holders H.)
Example (2)
- (a)
[Domestic observers]h [tended to side with]s the MDC, [denouncing]s the election as [fraud-tainted]s and [unfair]s.
- (b)
[Bush]h [labeled]s North Korea, Iran and Iraq an “[axis of evil]s.”
In addition, interaction is important when determining opinion polarity. Here, relations that influence polarity interpretation include coordination, verb–complement, as well as other types of discourse relations. In particular, the presence of a Comparison discourse relation, such as contrast or concession (Prasad et al. 2008), may allow us to infer that opinion expressions have different polarities. In Example (3), we see how contrastive discourse connectives (underlined) are used when there are contrasting polarities in the surrounding opinion expressions. (Positive opinions are tagged ‘+’, negative ‘−’.)
Example (3)
- (a)
“[This is no blind violence but rather targeted violence]−,” Annemie Neyts [said]−. “However, the movement [is more than that]+.”
- (b)
“[Trade alone will not save the world]−,” Neyts [said]−, but it constitutes an [important]+ factor for economic development.
The problems we focus on in this article—extracting opinion expressions with holders and polarity labeling—have naturally been studied previously, especially since the release of the MPQA corpus (Wiebe, Wilson, and Cardie 2005). For the first subtask, because the MPQA corpus uses span-based annotation to represent opinions, it is natural to apply straightforward sequence labeling techniques to extract them. This idea has resulted in a number of publications (Choi, Breck, and Cardie 2006; Breck, Choi, and Cardie 2007). Such systems do not use any features describing the interaction between opinions, and it would not be possible to add interaction features because a Viterbi-based sequence labeler by construction is restricted to using local features in a small contextual window.
Works using syntactic features to extract topics and holders of opinions are numerous (Bethard et al. 2005; Kobayashi, Inui, and Matsumoto 2007; Joshi and Penstein-Rosé 2009; Wu et al. 2009). Semantic role analysis has also proven useful: Kim and Hovy (2006) used a FrameNet-based semantic role labeler to determine holder and topic of opinions. Similarly, Choi, Breck, and Cardie (2006) successfully used a PropBank-based role labeler for opinion holder extraction, and Wiegand and Klakow (2010) recently applied tree kernel learning methods on a combination of syntactic and semantic role trees for extracting holders, but did not consider their relations to the opinion expressions. Ruppenhofer, Somasundaran, and Wiebe (2008) argued that semantic role techniques are useful but not completely sufficient for holder and topic identification, and that other linguistic phenomena must be studied as well. Choi, Breck, and Cardie (2006) built a joint model of opinion expression extraction and holder extraction and applied integer linear programming to carry out the optimization step.
While the tasks of opinion expression detection and polarity classification of opinion expressions (Wilson, Wiebe, and Hoffmann 2009) have mostly been studied in isolation, Choi and Cardie (2010) developed a sequence labeler that simultaneously extracted opinion expressions and assigned them polarity values; this is so far the only published result on joint opinion segmentation and polarity classification. Their experiment, however, lacked the obvious baseline: a standard pipeline consisting of an expression tagger followed by a polarity classifier. In addition, although their model is the first end-to-end system for opinion expression extraction and polarity classification, it is still based on sequence labeling and thus by construction limited in feature expressivity.
On a conceptual level, discourse-oriented approaches (Asher, Benamara, and Mathieu 2009; Somasundaran et al. 2009; Zirn et al. 2011) applying interaction features for polarity classification are arguably the most related because they are driven by a vision similar to ours: Individual opinion expressions interplay in discourse and thus provide information about each other. On a practical level there are obvious differences, since our features are extracted from syntactic and shallow-semantic linguistic representations, which we argue are reflections of discourse structure, while they extract features directly from a discourse representation. It is doubtful whether automatic discourse representation extraction in text is currently mature enough to provide informative features, whereas syntactic parsing and shallow-semantic analysis are today fairly reliable. Another related line of work is represented by Choi and Cardie (2008), where interaction features based on compositional semantics were used in a joint model for the assignment of polarity values to pre-segmented opinion expressions in a sentence.
3. The MPQA Corpus and its Annotation of Opinion Expressions
The most detailed linguistic resource useful for developing automatic systems for opinion analysis is the MPQA corpus (Wiebe, Wilson, and Cardie 2005). In this article, we use the word opinion in its broadest sense, equivalent to the word private state used by the MPQA annotators (page 2): “opinions, emotions, sentiments, speculations, evaluations, and other private states (Quirk et al. 1985), i.e., internal states that cannot be directly observed by others.”
The central building block in the MPQA annotation is the opinion expression (or subjective expression): A text piece that expresses a private state, allowing us to draw the conclusion that someone has a particular emotion or belief about some topic. Identifying these units allows us to carry out further analysis, such as the determination of opinion holder and the polarity of the opinion. The annotation scheme defines two types of opinion expressions: direct subjective expressions (DSEs), which are explicit mentions of attitude or evaluation, and expressive subjective elements (ESEs), which signal the attitude of the speaker by the choice of words. The prototypical example of a DSE would be a verb of statement, feeling, or categorization such as praise or disgust. ESEs, on the other hand, are less easy to categorize syntactically; prototypical examples would include simple value-expressing adjectives such as beautiful and strongly charged words like appeasement or big government. In addition to DSEs and ESEs, the corpus also contains annotation for non-subjective statements, which are referred to as objective speech events (OSEs). Some words such as say may appear as DSEs or OSEs depending on the context, so for an automatic system there is a need for disambiguation.
Example (4) shows a number of sentences from the MPQA corpus where DSEs and ESEs have been manually annotated.
Example (4)
- (a)
He [made such charges]dse [despite the fact]ese that women's political, social, and cultural participation is [not less than that]ese of men.
- (b)
[However]ese, it is becoming [rather fashionable]ese to [exchange harsh words]dse with each other [like kids]ese.
- (c)
For instance, he [denounced]dse as a [human rights violation]ese the banning and seizure of satellite dishes in Iran.
- (d)
This [is viewed]dse as the [main impediment]ese to the establishment of political order in the country.
Every expression in the corpus is connected to an opinion holder, an entity that experiences the sentiment or utters the evaluation that appears textually in the opinion expression. For DSEs, it is often fairly straightforward to find the opinion holders because they tend to be realized as direct semantic arguments filling semantic roles such as Speaker or Experiencer—and the DSEs tend to be verbs or nominalizations. For ESEs, the connection between the expression and the opinion holder is typically less clear-cut than for DSEs; the holder is more frequently implicit or located outside the sentence for ESEs than for DSEs.
The MPQA corpus does not annotate links directly from opinion expressions to particular mentions of a holder entity. Instead, the opinions are connected to holder coreference chains that may span the whole text. Some opinion expressions are linked to entities that are not explicitly mentioned in the text. If this entity is the author of the text, it is called the writer, otherwise implicit. Example (5) shows sentences annotated with expressions and holders.
Example (5)
- (a)
For instance, [he]h1 [denounced]dse/h1 as a [human rights violation]ese/h1 the banning and seizure of satellite dishes in Iran.
- (b)
[(writer)]h1: [He]h2 [made such charges]dse/h2 [despite the fact]ese/h1 that women's political, social, and cultural participation is [not less than that]ese/h1 of men.
- (c)
[(implicit)]h1: This [is viewed]dse/h1 as the [main impediment]ese/h1 to the establishment of political order in the country.
Finally, MPQA associates opinion expressions (DSEs and ESEs, but not OSEs) with a polarity feature taking the values Positive, Negative, Neutral, and Both, and with an intensity feature taking the values Low, Medium, High, and Extreme. The two sentences in Example (6) from the MPQA corpus show opinion expressions with polarities. Positive polarity is represented with a ‘+’ and negative with a ‘-’.
Example (6)
- (a)
We foresaw electoral [fraud]− but not [daylight robbery]−.
- (b)
Join in this [wonderful]+ event and help Jameson Camp continue to provide the year-round support that gives kids a [chance to create dreams]+.
The corpus does not currently contain annotation of topics (evaluees) of opinions, although there have been efforts to add this separately (Stoyanov and Cardie 2008).
4. Baseline Systems for Fine-Grained Opinion Analysis
The assessment of our reranking-based systems requires us to compare against strong baselines. We developed (1) a sequence labeler for opinion expression extraction similar to that by Breck, Choi, and Cardie (2007), (2) a set of classifiers to determine the opinion holder, and (3) a multiclass classifier that assigns polarity to a given opinion expression similar to that described by Wilson, Wiebe, and Hoffmann (2009). These tools were also used to generate the hypothesis sets for the rerankers described in Section 5.
4.1 Sequence Labeler for Opinion Expression Mark-up
To extract opinion expressions, we implemented a standard sequence labeler for subjective expression mark-up similar to the approach by Breck, Choi, and Cardie (2007). The sequence labeler extracted basic grammatical and lexical features (word, lemma, and POS tag), as well as prior polarity and intensity features derived from the lexicon created by Wilson, Wiebe, and Hoffmann (2005), which we refer to as subjectivity clues. It is important to note that prior subjectivity does not always imply subjectivity in a particular context; this is why contextual features are essential for this task. The grammatical features and subjectivity clues were extracted in a window of size 3 around the word in focus. We encoded the opinionated expression brackets by means of the IOB2 scheme (Tjong Kim Sang and Veenstra 1999). When using this representation, we are unable to handle overlapping opinion expressions, but they are fortunately rare.
To exemplify, Figure 1 shows an example of a sentence and how it is processed by the sequence labeler. The ESE defenseless situation is encoded in IOB2 as two tags B-ESE and I-ESE. There are four input columns (words, lemmas, POS tags, subjectivity clues) and one output column (opinion expression tags in IOB2 encoding). The figure also shows the sliding window from which the feature extraction function can extract features when predicting an output tag (at the arrow).
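To make the encoding concrete, the following minimal sketch converts expression spans into IOB2 tags; the token list is a simplified rendering of the example sentence, and the helper function is illustrative rather than part of the described system.

```python
def encode_iob2(tokens, spans):
    """Encode labeled spans as IOB2 tags: B-X on the first token, I-X inside, O outside."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:          # 'end' is exclusive
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags

tokens = ["HRW", "denounced", "the", "defenseless", "situation", "of", "these", "prisoners"]
spans = [(1, 2, "DSE"), (3, 5, "ESE")]       # 'denounced' as a DSE, 'defenseless situation' as an ESE
for token, tag in zip(tokens, encode_iob2(tokens, spans)):
    print(token, tag)
# 'defenseless' receives B-ESE, 'situation' I-ESE, 'denounced' B-DSE, all other tokens O
```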
We trained the model using the method by Collins (2002), with a Viterbi decoder and the online Passive–Aggressive algorithm (Crammer et al. 2006) to estimate the model weights. The learning algorithm parameters were tuned on a development set. When searching for the best value of the C parameter, we varied it along a log scale from 0.001 to 100, and the best value was 0.1. We used the max-loss version of the algorithm and ten iterations through the training set.
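As an illustration of the decoding step, the following is a generic first-order Viterbi sketch over IOB2 tags; the score function is a placeholder standing in for the dot product between the learned weights and the features extracted at each position, not the feature model of the actual system. Generating the k-best sequences needed later for reranking (Section 5.2) requires a standard extension of this search.

```python
def viterbi(n_tokens, tags, score):
    """Return the highest-scoring tag sequence; score(i, prev_tag, tag) gives the local score."""
    delta = [{t: score(0, None, t) for t in tags}]   # best score of a sequence ending in tag t
    back = [{}]
    for i in range(1, n_tokens):
        delta.append({})
        back.append({})
        for t in tags:
            best_prev = max(tags, key=lambda p: delta[i - 1][p] + score(i, p, t))
            delta[i][t] = delta[i - 1][best_prev] + score(i, best_prev, t)
            back[i][t] = best_prev
    last = max(tags, key=lambda t: delta[-1][t])
    sequence = [last]
    for i in range(n_tokens - 1, 0, -1):             # follow the back-pointers
        sequence.append(back[i][sequence[-1]])
    return list(reversed(sequence))
```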
4.2 Classifiers for Opinion Holder Extraction
The problem of extracting opinion holders for a given opinion expression is in many ways similar to argument detection in semantic role labeling (Choi, Breck, and Cardie 2006; Ruppenhofer, Somasundaran, and Wiebe 2008). For instance, in the simplest case when the opinion expression is a verb of evaluation or categorization, finding the holder would entail finding a semantic argument such as an Experiencer or Communicator. We therefore approached this problem using methods inspired by semantic role labeling: Given an opinion expression in a sentence, we define binary classifiers that decide whether each noun phrase of the sentence is its holder or not. Separate classifiers were trained to extract holders for DSEs, ESEs, and OSEs.
In the following, we describe the feature set used by the classifiers; a sketch of how some of these features can be computed is given after the list. Our walkthrough example is given by the sentence in Figure 1. Some features are derived from the syntactic and shallow semantic analysis of the sentence, shown in Figure 2 (Section 6.1 gives more details on this).
Syntactic path. Similarly to the path feature widely used in semantic role labeling (SRL), we extract a feature representing the path in the dependency tree between the expression and the holder (Johansson and Nugues 2008). For instance, the path from denounced to HRW in the example is VC↑SBJ↓.
Shallow-semantic relation. If there is a direct shallow-semantic relation between the expression and the holder, we use a feature representing its semantic role, such as A0 between denounced and HRW.
Expression head word, POS, and lemma. denounced, VBD, denounce for denounced; situation, NN, situation for defenseless situation.
Head word and POS of holder candidate. HRW, NNP for HRW.
Dominating expression type. When locating the holder for the ESE defenseless situation, we extract a feature representing the fact that it is syntactically dominated by a DSE. At test time, the expression and its type were extracted automatically.
Context words and POS for holder candidate. Words adjacent to the left and right; for HRW we extract Right:has, Right:VBZ.
Expression Verb Voice. Similar to the common voice feature used in SRL. Takes the values Active, Passive, and None (for non-verbal opinion expressions). In the example, we get Active for denounced and None for defenseless situation.
Expression Dependency Relation to Parent. VC and OBJ, respectively.
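The sketch below illustrates how two of these features, the syntactic path and the right-context words of the holder candidate, could be computed from a dependency tree; the token encoding is a simplified stand-in for the CoNLL-style parser output and is an assumption made only for this illustration.

```python
SENT = [  # (id, form, pos, head, deprel); heads are 1-based, 0 denotes the root
    (1, "HRW",         "NNP", 2, "SBJ"),
    (2, "has",         "VBZ", 0, "ROOT"),
    (3, "denounced",   "VBD", 2, "VC"),
    (4, "the",         "DT",  6, "NMOD"),
    (5, "defenseless", "JJ",  6, "NMOD"),
    (6, "situation",   "NN",  3, "OBJ"),
    (7, "of",          "IN",  6, "NMOD"),
    (8, "these",       "DT",  9, "NMOD"),
    (9, "prisoners",   "NNS", 7, "PMOD"),
]
FORM = {i: f for i, f, p, h, r in SENT}
POS  = {i: p for i, f, p, h, r in SENT}
HEAD = {i: h for i, f, p, h, r in SENT}
REL  = {i: r for i, f, p, h, r in SENT}

def ancestors(i):
    """Nodes from token i up to the root, inclusive."""
    chain = [i]
    while HEAD[chain[-1]] != 0:
        chain.append(HEAD[chain[-1]])
    return chain

def syntactic_path(expr, holder):
    """Upward relation labels from the expression to the lowest common ancestor, then downward to the holder."""
    up, down = ancestors(expr), ancestors(holder)
    common = next(n for n in up if n in down)
    up_part = [REL[n] + "↑" for n in up[:up.index(common)]]
    down_part = [REL[n] + "↓" for n in reversed(down[:down.index(common)])]
    return "".join(up_part + down_part)

def right_context(holder):
    """Word and POS tag of the token immediately to the right of the holder candidate."""
    nxt = holder + 1
    return ["Right:" + FORM[nxt], "Right:" + POS[nxt]] if nxt in FORM else []

print(syntactic_path(3, 1))   # VC↑SBJ↓   (path from 'denounced' to 'HRW')
print(right_context(1))       # ['Right:has', 'Right:VBZ']
```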
There are also differences compared with typical argument extraction in SRL, however. First, as outlined in Section 3, it is important to note that the MPQA corpus does not annotate direct links from opinions to holders, but from opinions to holder coreference chains. To handle this issue, we used the following approach when training the classifier: We created a positive training instance for each member of the coreference chain occurring in the same sentence as the opinion, and negative training instances for all other noun phrases in the sentence. We do not use coreference information at test time, in order for the system not to rely on automatic coreference resolution.
A second complication is that in addition to the explicit noun phrases in the same sentence, an opinion may be linked to an implicit holder; a special case of implicit holder is the writer of the text. To detect implicit and writer links, we trained two separate classifiers that did not use the features requiring a holder phrase. We did not try to link opinion expressions to explicitly expressed holders outside the sentence; this is an interesting open problem with some connections to inter-sentential semantic role labeling, a problem whose study is in its infancy (Gerber and Chai 2010).
We implemented the classifiers as linear support vector machines (SVMs; Boser, Guyon, and Vapnik 1992) using the Liblinear software (Fan et al. 2008). To handle the restriction that every expression can have at most one holder, the classifier selects only the highest-scoring opinion holder candidate at test time. We tuned the learning parameters on a development set, and the best results were obtained with an L2-regularized L2-loss SVM and a C value of 1.
4.3 Polarity Classifier
Given an expression, we use a classifier to assign a polarity value: Positive, Neutral, or Negative. Following Choi and Cardie (2010), the polarity value Both was mapped to Neutral—the expressions having this value were in any case very few. In the cases where the polarity value was empty or missing, we set the polarity to Neutral. In addition, the annotators of the MPQA corpus may use special uncertainty labels in the case where the annotator was unsure of which polarity to assign, such as Uncertain-positive. In these cases, we just removed the uncertainty label.
We again trained SVMs to carry out this classification. The problem of polarity classification has been studied in detail by Wilson, Wiebe, and Hoffmann (2009), who used a set of carefully devised linguistic features. Our classifier is simpler and is based on fairly shallow features. Like the sequence labeler for opinion expressions, this classifier made use of the lexicon of subjectivity clues.
The feature set used by the polarity classifier consisted of the following features; a sketch of their extraction is given after the list. The examples come from the opinion expression defenseless situation in Figure 1.
Words in expression: defenseless, situation.
POS tags in expression: JJ, NN
Subjectivity clues of words in expression: None.
Word bigrams in expression: defenseless_situation.
Words before and after expression: B:the, A:of.
POS tags before and after expression: B:DT, A:IN.
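As an illustration, the following sketch computes these shallow features for the expression defenseless situation; the function and the feature name prefixes are illustrative assumptions rather than the exact implementation.

```python
def polarity_features(tokens, pos_tags, start, end, clues):
    """Shallow features for the expression tokens[start:end] ('end' exclusive)."""
    words, tags = tokens[start:end], pos_tags[start:end]
    feats  = ["W:" + w for w in words]                        # words in expression
    feats += ["P:" + t for t in tags]                         # POS tags in expression
    feats += ["CLUE:" + clues.get(w, "None") for w in words]  # subjectivity clues of the words
    feats += ["BI:" + a + "_" + b for a, b in zip(words, words[1:])]  # word bigrams
    if start > 0:
        feats += ["B:" + tokens[start - 1], "BP:" + pos_tags[start - 1]]  # word/POS before
    if end < len(tokens):
        feats += ["A:" + tokens[end], "AP:" + pos_tags[end]]              # word/POS after
    return feats

tokens = ["HRW", "has", "denounced", "the", "defenseless", "situation", "of", "these", "prisoners"]
pos    = ["NNP", "VBZ", "VBD", "DT", "JJ", "NN", "IN", "DT", "NNS"]
print(polarity_features(tokens, pos, 4, 6, clues={}))
# ['W:defenseless', 'W:situation', 'P:JJ', 'P:NN', 'CLUE:None', 'CLUE:None',
#  'BI:defenseless_situation', 'B:the', 'BP:DT', 'A:of', 'AP:IN']
```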
5. Fine-Grained Opinion Analysis with Interaction Features
Because there is a combinatorial number of ways to segment a sentence into opinion expressions, and label the opinion expressions with type labels (DSE, ESE, OSE) as well as polarities and opinion holders, the tractability of the opinion expression segmentation task will obviously depend on whether we impose restrictions on the problem in a way that allows for efficient inference. Most previous work (Choi, Breck, and Cardie 2006; Breck, Choi, and Cardie 2007; Choi and Cardie 2010) used Markov factorizations and could thus apply standard sequence labeling techniques where the argmax step was carried out using the Viterbi algorithm (as described in Section 4.1). As we argued in the introduction, however, it makes linguistic sense that opinion expression segmentation and other tasks could be improved if relations between possible expressions were considered; these relations can be syntactic or semantic in nature, for instance.
We will show that adding relational features causes the Markov assumption to break down and the problem of finding the best analysis to become computationally intractable. We thus have to turn to approximate inference methods based on reranking, which can be trained efficiently.
5.1 Formalization of the Model
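The models considered in this article score a candidate analysis y of a sentence x with a linear function of a joint feature representation Φ(x, y), so that prediction amounts to selecting the highest-scoring analysis:

ŷ = arg max_y w · Φ(x, y)     (1)

The feature representation contains local context features ΦL, of the kind used by the base sequence labeler in Section 4.1, as well as structural interaction features ΦNL, which are built up from pairwise feature functions φ defined over pairs of opinion expressions in the candidate analysis. The weight vector w is estimated on a training set as described in Section 5.3.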
5.2 Approximate Inference with Interaction Features
It is easy to see that the inference step arg max_y w · Φ(x, y) is NP-hard for unrestricted pairwise interaction feature representations φ: This class of models includes simpler ones such as loopy Markov random fields, where inference is known to be NP-hard and to require approximate approaches such as belief propagation. Although it is possible that search algorithms for approximate inference in our model could be devised along similar lines, we sidestepped this issue by using a reranking decomposition of the problem:
1. Apply a simple model based on local context features ΦL but no structural interaction features. Generate a small hypothesis set of size k.
2. Apply a complex model using interaction features ΦNL to pick the top hypothesis from the hypothesis set.
The hypothesis generation procedure becomes slightly more complex when polarity values and opinion holders of the opinion expressions enter the picture. In that case, we not only need hypotheses generated by a sequence labeler, but also the outputs of a secondary classifier: the holder extractor (Section 4.2) or the polarity classifier (Section 4.3). The hypothesis set generation thus proceeds as follows (a schematic sketch is given after the example below):
1. For a given sentence, let the base sequence labeler generate up to k1 sequences of unlabeled opinion expressions;
2. for every sequence, apply the secondary classifier to generate up to k2 outputs.
To illustrate this process we give a hypothetical example, assuming k1 = k2 = 2 and the sentence The appeasement emboldened the terrorists. We first apply the opinion expression extractor to generate a set of k1 possible segmentations of the sentence:
The [appeasement] emboldened the [terrorists]
The [appeasement] [emboldened] the [terrorists]
We then apply the secondary classifier (in this case the polarity classifier) to each segmentation, which yields up to k2 polarity-labeled alternatives per segmentation and results in the following hypothesis set:
The [appeasement]− emboldened the [terrorists]−
The [appeasement]− [emboldened]+ the [terrorists]−
The [appeasement]0 emboldened the [terrorists]−
The [appeasement]− [emboldened]0 the [terrorists]−
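A schematic sketch of this generation procedure is given below; the two functions standing in for the k-best sequence labeler and the secondary classifier are hypothetical placeholders rather than the interfaces of the actual components.

```python
def generate_hypotheses(sentence, kbest_segmenter, kbest_secondary, k1, k2):
    """Return up to k1 * k2 candidate analyses of a sentence for the reranker."""
    hypotheses = []
    for segmentation in kbest_segmenter(sentence, k1):       # k1-best expression segmentations
        for labeled in kbest_secondary(segmentation, k2):    # k2-best secondary outputs for each
            hypotheses.append(labeled)
    return hypotheses
```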
5.3 Training the Rerankers
In addition to the approximate inference method used to carry out the maximization in Equation (1), we still need a machine learning method to estimate the weight vector w on a training set. We investigated a number of machine learning approaches to train the rerankers.
5.3.1 Structured SVM Learning
We first applied the method of large-margin estimation for structured output spaces, also known as structured support vector machines. In this method, we use quadratic optimization to find the smallest weight vector w that satisfies the constraint that the difference in output score between the correct output y and an incorrect output ŷ should be at least Δ(y, ŷ), where Δ is a loss function based on the degree of error in the output ŷ with respect to the gold standard y. This is a generalization of the well-known support vector machine from binary classification to prediction of structured objects.
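In the margin-rescaling formulation with slack variables used by the SVMstruct implementation (Tsochantaridis et al. 2005), this corresponds to a quadratic program of the following form:

minimize (1/2)‖w‖² + C Σ_i ξ_i over w and ξ ≥ 0,
subject to w · Φ(x_i, y_i) − w · Φ(x_i, ŷ) ≥ Δ(y_i, ŷ) − ξ_i for every training sentence i and every incorrect output ŷ,

where one slack variable ξ_i is associated with each training sentence.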
This optimization problem is usually not solved directly because the number of constraints makes a direct solution intractable for most realistic types of problems. Instead, an approximation has to be used in practice, and we used the SVMstruct software package (Tsochantaridis et al. 2005; Joachims, Finley, and Yu 2009), which finds a solution to the quadratic program by successively finding its most violated constraints and adding them to a working set. We used the default values for the learning parameters, except for the parameter C, which was determined by optimizing on a development set. This resulted in a C value of 500.
We defined the loss function Δ as 1 minus the intersection F-measure defined in Section 7.1. When training rerankers for the complex extraction tasks (expressions + holders or expressions + polarities), we used a weighted combination of F-measures for the primary task (expressions) and the secondary task (holders or polarities, see Sections 7.1.1 and 7.1.2, respectively).
5.3.2 On-line Learning
In addition to the structured SVM learning method, we trained models using two variants of on-line learning. Such learning methods are a feasible alternative for performance reasons. We investigated two on-line learning algorithms: the popular structured perceptron (Collins 2002) and the Passive–Aggressive (PA) algorithm (Crammer et al. 2006). To increase robustness, we used an averaged implementation (Freund and Schapire 1999) of both algorithms.
The difference between the two algorithms is the way the weight vector is incremented in each step. In the perceptron, for a given input x, we compute an update to the current weight vector by computing the difference between the correct output y and the predicted output ŷ. Pseudocode is given in Algorithm 1.
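In the reranking setting, where each training sentence comes with a list of candidate analyses represented as sparse feature vectors, this update can be sketched as follows (weight averaging omitted); the code is an illustration of the algorithm, not the exact implementation used here.

```python
def perceptron_update(w, candidates, gold_index):
    """candidates: list of sparse feature dicts; gold_index: candidate closest to the gold standard."""
    def score(cand):
        return sum(w.get(f, 0.0) * v for f, v in cand.items())
    predicted = max(range(len(candidates)), key=lambda i: score(candidates[i]))
    if predicted != gold_index:
        for f, v in candidates[gold_index].items():   # add the features of the correct candidate
            w[f] = w.get(f, 0.0) + v
        for f, v in candidates[predicted].items():    # subtract the features of the predicted one
            w[f] = w.get(f, 0.0) - v
    return w
```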
The PA algorithm, with pseudocode in Algorithm 2, is based on the theory of large-margin learning similar to the structured SVM. Here we instead base the update step on the ŷ that violates the margin constraints maximally, also taking the loss function Δ into account. The update step length τ is computed based on the margin; this update is bounded by a regularization constant C, which we set to 0.005 after tuning on a development set. The number N of iterations through the training set was 8 for both on-line algorithms.
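The corresponding max-loss PA update can be sketched in the same simplified setting; losses[i] stands for the loss Δ of candidate i with respect to the gold standard, and the default value of C matches the one reported above. Again, this is an illustration rather than the exact implementation.

```python
def pa_update(w, candidates, losses, gold_index, C=0.005):
    """Max-loss Passive-Aggressive update (PA-I) over a candidate list."""
    def score(cand):
        return sum(w.get(f, 0.0) * v for f, v in cand.items())
    # candidate that violates the margin constraint maximally
    viol = max(range(len(candidates)), key=lambda i: score(candidates[i]) + losses[i])
    diff = dict(candidates[gold_index])                 # feature difference: gold minus violator
    for f, v in candidates[viol].items():
        diff[f] = diff.get(f, 0.0) - v
    hinge = score(candidates[viol]) + losses[viol] - score(candidates[gold_index])
    sq_norm = sum(v * v for v in diff.values())
    if hinge > 0 and sq_norm > 0:
        tau = min(C, hinge / sq_norm)                   # step length bounded by C
        for f, v in diff.items():
            w[f] = w.get(f, 0.0) + tau * v
    return w
```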
6. Overview of the Interaction Features
The feature extraction function ΦNL extracts three groups of interaction features: (1) features considering the opinion expressions only; (2) features considering opinion holders; and (3) features considering polarity values.
In addition to the interaction features ΦNL, the rerankers used features representing the scores output by the base models (opinion expression sequence labeler and secondary classifiers); they did not directly use the local features ΦL. We normalized the scores over the k candidates so that their exponentials summed to 1.
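Equivalently, if s_1, …, s_k are the scores assigned by one of the base models to the k candidates of a sentence, the normalized score of candidate i can be written as the log-softmax value s_i − log Σ_j exp(s_j), so that the exponentials of the normalized scores sum to 1.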
6.1 Syntactic and Shallow Semantic Analysis
The features used by the rerankers, as well as the opinion holder extractor in Section 4.2, are to a large extent derived from syntactic and semantic role structures. To extract them, we used the syntactic–semantic parser by Johansson and Nugues (2008), which annotates the sentences with dependency syntax (Mel'čuk 1988) and shallow semantics in the PropBank (Palmer, Gildea, and Kingsbury 2005) and NomBank (Meyers et al. 2004) frameworks, using the format of the CoNLL-2008 Shared Task (Surdeanu et al. 2008). The system includes a sense disambiguator that assigns PropBank or NomBank senses to the predicate words.
Figure 2 shows an example of the structure of the annotation: The sentence HRW denounced the defenseless situation of these prisoners, where denounced is a DSE and defenseless situation is an ESE, has been annotated with dependency syntax (above the text) and semantic role structure (below the text). The predicate denounced, which is an instance of the PropBank frame denounce.01, has two semantic arguments: the Speaker (A0, or Agent in VerbNet terminology) and the Subject (A1, or Theme), which are realized on the surface-syntactic level as a subject and a direct object, respectively. Similarly, situation has the NomBank frame situation.01 and an Experiencer semantic argument (A0).
6.2 Opinion Expression Interaction Features
The rerankers use two types of structural features: syntactic features extracted from the dependency tree, and semantic features extracted from the predicate–argument (semantic role) graph.
The syntactic features are based on paths through the dependency tree. This leads to a minor complication for multiword opinion expressions; we select the shortest possible path in such cases. For instance, in the sentence in Figure 2, the path will be computed between denounced and situation.
We used the following syntactic and shallow-semantic interaction features; a sketch of how the last of them can be extracted is given after the list. All examples refer to Figure 2.
Syntactic Path. Given a pair of opinion expressions, we use a feature representing the labels of the two expressions and the path between them through the syntactic tree, following standard practice from dependency-based semantic role labeling (Johansson and Nugues 2008). For instance, for the DSE denounced and the ESE defenseless situation in Figure 2, we represent the syntactic configuration using the feature DSE:OBJ↓:ESE, meaning that the syntactic relation between the DSE and the ESE consists of a single link representing a grammatical object.
Lexicalized Path. Same as above, but with lexical information attached: DSE/denounced:OBJ↓:ESE/situation.
Dominance. In addition to the features based on syntactic paths, we created a more generic feature template describing dominance relations between expressions. For instance, from the graph in Figure 2, we extract the feature DSE/denounced→ESE/situation, meaning that a DSE with the word denounced dominates an ESE with the word situation.
Predicate Sense Label. For every predicate found inside an opinion expression, we add a feature consisting of the expression label and the predicate sense identifier. For instance, the verb denounced, which is also a DSE, is represented with the feature DSE/denounce.01.
Predicate and Argument Label. For every argument of a predicate inside an opinion expression, we also create a feature representing the predicate–argument pair: DSE/denounce.01:A0.
Connecting Argument Label. When a predicate inside some opinion expression is connected to some argument inside another opinion expression, we use a feature consisting of the two expression labels and the argument label. For instance, the ESE defenseless situation is connected to the DSE denounced via an A1 label, and we represent this using a feature DSE:A1:ESE.
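The sketch below illustrates the extraction of the Connecting Argument Label feature from the predicate–argument links of Figure 2; the tuple encodings of the semantic links and the expression spans are assumptions made only for this illustration.

```python
# semantic links as (predicate token id, argument head token id, role label), cf. Figure 2:
# denounce.01 has A0 = 'HRW' and A1 = 'situation'
SEM_LINKS = [(3, 1, "A0"), (3, 6, "A1")]

# opinion expressions as (label, set of token ids): 'denounced' and 'defenseless situation'
EXPRESSIONS = [("DSE", {3}), ("ESE", {5, 6})]

def connecting_argument_features(expressions, sem_links):
    """Emit LABEL1:ROLE:LABEL2 when a predicate in one expression takes an argument in another."""
    feats = []
    for lab1, span1 in expressions:
        for lab2, span2 in expressions:
            if span1 == span2:
                continue
            for pred, arg, role in sem_links:
                if pred in span1 and arg in span2:
                    feats.append(lab1 + ":" + role + ":" + lab2)
    return feats

print(connecting_argument_features(EXPRESSIONS, SEM_LINKS))   # ['DSE:A1:ESE']
```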
6.3 Opinion Holder Interaction Features
In addition, we modeled the interaction between different opinions with respect to their holders. We used the following two features to represent this interaction:
Shared Holders. A feature representing whether or not two opinion expressions have the same holder. For instance, if a DSE dominates an ESE and they have the same holder as in Figure 2, where the holder is HRW, we represent this by the feature DSE:ESE:true.
Holder types + Path. A feature representing the types of the holders, combined with the syntactic path between the expressions. The types take the following possible values: explicit, implicit, writer. In Figure 2, we would thus extract the feature DSE/Expl:OBJ↓:ESE/Expl.
6.4 Polarity Interaction Features
The model used the following features that take the polarities of the expressions into account. These features are extracted from DSEs and ESEs only, because the OSEs have no polarity values. The examples of extracted features are given with respect to the two opinion expressions (denounced and defenseless situation) in Figure 2, both of which have a negative polarity value.
Polarity pair. For every pair of opinion expressions in the sentence, we create a feature consisting of the pair of polarity values, such as Negative:Negative.
Polarity pair and syntactic path. Negative:OBJ↓:Negative.
Polarity pair and syntactic dominance. Negative→Negative.
Polarity pair and word pair. Negative-denounced:Negative-situation.
Polarity pair and expression types. Adds the expression types (ESE or DSE) to the polarity pair: DSE-Negative:ESE-Negative.
Polarity pair and types and syntactic path. Adds syntactic information to the type and polarity combination: DSE-Negative:OBJ↓:ESE-Negative.
Polarity pair and shallow-semantic relation. When two opinion expressions are directly connected through a link in the shallow-semantic structure, we create a feature based on the semantic role label of the connecting link: Negative:A1:Negative.
Polarity pair and words along syntactic path. We follow the syntactic path between the two expressions and create a feature for every word we pass on the way. In the example, no such feature is extracted because the expressions are directly connected.
7. Experiments
We trained and evaluated the rerankers on version 2.0 of the MPQA corpus, which contains 692 documents. We discarded one document whose annotation was garbled and we split the remaining 691 into a training set (541 documents) and a test set (150 documents). We also set aside a development set of 90 documents from the training set that we used when developing features and tuning learning algorithm parameters; all experiments described in this article, however, used models that were trained on the full training set. Table 1 shows some statistics about the training and test sets: the number of documents and sentences; the number of DSEs, ESEs, and OSEs; and the number of expressions marked with the various polarity labels.
We considered three experimental settings: (1) opinion expression extraction; (2) joint opinion expression and holder extraction; and (3) joint opinion expression and polarity classification. Finally, the polarity-based opinion extraction system was used in an extrinsic evaluation: document polarity classification of movie reviews.
Table 1. Statistics of the training and test sets.

| | Training | Test |
|---|---|---|
| Documents | 541 | 150 |
| Sentences | 12,010 | 3,743 |
| DSE | 8,389 | 2,442 |
| ESE | 10,279 | 3,370 |
| OSE | 3,048 | 704 |
| Positive | 3,192 | 1,049 |
| Negative | 6,093 | 1,675 |
| Neutral | 9,105 | 3,007 |
| Both | 278 | 81 |
To generate the training data for the rerankers, we carried out a 5-fold hold-out procedure: We split the training set into five pieces, trained a sequence labeler and secondary classifiers on pieces 1–4, applied them to piece 5, and so on.
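In outline, this procedure can be sketched as follows; train_base_models and kbest_predict are hypothetical placeholders for training the base sequence labeler together with the secondary classifiers and for producing their k-best outputs on held-out sentences.

```python
def generate_reranker_training_data(documents, k, n_folds=5):
    """Produce k-best hypotheses for every sentence using base models trained on the other folds."""
    # train_base_models and kbest_predict are hypothetical helpers (see text)
    folds = [documents[i::n_folds] for i in range(n_folds)]
    training_data = []
    for i, held_out in enumerate(folds):
        train_docs = [d for j, fold in enumerate(folds) if j != i for d in fold]
        base_models = train_base_models(train_docs)          # sequence labeler + secondary classifiers
        for document in held_out:
            for sentence in document:
                training_data.append(kbest_predict(base_models, sentence, k))
    return training_data
```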
7.1 Evaluation Metrics
Conventionally, when measuring the quality of a system for an information extraction task, a predicted entity is counted as correct if it exactly matches the boundaries of a corresponding entity in the gold standard; there is thus no reward for close matches. Because the boundaries of the spans annotated in the MPQA corpus are not strictly defined in the annotation guidelines (Wiebe, Wilson, and Cardie 2005), however, measuring precision and recall using exact boundary scoring will result in figures that are too low to be indicative of the usefulness of the system. Therefore, most work using this corpus instead uses overlap-based precision and recall measures, where a span is counted as correctly detected if it overlaps with a span in the gold standard (Choi, Breck, and Cardie 2006; Breck, Choi, and Cardie 2007). As pointed out by Breck, Choi, and Cardie (2007), this is problematic because it will tend to reward long spans—for instance, a span covering the whole sentence will always be counted as correct if the gold standard contains any span for that sentence. Conversely, the overlap metric does not give higher credit to a span that is perfectly detected than to one that has a very low overlap with the gold standard.
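The measures used here are instead based on a span coverage function: viewing spans as sets of tokens, c(s, s′) = |s ∩ s′| / |s′| measures the proportion of the span s′ that is covered by the span s. Every predicted span then contributes to the soft precision according to how well it is covered by the gold standard, and every gold-standard span contributes to the soft recall according to how well it is covered by the system output; the soft precision and recall are obtained by dividing the summed contributions by the number of predicted and gold-standard spans, respectively. The same coverage function c is also used for scoring opinion holders in Section 7.1.1.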
The precision and recall measures proposed here correct the problem with overlap-based measures: If the system proposes a span covering the whole sentence, the span coverage will be low and result in a low soft precision, and a low soft recall will be assigned if only a small part of a gold standard span is covered. Note that our measures are bounded below by the exact measures and above by the overlap-based measures.
7.1.1 Opinion Holders
To score the extraction of opinion holders, we started from the same basic idea: Assign a score based on intersection. The evaluation of this task is more complex, however, because (1) we only want to give credit for holders for correctly extracted opinion expressions; (2) the gold standard links opinion expressions to coreference chains rather than individual mentions of holders; and (3) the holder entity may be the writer or implicit (see Section 4.2).
We therefore used the following method: If the system has proposed an opinion expression e and its holder h, we first located the expression e′ in the gold standard that most closely corresponds to e, that is, e′ = arg max_x c(x, e), regardless of the span labels of e and e′. To assign a score to the proposed holder entity, we then selected the most closely corresponding gold standard holder entity h′ in the coreference chain H′ linked to e′: h′ = arg max_{x ∈ H′} c(x, h). Finally, we computed the precision and recall scores using c(h′, h) and c(h, h′). We stress again that the gold standard coreference chains were used for evaluation purposes only, and that our system did not make use of them at test time.
If the system guesses that the holder of some opinion is the writer entity, we score it as perfectly detected (coverage 1) if the coreference chain H′ annotated in the gold standard contains the writer, and as a full error (coverage 0) otherwise; implicit holders are handled similarly.
7.1.2 Polarity
In our experiments involving opinion expressions with polarities, we report precision and recall values for polarity-labeled opinion expression segmentation: In order to be assigned an intersection score above zero, a segment must be labeled with the correct polarity. In the gold standard, all polarity labels were changed as described in Section 4.3. In these evaluations, OSEs were ignored and DSEs and ESEs were not distinguished.
7.2 Experiments in Opinion Expression Extraction
The first task we considered was the extraction of opinion expressions (labeled with expression types). We first studied the impact of the machine learning method and the hypothesis set size on the reranker performance. Then, we carried out an analysis of the effectiveness of the features used by the reranker. We finally compared the performance of the expression extraction system with previous work (Breck, Choi, and Cardie 2007).
7.2.1 Evaluation of Machine Learning Methods
We compared the machine learning methods described in Section 5. In these experiments, we used a hypothesis set size k of 8. All features from Section 6.2 were used. Table 2 shows the results of the evaluations using the precision and recall measures described earlier. The baseline is the result of taking the top-scoring labeling from the base sequence labeler.
We note that the margin-based methods—structured SVM and the on-line PA algorithm—outperform the perceptron soundly, which shows the benefit of learning methods that make use of the cost function Δ. Comparing the two best-performing learning methods, we note that the reranker using the structured SVM is more recall-oriented whereas the PA-based reranker is more precision-oriented; the difference in F-measure is not statistically significant. In the remainder of this article, all rerankers are trained using the PA learning algorithm (with the same parameters) because its training process is much faster than that of the structured SVM.
Table 2. Evaluation of machine learning methods for the opinion expression reranker.

| Learning method | P | R | F |
|---|---|---|---|
| Baseline | 63.4 ± 1.5 | 46.8 ± 1.2 | 53.8 ± 1.1 |
| Structured SVM | 61.8 ± 1.5 | 52.5 ± 1.3 | 56.8 ± 1.1 |
| Perceptron | 62.8 ± 1.5 | 48.1 ± 1.3 | 54.5 ± 1.2 |
| Passive–Aggressive | 63.5 ± 1.5 | 51.8 ± 1.3 | 57.0 ± 1.1 |
7.2.2 Candidate Set Size
In any method based on reranking, it is important to study the influence of the hypothesis set size on the quality of the reranked output. In addition, an interesting question is what the upper bound on reranker performance is—the oracle performance. Table 3 shows the result of an experiment that investigates these questions.
As is common in reranking tasks, the reranker can exploit only a fraction of the potential improvement—the reduction of the F-measure error ranges between 10% and 15% of the oracle error reduction for all hypothesis set sizes.
The most visible effect of the reranker is that the recall is greatly improved. This does not seem to have an adverse effect on the precision, however, until the candidate set size goes above eight—in fact, the precision actually improves over the baseline for small candidate set sizes. After the size goes above eight, the recall (and the F-measure) still rises, but at the cost of decreased precision. In the remainder of this article, we used a k value of 64, which we thought gave a good balance between processing time and performance.
7.2.3 Feature Analysis
We studied the impact of syntactic and semantic structural features on the performance of the reranker. Table 4 shows the result of an investigation of the contribution of the syntactic features. Using all the syntactic features (and no semantic features) gives an F-measure roughly four points above the baseline. We then carried out an ablation test and measured the F-measure obtained when each one of the three syntactic features has been removed. It is clear that the unlexicalized syntactic path is the most important syntactic feature; this feature causes a two-point drop in F-measure when removed, which is clearly statistically significant (p < 0.0001). The effect of the two lexicalized features is smaller, with only Dominance causing a significant (p < 0.05) drop when removed.
Table 4. Ablation of the syntactic interaction features.

| Feature set | P | R | F |
|---|---|---|---|
| Baseline | 63.4 ± 1.5 | 46.8 ± 1.2 | 53.8 ± 1.1 |
| All syntactic features | 62.5 ± 1.4 | 53.2 ± 1.2 | 57.5 ± 1.1 |
| Removed Syntactic Path | 64.4 ± 1.5 | 48.7 ± 1.2 | 55.5 ± 1.1 |
| Removed Lexical Path | 62.6 ± 1.4 | 53.2 ± 1.2 | 57.5 ± 1.1 |
| Removed Dominance | 62.3 ± 1.5 | 52.9 ± 1.2 | 57.2 ± 1.1 |
A similar result was obtained when studying the semantic features (Table 5). Removing the connecting label feature, which is unlexicalized, has a greater effect than removing the other two semantic features, which are lexicalized. Only the connecting label causes a statistically significant drop when removed (p < 0.0001).
Table 5. Ablation of the semantic interaction features.

| Feature set | P | R | F |
|---|---|---|---|
| Baseline | 63.4 ± 1.5 | 46.8 ± 1.2 | 53.8 ± 1.1 |
| All semantic features | 61.3 ± 1.4 | 53.8 ± 1.3 | 57.3 ± 1.1 |
| Removed Predicate Sense Label | 61.3 ± 1.4 | 53.8 ± 1.3 | 57.3 ± 1.1 |
| Removed Predicate+Argument Label | 61.0 ± 1.4 | 53.6 ± 1.3 | 57.0 ± 1.1 |
| Removed Connecting Argument Label | 60.7 ± 1.4 | 50.5 ± 1.2 | 55.1 ± 1.1 |
Because our most effective structural features combine a pair of opinion expression labels with a tree fragment, it is interesting to study whether the expression labels alone would be enough. If this were the case, we could conclude that the improvement is caused not by the structural features, but just by learning which combinations of labels are common in the training set, such as that DSE+ESE would be more common than OSE+ESE. We thus carried out an experiment comparing a reranker using label pair features against rerankers based on syntactic features only, semantic features only, and the full feature set. Table 6 shows the results. We see that the reranker using label pairs indeed achieves a performance well above the baseline. Its performance is below that of any reranker using structural features, however. In addition, we see no improvement when adding label pair features to the structural feature set; this is to be expected because the label pair information is subsumed by the structural features.
Table 6. Comparison of label pair features and structural interaction features.

| Feature set | P | R | F |
|---|---|---|---|
| Baseline | 63.4 ± 1.5 | 46.8 ± 1.2 | 53.8 ± 1.1 |
| Label pairs | 62.0 ± 1.5 | 52.7 ± 1.2 | 57.0 ± 1.1 |
| All syntactic features | 62.5 ± 1.4 | 53.2 ± 1.2 | 57.5 ± 1.1 |
| All semantic features | 61.3 ± 1.4 | 53.8 ± 1.3 | 57.3 ± 1.1 |
| Syntactic + semantic | 61.0 ± 1.4 | 55.7 ± 1.2 | 58.2 ± 1.1 |
| Syntactic + semantic + label pairs | 61.6 ± 1.4 | 54.8 ± 1.3 | 58.0 ± 1.1 |
7.2.4 Analysis of the Performance Depending on Expression Type
In order to better understand the performance details of the expression extraction, we analyzed how well it extracted the three different classes of expressions. Table 7 shows the results of this evaluation. The DSE row in the table thus shows the performance on DSEs, without taking ESEs or OSEs into account.
Apart from evaluations of the three different types of expressions, we evaluated the performance for a number of combined classes that we think may be interesting for applications: DSE & ESE, finding all opinionated expressions and ignoring objective speech events; DSE & OSE, finding opinionated and non-opinionated speech and categorization events and ignoring expressive elements; and unlabeled evaluation of all types of MPQA expressions. The same extraction system was used in all experiments, and it was not retrained to maximize the different measures of performance.
Again, the strongest overall tendency is that the reranker boosts the recall. Going into the details, we see that the reranker gives very large improvements for DSEs and OSEs, but a smaller improvement for the combined DSE & OSE class. This shows that one of the clearest benefits of the complex features is to help disambiguate these expressions. This also affects the performance for general opinionated expressions (DSE & ESE).
7.2.5 Comparison with Breck, Choi, and Cardie (2007)
Comparison of systems in opinion expression detection is often nontrivial because evaluation settings have differed widely. Since our problem setting—marking up and labeling opinion expressions in the MPQA corpus—is most similar to that described by Breck, Choi, and Cardie (2007), we carried out an evaluation using the setting from their experiment.
For compatibility with their experimental set-up, this experiment differed from the ones described in the previous sections in the following ways:
The results were measured using the overlap-based precision and recall, although this is problematic as pointed out in Section 7.1.
The system did not need to distinguish DSEs and ESEs and did not have to detect the OSEs.
Instead of the training/test split used in the previous evaluations, the systems were evaluated using a 10-fold cross-validation over the same set of 400 documents and the same cross-validation split as used in Breck, Choi, and Cardie's experiment. Each of the 10 rerankers was evaluated on one fold and trained on data generated in a cross-validation over the remaining nine folds.
| System | P | R | F |
|---|---|---|---|
| Breck, Choi, and Cardie (2007) | 71.64 | 74.70 | 73.05 |
| Baseline | 86.1 ± 1.0 | 66.7 ± 0.8 | 75.1 ± 0.7 |
| Reranked | 83.4 ± 1.0 | 75.0 ± 0.8 | 79.0 ± 0.6 |
We see that the performance of our reranked system is higher, in both precision and recall, than all results reported by Breck, Choi, and Cardie (2007), with a particularly large margin in precision. Note that our system was optimized for the intersection metric rather than the overlap metric and that we did not retrain it for this evaluation.
7.3 Opinion Holder Extraction
Table 9 shows the performance of our holder extraction systems, evaluated using the scoring method described in Section 7.1.1. We compared the reranker using opinion holder interaction features (Section 6.3) to two baselines. The first consisted of the opinion expression sequence labeler (ES, Section 4.1) and the holder extraction classifier (HC, Section 4.2), without modeling any interactions between opinions. The second and more challenging baseline was implemented by adding the opinion expression reranker (ER), without holder interaction features, to the pipeline. This results in a large performance boost simply as a consequence of improved expression detection, because a correct expression is required to get credit for a holder. However, both baselines are outperformed by the reranker using holder interaction features, which we refer to as the expression/holder reranker (EHR); the differences from the strong baseline in recall and F-measure are both statistically significant (p < 0.0001).
| System | P | R | F |
|---|---|---|---|
| ES+HC | 57.7 ± 1.7 | 45.3 ± 1.3 | 50.8 ± 1.3 |
| ES+ER+HC | 53.3 ± 1.5 | 52.0 ± 1.4 | 52.6 ± 1.3 |
| ES+HC+EHR | 53.2 ± 1.6 | 55.1 ± 1.5 | 54.2 ± 1.4 |
We carried out an ablation test to gauge the impact of the two holder interaction features; we see in Table 10 that both of them contribute to improving the recall, and the effect on the precision is negligible. The statistical significance for the recall improvement is highest for Shared Holders (p < 0.0001) and lower for Holder Types + Path (p < 0.02).
| Feature set | P | R | F |
|---|---|---|---|
| Both features | 53.2 ± 1.6 | 55.1 ± 1.5 | 54.2 ± 1.4 |
| Removed Holder Types + Path | 53.1 ± 1.6 | 54.6 ± 1.5 | 53.8 ± 1.3 |
| Removed Shared Holders | 53.1 ± 1.5 | 53.6 ± 1.5 | 53.3 ± 1.3 |
We omit a comparison with previous work on holder extraction because our formulation of the problem differs from those used in earlier publications. Choi, Breck, and Cardie (2006) used the holders of a simplified set of opinion expressions, whereas Wiegand and Klakow (2010) extracted every entity tagged as “source” in MPQA regardless of whether it was connected to any opinion expression. Neither of them extracted implicit or writer holders.
Table 11 shows a detailed breakdown of the holder extraction results based on opinion expression type (DSE, OSE, and ESE), and whether the holder is internally or externally located; that is, whether or not the holder is textually realized in the same sentence as the opinion expression. In addition, Table 12 shows the performance for the two types of externally located holders.
| Writer | P | R | F |
|---|---|---|---|
| ES+HC | 44.8 ± 3.0 | 42.8 ± 2.6 | 43.8 ± 2.4 |
| ES+ER+HC | 40.6 ± 2.6 | 50.3 ± 2.7 | 44.9 ± 2.3 |
| ES+HC+EHR | 42.7 ± 2.8 | 49.7 ± 2.9 | 45.9 ± 2.4 |

| Implicit | P | R | F |
|---|---|---|---|
| ES+HC | 41.2 ± 6.4 | 28.3 ± 4.8 | 33.6 ± 4.9 |
| ES+ER+HC | 38.7 ± 5.4 | 34.4 ± 5.1 | 36.4 ± 4.7 |
| ES+HC+EHR | 43.1 ± 5.9 | 32.9 ± 5.0 | 37.4 ± 4.8 |
As we noted in previous evaluations, the most obvious change between the baseline system and the reranker is that recall and F-measure are improved; this is the case in every single evaluation. As before, a large share of the improvement is explained simply by improved expression detection, which can be seen by comparing the reranked system to the strong baseline (ES+ER+HC). For the most important cases, however, the reranker with holder interaction features gives a further improvement and significantly outperforms the strong baseline: DSE internal p < 0.001, ESE internal p < 0.001, ESE external p < 0.05 (Table 11), and writer p < 0.05 (Table 12). The only common case where the improvement is not statistically significant is OSE internal.
The improvements are most notable for internally located holders, and especially for the ESEs. Extracting the opinion holder for ESEs is often complex because the expression and the holder are typically not directly connected on the syntactic or shallow-semantic level, as opposed to the typical situation for DSEs. When we use the reranker, however, the interaction features may help us make use of the holders of other opinion expressions in the same sentence; for instance, the interaction features make it easier to distinguish cases like “the film was [awful]ese” with an external (writer) holder from cases such as “I [thought]dse the film was [awful]ese” with an internal holder directly connected to a DSE.
7.4 Polarity Classification
To evaluate the effect of the polarity-based reranker, we carried out experiments to compare it with two baseline systems similarly to the evaluations of holder extraction performance. Table 13 shows the precision, recall, and F-measures. The evaluation used the polarity-based intersection metric (Section 7.1.2). The first baseline consisted of an expression segmenter and a polarity classifier (ES+PC), and the second also included an expression reranker (ER). The reranker using polarity interaction features is referred to as the expression/polarity reranker (EPR).
| System | P | R | F |
|---|---|---|---|
| ES+PC | 56.5 ± 1.7 | 38.4 ± 1.2 | 45.7 ± 1.2 |
| ES+ER+PC | 53.8 ± 1.6 | 44.5 ± 1.3 | 48.8 ± 1.2 |
| ES+PC+EPR | 54.7 ± 1.6 | 45.6 ± 1.3 | 49.7 ± 1.2 |
The results show that the polarity-based reranker gives a significant boost in recall, in line with our previous results, which also mainly improved the recall. The precision shows a slight decrease from the ES+PC baseline, but this drop is much smaller than the gain in recall. The differences between the polarity reranker and the strongest baseline are all statistically significant (precision p < 0.02, recall and F-measure p < 0.005).
In addition, we evaluated the performance for individual polarity values; the figures are shown in Table 14. We see that the performance differences when adding the polarity reranker are concentrated in the more frequent polarity values (Neutral and Negative).
| Positive | P | R | F |
|---|---|---|---|
| ES+PC | 53.5 ± 3.7 | 37.3 ± 3.0 | 43.9 ± 2.8 |
| ES+ER+PC | 50.5 ± 3.4 | 41.8 ± 3.0 | 45.8 ± 2.6 |
| ES+PC+EPR | 51.0 ± 3.5 | 41.6 ± 3.1 | 45.8 ± 2.7 |

| Neutral | P | R | F |
|---|---|---|---|
| ES+PC | 56.4 ± 2.3 | 37.8 ± 1.7 | 45.3 ± 1.7 |
| ES+ER+PC | 54.0 ± 2.1 | 45.2 ± 1.8 | 49.2 ± 1.6 |
| ES+PC+EPR | 55.8 ± 2.1 | 46.1 ± 1.8 | 50.5 ± 1.6 |

| Negative | P | R | F |
|---|---|---|---|
| ES+PC | 58.4 ± 2.8 | 40.1 ± 2.4 | 47.6 ± 2.2 |
| ES+ER+PC | 55.5 ± 2.7 | 45.0 ± 2.3 | 49.7 ± 2.0 |
| ES+PC+EPR | 54.9 ± 2.7 | 47.0 ± 2.4 | 50.6 ± 2.0 |
Finally, we carried out an evaluation in the setting of Choi and Cardie (2010); the figures are shown in Table 15. The table shows our baseline and integrated systems along with the figures reported by Choi and Cardie. Instead of a single value for all polarities, we show the performance for every individual polarity value (Positive, Neutral, Negative). This evaluation uses the overlap metric instead of the intersection-based one; as we have pointed out, we use the overlap metric for compatibility although it is problematic.
| Positive | P | R | F |
|---|---|---|---|
| ES+PC | 59.4 ± 2.6 | 46.1 ± 2.1 | 51.9 ± 2.0 |
| ES+ER+PC | 53.1 ± 2.3 | 50.9 ± 2.2 | 52.0 ± 1.9 |
| ES+PC+EPR | 58.2 ± 2.5 | 49.3 ± 2.2 | 53.4 ± 2.0 |
| ES+PC+EPRp | 63.6 ± 2.8 | 44.9 ± 2.2 | 52.7 ± 2.1 |
| Choi and Cardie (2010) | 67.1 | 31.8 | 43.1 |

| Neutral | P | R | F |
|---|---|---|---|
| ES+PC | 60.9 ± 1.4 | 49.2 ± 1.2 | 54.5 ± 1.0 |
| ES+ER+PC | 55.1 ± 1.2 | 57.7 ± 1.2 | 56.4 ± 1.0 |
| ES+PC+EPR | 60.3 ± 1.3 | 55.8 ± 1.2 | 58.0 ± 1.1 |
| ES+PC+EPRp | 68.3 ± 1.5 | 48.2 ± 1.2 | 56.5 ± 1.2 |
| Choi and Cardie (2010) | 66.6 | 31.9 | 43.1 |

| Negative | P | R | F |
|---|---|---|---|
| ES+PC | 72.1 ± 1.8 | 52.0 ± 1.5 | 60.4 ± 1.4 |
| ES+ER+PC | 65.4 ± 1.7 | 58.2 ± 1.4 | 61.6 ± 1.3 |
| ES+PC+EPR | 67.6 ± 1.7 | 59.9 ± 1.5 | 63.5 ± 1.3 |
| ES+PC+EPRp | 75.4 ± 2.0 | 55.0 ± 1.5 | 63.6 ± 1.4 |
| Choi and Cardie (2010) | 76.2 | 40.4 | 52.8 |
As can be seen from the table, the system by Choi and Cardie (2010) shows a large precision bias despite being optimized with respect to the recall-promoting overlap metric. In recall and F-measure, their system is significantly outperformed for all polarity values by our baseline consisting of a pipeline of opinion expression extraction and polarity classifier. In addition, our joint model clearly outperforms the pipeline. The precision is slightly lower overall, but this is offset by large boosts in recall in all cases.
In order to rule out the hypothesis that our F-measure improvement over the Choi and Cardie system could be caused simply by rebalancing precision and recall, we additionally trained a precision-biased reranker EPRp by changing the loss function Δ (see Section 5.3) from 1 − Fi, where Fi is the intersection F-measure, to a loss that also penalizes low overlap precision Po. When we use this reranker, we achieve almost the same levels of precision as reported by Choi and Cardie, even outperforming their precision for the Neutral polarity value, while the recall values are still far higher. The precision bias causes slight drops in F-measure for the Positive and Neutral polarities.
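The exact form of the combined loss is not reproduced here. Purely as an illustration of what such a precision-biased loss could look like (our own sketch, not the definition used in the experiments), one simple option is to average an F-measure penalty and an overlap-precision penalty:

```latex
% Illustrative sketch only: one possible precision-biased loss combining the
% intersection F-measure F_i and the overlap precision P_o. This is not the
% definition used in the article's experiments.
\Delta_{\mathrm{prec}} = \tfrac{1}{2}\bigl[(1 - F_i) + (1 - P_o)\bigr]
```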
7.5 First Extrinsic Evaluation: Extraction of Evaluations of Product Attributes
As an extrinsic evaluation of the opinion expression extraction system, we evaluated the impact of the expressions on a practical application: extraction of evaluations of attributes from product reviews. We first describe the collection we used and then the implementation of the extractor.
We used the annotated data set of Hu and Liu (2004a, 2004b) for the experiments on extraction of attribute evaluations from product reviews. The collection contains reviews of five products: one DVD player, two cameras, one MP3 player, and one cellular phone. In this data set, every sentence is associated with a set of attribute evaluations. An evaluation consists of an attribute name and an evaluation value between −3 and +3, where −3 means a strongly negative evaluation and +3 a strongly positive one. For instance, the sentence this player boasts a decent size and weight, a relatively-intuitive navigational system that categorizes based on id3 tags, and excellent sound is tagged with the attribute evaluations size +2, weight +2, navigational system +2, sound +2. In this work, we do not make use of the exact value of the evaluation but only its sign. We removed product attribute mentions in the form of anaphoric pronouns referring to entities mentioned in previous sentences; these cases are directly marked in the data set.
7.5.1 Implementation
We considered two problems: (1) extraction of attribute evaluations without taking the polarity into account, and (2) extraction with polarity (positive or negative). The former is modeled as a binary classifier that tags each word in the review (except punctuation) as an evaluation or not, and the latter requires a three-class polarity classifier. For both tasks, we compared three feature sets: a baseline using simple features, a stronger baseline using a lexicon, and finally a system using features derived from opinion expressions.
Similarly to the opinion expression polarity classifier, we implemented the classifiers as SVMs that we trained using Liblinear. For the extraction task without polarities, the best results were obtained using an L2-regularized L2-loss SVM and a C value of 0.1. For the polarity task, we used a multiclass SVM (Crammer and Singer 2001) with the same parameters. To handle the precision/recall tradeoff, we varied the class weighting for the null class.
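As a rough sketch of this configuration (not the authors' actual code), the setup could be expressed with scikit-learn's LinearSVC, which wraps Liblinear; the label names and the concrete class weights below are illustrative assumptions.

```python
# Minimal sketch of the SVM configuration described above, using scikit-learn's
# LinearSVC (a wrapper around Liblinear). Data loading and feature extraction
# are assumed to happen elsewhere; the label names ("O", "EVAL") and the
# concrete class weights are illustrative, not taken from the article.
from sklearn.svm import LinearSVC

# Extraction task: L2-regularized, L2-loss ("squared hinge") SVM with C = 0.1.
# The weight of the null class "O" is varied to trade precision against recall.
extraction_clf = LinearSVC(penalty="l2", loss="squared_hinge", C=0.1,
                           class_weight={"O": 0.5, "EVAL": 1.0})

# Polarity task: multiclass SVM in the Crammer-Singer formulation, same C;
# the same null-class weighting idea applies here as well.
polarity_clf = LinearSVC(multi_class="crammer_singer", C=0.1)
```

Lowering the null-class weight makes the classifier propose more evaluations (higher recall, lower precision), while raising it has the opposite effect, which mirrors the precision/recall trade-off described above.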
The baseline classifier used features based on lexical information (word, POS tag, and lemma) in a window of size 3 around the word under consideration (the focus word). In addition, it had two features representing the overall sentence polarities. To compute the polarities, we trained bag-of-words classifiers following the implementation by Pang, Lee, and Vaithyanathan (2002). Two separate classifiers were used: one for positive and one for negative polarity. Note that these classifiers detect the presence of positive and negative polarity, respectively, so both polarities may be detected in the same sentence. The classifiers were trained on the MPQA corpus, where we counted a sentence as positive if it contained a positive opinion expression with an intensity of at least Medium, and analogously for the negative polarity.
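The sketch below illustrates this kind of window-based feature extraction; the (word, POS, lemma) token representation, the interpretation of the window as three tokens on each side, and the two sentence-polarity scores are our assumptions rather than details taken from the article.

```python
# Sketch of the baseline features: lexical information in a window around the
# focus word plus two sentence-level polarity features. The token format and
# the window interpretation (three tokens on each side) are assumptions.
def baseline_features(tokens, i, sent_pos_score, sent_neg_score):
    """tokens: list of (word, pos, lemma) tuples; i: index of the focus word;
    sent_*_score: outputs of the two MPQA-trained sentence polarity classifiers."""
    feats = {}
    for offset in range(-3, 4):
        j = i + offset
        if 0 <= j < len(tokens):
            word, pos, lemma = tokens[j]
            feats[f"word[{offset}]={word.lower()}"] = 1.0
            feats[f"pos[{offset}]={pos}"] = 1.0
            feats[f"lemma[{offset}]={lemma.lower()}"] = 1.0
    feats["sentence_positive"] = float(sent_pos_score)
    feats["sentence_negative"] = float(sent_neg_score)
    return feats
```

Feature dictionaries of this kind can be converted to sparse vectors with, for example, sklearn.feature_extraction.DictVectorizer before training the SVMs described above.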
7.5.2 Features Using a Sentiment Lexicon
Many previous implementations for several opinion-related tasks make use of sentiment lexicons, so the stronger baseline system used features based on the subjectivity lexicon by Wilson, Wiebe, and Hoffmann (2005), which we previously used for opinion expression segmentation in Section 4.1 and for polarity classification in Section 4.3. We created a classifier using a number of features based on this lexicon.
These features make use of the syntactic and semantic structure of the sentence. In the following examples, we use the sentence The software itself was not so easy to use, presented in Figure 3. In this sentence, consider the focus word software. One word is listed in the lexicon as associated with positive sentiment: easy. The system then extracts the following features:
Sentiment lexicon polarities. For every word in the sentence that is listed in the lexicon, we add a feature. Given the example sentence, we will thus add a feature lex_pol:positive because of the word easy, which is listed as positive in the sentiment lexicon.
Closest previous and following sentiment word. If there are sentiment words before or after the focus word, we add the closest of them to the feature vector. In this case, there is no previous sentiment word, so we only extract following_word:easy.
Syntactic paths to sentiment words. For every sentiment word in the sentence, we extract a syntactic path similar to our previous feature sets. This represents a syntactic pattern describing the relation between the sentiment word and the focus word. For instance, in the example we extract the path SBJ↑PRD↓, representing a copula construction: the word software is connected to the sentiment word easy first up through a subject link and then down through a predicative complement link.
Semantic links to sentiment words. When there is a direct semantic role link between a sentiment word and the focus word, we add a feature for the semantic role label. No such features are extracted in the example sentence: the focus word is an argument, but no sentiment word in the sentence is a predicate. A simplified sketch of how these lexicon-based features could be assembled is given after this list.
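The following sketch makes the lexicon-based features more concrete. The token representation, the lexicon lookup (lemma to prior polarity), and the dep_path helper that encodes a dependency path are assumptions on our part, and the semantic role link features are omitted for brevity.

```python
# Simplified sketch of the lexicon-based features: polarity of listed words,
# closest previous/following sentiment word, and syntactic paths to sentiment
# words. The lexicon (lemma -> prior polarity) and dep_path(j, i), a helper
# returning a string encoding of the dependency path from token j to the
# focus word i, are assumptions; semantic role link features are omitted.
def lexicon_features(tokens, i, lexicon, dep_path):
    """tokens: list of (word, lemma) pairs; i: index of the focus word."""
    feats = {}
    sentiment = [j for j, (_, lemma) in enumerate(tokens) if lemma in lexicon]
    for j in sentiment:
        feats[f"lex_pol:{lexicon[tokens[j][1]]}"] = 1.0   # e.g. lex_pol:positive
        feats[f"lex_path:{dep_path(j, i)}"] = 1.0         # e.g. lex_path:SBJ↑PRD↓
    before = [j for j in sentiment if j < i]
    after = [j for j in sentiment if j > i]
    if before:
        feats[f"previous_word:{tokens[max(before)][0].lower()}"] = 1.0
    if after:
        feats[f"following_word:{tokens[min(after)][0].lower()}"] = 1.0
    return feats
```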
7.5.3 Extended Feature Set Based on MPQA Opinion Expressions
We finally created an extended feature set incorporating features derived from MPQA-style opinion expressions, which we extracted automatically. These features are similar in construction to those extracted by means of the sentiment lexicon. The following list describes the new features, exemplified with the same sentence as above, which contains the negative opinion expression not so easy.
Opinion expression polarities. For every opinion expression extracted by the automatic system, we add a feature representing the polarity of the expression. In the example, we get op_expr:negative.
Closest previous and following opinion expression word. We extract features for the closest words before and after the focus word that are contained in some opinion expression. In the example, there is an expression not so easy after the focus word software, so we get a single feature following_expr:not.
Syntactic paths to opinion expressions. For every opinion expression in the sentence, we extract a path from the expression to the focus word. Because opinion expressions frequently consist of more than one word, we use the shortest path. In this case, we will thus again get SBJ↑PRD↓.
Semantic links to opinion expressions. Finally, we extract a feature whenever there is a direct semantic role link between an opinion expression and the focus word. Again, no features based on the semantic role structure are extracted in the example, since the opinion expression contains no predicate or argument. (A sketch of the shortest-path selection for multi-word expressions is given after this list.)
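Because an opinion expression may span several tokens, the path feature uses the shortest path from any of its tokens to the focus word. A minimal sketch of that selection, reusing the assumed dep_path helper from the previous sketch, is:

```python
# Minimal sketch of the path feature for multi-word opinion expressions: the
# shortest dependency path from any token of the expression to the focus word
# is used. dep_path is the same assumed helper as in the previous sketch.
def expression_path_feature(expression_token_ids, focus, dep_path):
    paths = [dep_path(j, focus) for j in expression_token_ids]
    return "expr_path:" + min(paths, key=len)   # e.g. expr_path:SBJ↑PRD↓
```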
7.5.4 Results
We evaluated the performance of the product attribute evaluation extraction using a 10-fold cross-validation procedure on the whole data set. We evaluated three classifiers: a baseline that did not use the lexicon or the opinion expressions, a classifier that adds the lexicon-based features, and finally the classifier that adds the MPQA opinion expressions. The F-measures are shown in Table 16 for the extraction task, and Figure 4 shows the precision/recall plots. There are clear improvements when adding the lexicon features, but the highest performing system is the one that also used the opinion expression features. The difference between the two top-performing classifiers is statistically significant (p < 0.001). For the extraction task where we also consider the polarities, the difference is even greater: almost three F-measure points.
| Feature representation | Unlabeled | Polarity-labeled |
|---|---|---|
| Baseline | 49.8 ± 2.0 | 39.6 ± 2.0 |
| Lexicon | 53.8 ± 2.0 | 46.2 ± 2.0 |
| Opinion expressions | 54.8 ± 2.0 | 49.0 ± 2.0 |
7.6 Second Extrinsic Evaluation: Document Polarity Classification Experiment
In a second extrinsic evaluation of the opinion expression extractor, we investigated how expression-based features affect the performance of a document-level polarity classifier of reviews as positive or negative. We followed the same evaluation protocol as in the first extrinsic evaluation, where we compare three classifiers of increasing complexity: (1) a baseline using a pure word-based representation, (2) a stronger baseline adding features derived from a sentiment lexicon, and (3) a classifier with features extracted from opinion expressions.
The task of categorizing a full document as positive or negative can be viewed as a document categorization task, and this has led to the application of standard text categorization techniques (Pang, Lee, and Vaithyanathan 2002). We followed this approach and implemented the document polarity classifier as a binary linear SVM; this learning method has a long tradition of successful application in text categorization (Joachims 2002).
For these experiments, we used six collections. The first consisted of English movie reviews extracted from the Web by Pang and Lee (2004); this data set is an extension of a smaller set (Pang, Lee, and Vaithyanathan 2002) that has been used in a large number of experiments. The remaining five were drawn from the product review corpus gathered by Blitzer, Dredze, and Pereira (2007), of which we used five of the largest subsets: reviews of DVDs, software, books, music, and cameras. In each of the six collections, 1,000 documents were labeled by humans as positive and 1,000 as negative.
Following Pang and Lee (2004), the documents were represented as bag-of-word feature vectors based on presence features for individual words. No weighting such as IDF was used. The vectors were normalized to unit length. Again, we trained the SVMs using Liblinear, and the best results were obtained using an L2-regularized L2-loss version of the SVM with a C value of 1.
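A sketch of this set-up is given below; it is our own illustration using scikit-learn (whose LinearSVC wraps Liblinear), and the document and label arguments are placeholders for the review data.

```python
# Sketch of the document polarity classifier: binary (presence) bag-of-words
# features, document vectors normalized to unit length, and an L2-regularized
# L2-loss linear SVM with C = 1, trained via Liblinear (here through
# scikit-learn's LinearSVC). The inputs are placeholders for the review data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize
from sklearn.svm import LinearSVC

def train_document_classifier(documents, labels):
    """documents: list of raw review texts; labels: 0/1 polarity labels."""
    vectorizer = CountVectorizer(binary=True)            # word presence, no IDF weighting
    X = normalize(vectorizer.fit_transform(documents))   # unit-length document vectors
    clf = LinearSVC(penalty="l2", loss="squared_hinge", C=1.0)
    clf.fit(X, labels)
    return vectorizer, clf
```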
7.6.1 Features Based on the Subjectivity Lexicon
The stronger baseline added features based on the subjectivity lexicon by Wilson, Wiebe, and Hoffmann (2005), which we also used for opinion expression segmentation in Section 4.1 and for polarity classification in Section 4.3. For every word whose lemma is listed in the lexicon, we added a feature consisting of the word and its prior polarity and intensity to the bag-of-words feature vector.
The feature examples are taken from the sentence HRW has denounced the defenseless situation of these prisoners, where denounce is listed in the lexicon as strong/negative and prisoner as weak/negative.
Lexicon polarity. negative.
Lexicon polarity and intensity. strong/negative, weak/negative.
Lexicon polarity and word. denounced/negative, prisoners/negative.
7.6.2 Features Extracted from Opinion Expressions
Finally, we created a feature set based on the opinion expressions with their polarities. We give examples from the same sentence; here, denounced is a negative DSE and defenseless situation is a negative ESE. A sketch of how both the lexicon-based and the expression-based features can be appended to the document representation is given after this list.
Expression polarity. negative.
Expression polarity and word. negative/denounced, negative/defenseless, negative/situation.
Expression type and word. DSE/denounced, ESE/defenseless, ESE/situation.
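The combined sketch below shows one way the lexicon- and expression-based features could be appended to the bag-of-words representation as extra pseudo-tokens; the lexicon format and the output format of the expression extractor are assumptions on our part.

```python
# Sketch: lexicon- and expression-based features are added to the document as
# extra pseudo-tokens alongside the ordinary words. The lexicon format
# (lemma -> (intensity, polarity)) and the expression format
# (type, token list, polarity) are assumptions about the surrounding code.
def augment_tokens(words, lexicon, expressions):
    extra = []
    for w in words:
        if w.lower() in lexicon:
            intensity, polarity = lexicon[w.lower()]
            extra += [f"lexpol:{polarity}",
                      f"lexpol:{intensity}/{polarity}",
                      f"lexword:{w.lower()}/{polarity}"]
    for expr_type, expr_words, polarity in expressions:
        extra.append(f"exprpol:{polarity}")
        for w in expr_words:
            extra += [f"exprword:{polarity}/{w.lower()}",
                      f"exprtype:{expr_type}/{w.lower()}"]
    return words + extra
```

The augmented token lists can then be vectorized with, for example, CountVectorizer(analyzer=lambda tokens: tokens, binary=True), so that the extra features enter the same presence-based representation as the ordinary words.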
7.6.3 Evaluation Results
To evaluate the performance of the document polarity classifiers, we carried out a 10-fold cross-validation procedure for every review collection. We evaluated three classifiers: one using only bag-of-words features (“Baseline”); one using features extracted from the subjectivity lexicon (“Lexicon”); and finally one also using the expression-based features (“Expressions”).
In order to abstract away from the tuning of a decision threshold, performance was measured using AUC, the area under the ROC curve. The AUC values are given in Table 17.
| Feature set | Movie | DVD | Software | Books | Music | Cameras |
|---|---|---|---|---|---|---|
| Baseline | 93.1 ± 1.0 | 85.1 ± 1.7 | 91.0 ± 1.3 | 85.7 ± 1.6 | 84.7 ± 1.7 | 91.9 ± 1.2 |
| Lexicon | 93.8 ± 1.0 | 86.6 ± 1.6 | 92.3 ± 1.2 | 87.4 ± 1.5 | 86.6 ± 1.5 | 92.9 ± 1.1 |
| Expressions | 94.7 ± 0.9 | 87.2 ± 1.5 | 92.9 ± 1.1 | 88.1 ± 1.5 | 87.5 ± 1.5 | 93.6 ± 1.1 |
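A sketch of this evaluation protocol (reusing the assumed feature construction from the sketch in the previous subsection; the arguments are placeholders) could look as follows:

```python
# Sketch of the evaluation: 10-fold cross-validation with AUC (area under the
# ROC curve) as the metric, so that no decision threshold needs to be tuned.
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def evaluate_auc(X, labels):
    """X: unit-normalized presence feature matrix; labels: gold 0/1 polarities."""
    clf = LinearSVC(penalty="l2", loss="squared_hinge", C=1.0)
    scores = cross_val_score(clf, X, labels, cv=10, scoring="roc_auc")
    return scores.mean()
```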
These evaluations show that the classifier with features extracted from the opinion expressions significantly outperforms both the classifier using only a bag-of-words representation and the one that adds the lexicon-based features. This demonstrates that extracting and disambiguating opinion expressions in context is useful even for a coarse-grained task such as document polarity classification. The differences in AUC between the two best configurations are statistically significant (p < 0.005 for all six collections). In addition, the precision/recall plots in Figure 5 show that, for all six collections, the expression-based set-up outperforms the other two near the precision/recall break-even point.
The collection where the difference is largest is the movie review set. The main difference between this collection and the others is document length: the average movie review is about four times longer than the documents in the other collections. In addition, movie reviews often contain long sections that are purely factual in nature, mainly plot descriptions. The opinion expression identification can then be seen as a way of highlighting the parts of the document on which the classifier should focus.
8. Conclusion
We have shown that features derived from grammatical and semantic role structure can be used to improve three fundamental tasks in fine-grained opinion analysis: the detection of opinionated expressions, the extraction of opinion holders, and the assignment of polarity labels to opinion expressions. The main idea is to use relational features that describe how opinion expressions interact through linguistic structures such as syntax and shallow semantics. This is not only useful from a practical point of view (improving performance) but also supports our linguistic intuition that surface-level structures such as syntax and shallow semantics take part in encoding the rhetorical organization of the sentence, and that useful information can therefore be extracted from those structures.
Because our feature sets are based on interaction between opinion expressions that can appear anywhere in a sentence, exact inference in this model becomes intractable. To overcome this issue, we used an approximate search strategy based on reranking: In the first step, we used the baseline systems, which use only simple local features, to generate a relatively small hypothesis set; we then applied a classifier using interaction features to pick the final result. A common objection to reranking is that the candidate set may not be diverse enough to allow for much improvement unless it is very large; the candidates may be trivial variations that are all very similar to the top-scoring candidate. Investigating inference methods that take a less brute-force approach than plain reranking is thus another possible future direction. Interesting examples of such inference methods include forest reranking (Huang 2008) and loopy belief propagation (Smith and Eisner 2008). Nevertheless, although the development of such algorithms is a fascinating research problem, it will not necessarily result in a more usable system: Rerankers impose very few restrictions on feature expressivity and make it easy to trade accuracy for efficiency.
We investigated the effect of machine learning features, as well as other design parameters such as the choice of machine learning method and the size of the hypothesis set. For the features, we analyzed the impact of using syntax and semantics and saw that the best models are those making use of both. The most effective features we have found are purely structural, based on tree fragments in a syntactic or semantic tree. Features involving words generally did not seem to have the same impact. Sparsity may certainly be an issue for features defined in terms of tree fragments. Possible future extensions in this area could include bootstrapping methods to mine for meaningful fragments unseen in the training set, or methods to group such features into clusters to reduce the sparsity.
In addition to the core results on fine-grained opinion analysis, we have described experiments demonstrating that features extracted from opinion expressions can be used to improve practical applications: extraction of evaluations of product attributes, and document polarity classification. Although for the first task it may be fairly obvious that a fine-grained analysis of the opinion structure of a sentence is useful, the second result is more unexpected, because document polarity classification is a high-level, coarse-grained task. For both tasks, we saw statistically significant increases in performance compared not only to simple baselines but also to strong baselines using a lexicon of sentiment words. Although the lexicon leads to clear improvements, the best classifiers also used the features extracted from the opinion expressions.
It is remarkable that opinion expressions as defined by the MPQA corpus are useful for practical applications on reviews from several domains, because this corpus mainly consists of news documents related to political topics; this shows that the expression identifier has been able to generalize beyond its original domain. It would still be relevant, however, to apply domain adaptation techniques (Blitzer, Dredze, and Pereira 2007). It could also be interesting to see whether domain-specific opinion word lexicons could improve over the generic lexicon we used here, especially if such lexicons were automatically constructed (Jijkoun, de Rijke, and Weerkamp 2010).
There are multiple additional opportunities for future work in this area. An important issue that we have left open is the coreference problem for holder extraction, which has been studied by Stoyanov and Cardie (2006). Similarly, recent work has tried to incorporate complex, high-level linguistic structure such as discourse representations (Asher, Benamara, and Mathieu 2009; Somasundaran et al. 2009; Zirn et al. 2011); it is clear that these structures are very relevant for explaining the way humans organize their expressions of opinions rhetorically. Theoretical depth does not necessarily guarantee practical applicability, however, and the challenge is as usual to find a middle ground that balances our goals: explanatory power in theory, significant performance gains in practice, computational tractability, and robustness in difficult circumstances.
Acknowledgments
We would like to thank Eric Breck and Yejin Choi for clarifying their results and experimental set-up, and for sharing their cross-validation split. In addition, we are grateful to the anonymous reviewers, whose feedback has helped to improve the clarity and readability of this article.
The research described here has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant 231126: LivingKnowledge—Facts, Opinions and Bias in Time, and under grant 247758: Trustworthy Eternal Systems via Evolving Software, Data and Knowledge (EternalS).
Notes
The MPQA uses the term source but we prefer the term holder because it seems to be more common.
All confidence intervals in this article are at the 95% level and were estimated using a resampling method (Hjorth 1993). The significance tests for differences were carried out using permutation tests.
In addition to polarity, their system also assigned opinion intensity, which we do not consider here.
Confidence intervals for Choi and Cardie (2010) are omitted because we had no access to their output.
References
Author notes
Språkbanken, Department of Swedish, University of Gothenburg, Box 100, SE-40530 Gothenburg, Sweden. E-mail: [email protected]. The work described here was mainly carried out at the University of Trento.
DISI – Department of Information Engineering and Computer Science, University of Trento, Via Sommarive 14, 38123 Trento (TN), Italy. E-mail: [email protected].