Abstract
Uncertainty is an important linguistic phenomenon that is relevant in various Natural Language Processing applications, in genres as diverse as medical, community-generated, newswire, and scientific discourse, and in domains ranging from science to the humanities. The semantic uncertainty of a proposition can in most cases be identified by using a finite dictionary (i.e., lexical cues), and the key steps of uncertainty detection in an application are locating the (genre- and domain-specific) lexical cues, disambiguating them, and linking them with the units of interest for the particular application (e.g., identified events in information extraction). In this study, we focus on the genre and domain differences of the context-dependent semantic uncertainty cue recognition task.
We introduce a unified subcategorization of semantic uncertainty, as applications in different domains may require different uncertainty categories. Based on this categorization, we normalized the annotations of three corpora and present results with a state-of-the-art uncertainty cue recognition model for four fine-grained categories of semantic uncertainty.
Our results reveal the domain and genre dependence of the problem; nevertheless, we also show that even a distant source domain data set can contribute to the recognition and disambiguation of uncertainty cues, efficiently reducing the annotation costs needed to cover a new domain. Thus, the unified subcategorization and domain adaptation for training the models offer an efficient solution for cross-domain and cross-genre semantic uncertainty recognition.
1. Introduction
In computational linguistics, especially in information extraction and retrieval, it is of the utmost importance to distinguish between uncertain statements and factual information. In most cases, what the user needs is factual information, hence uncertain propositions should be treated in a special way: Depending on the exact task, the system should either ignore such texts or separate them from factual information. In machine translation, it is also necessary to identify linguistic cues of uncertainty because the source and the target language may differ in their toolkit to express uncertainty (one language uses an auxiliary, the other uses just a morpheme). To cite another example, in clinical document classification, medical reports can be grouped according to whether the patient definitely suffers, probably suffers, or does not suffer from an illness.
There are several linguistic phenomena that are referred to as uncertainty in the literature. We consider propositions to which no truth value can be attributed, given the speaker's mental state, as instances of semantic uncertainty. In contrast, uncertainty may also arise at the discourse level, when the speaker intentionally omits some information from the statement, making it vague, ambiguous, or misleading. Determining whether a given proposition is uncertain or not may involve using a finite dictionary of linguistic devices (i.e., cues). Lexical cues (such as modal verbs or adverbs) are responsible for semantic uncertainty whereas discourse-level uncertainty may be expressed by lexical cues and syntactic cues (such as passive constructions) as well. We focus on four types of semantic uncertainty in this study and henceforth the term cue will be taken to mean lexical cue.
The key steps of recognizing semantically uncertain propositions in a natural language processing (NLP) application are locating lexical cues for uncertainty, disambiguating them (as not all occurrences of the cues indicate uncertainty), and finally linking them with the textual representation of the propositions in question. The linking of a cue to the textual representation of the proposition can be performed on the basis of syntactic rules that depend on the word class of the lexical cue, but are independent of the actual application domain or text type where the cue is observed. The set of cues used and the frequency of their certain and uncertain usages are domain and genre dependent, however, and this has to be addressed if we seek to craft automatic uncertainty detectors. Here we interpret genre as the basic style and formal characteristics of the writing, independent of its topic (e.g., scientific papers, newswire texts, or business letters), and domain as a particular field of knowledge related to the topic of the text (e.g., medicine, archeology, or politics).
Uncertainty cue candidates do not display uncertainty in all of their occurrences. For instance, the mathematical sense of probable is dominant in mathematical texts, whereas this sense can be relatively rare in papers in the humanities. The frequency of the two distinct meanings of the verb evaluate (which can be a synonym of judge [an uncertain meaning] or of calculate) also differs between the bioinformatics and cell biology domains. Compare:
- (1)
To evaluateCUE the PML/RARalpha role in myelopoiesis, transgenic mice expressing PML/RARalpha were engineered.
- (2)
Our method was evaluated on the Lindahl benchmark for fold recognition.
In this article we focus on the domain-dependent aspects of uncertainty detection and we examine the recognition of uncertainty cues in context. We do not address the problem of linking cues to propositions in detail (see, e.g., Chapman, Chu, and Dowling [2007] and Kilicoglu and Bergler [2009] for the information extraction case).
For the empirical investigation of the domain dependent aspects, data sets are required from various domains. To date, several corpora annotated for uncertainty have been constructed for different genres and domains (BioScope, FactBank, WikiWeasel, and MPQA, to name but a few). These corpora cover different aspects of uncertainty, however, being grounded on different linguistic models, which makes it hard to exploit cross-domain knowledge in applications. These differences in part stem from the varied application needs across application domains. Different types of uncertainty and classes of linguistic expressions are relevant for different domains. Although hypotheses and investigations form a crucial part of the relevant cases in scientific applications, they are less prominent in newswire texts, where beliefs and rumors play a major role. This finding motivates a more fine-grained treatment of uncertainty. In order to bridge the existing gaps between application goals, these typical cases need to be differentiated. A fine-grained categorization enables the individual treatment of each subclass, which is less dependent on domain differences than using one coarse-grained uncertainty class. Moreover, this approach enables each particular application to identify and select from a pool of models only those aspects of uncertainty that are relevant in the specific domain.
As one of the main contributions of this study, we propose a uniform subcategorization of semantic uncertainty in which all the previous corpus annotation works can be placed, and which reveals the fundamental differences between the currently existing resources. In addition, we manually harmonized the annotations of three corpora and performed the fine-grained labeling according to the suggested subcategorization so as to be able to perform cross-domain experiments.
An important factor in training robust cross-domain models is to focus on shallow features that can be reliably obtained for many different domains and text types, and to craft models that exploit the shared knowledge from different sources as much as possible, making the adaptation to new domains efficient. The study of learning efficient models across different domains is the subject of transfer learning and domain adaptation research (cf. Daumé III and Marcu 2006; Pan and Yang 2010). The domain adaptation setting assumes a target domain (for which an accurate model should be learned with a limited amount of labeled training data), a source domain (with characteristics different from the target and for which a substantial amount of labeled data is available), and an arbitrary supervised learning model that exploits both the target and source domain data in order to learn an improved target domain model.
The success of domain adaptation mainly depends on two factors: (i) the similarity of the target and source domains (the two domains should be sufficiently similar to allow knowledge transfer); and (ii) the application of an efficient domain adaptation method (which permits the learning algorithm to exploit the commonalities of the domains while preserving the special characteristics of the target domain).
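To make the setting concrete, the following is a minimal sketch of one simple and widely used domain adaptation strategy (feature augmentation, in which every feature receives a shared and a domain-specific copy). It is shown only as an illustration of the setting described above and is not necessarily the adaptation method evaluated later in this article; the feature names are hypothetical.

```python
def augment(features, domain):
    """features: dict mapping feature names to values; domain: e.g. 'source' or 'target'."""
    augmented = {}
    for name, value in features.items():
        augmented["general:" + name] = value     # shared copy, used by both domains
        augmented[domain + ":" + name] = value   # domain-specific copy
    return augmented

# Source and target instances are mapped into the same augmented feature space,
# and a single supervised model is then trained on their union.
print(augment({"stem=may": 1, "pos=MD": 1}, "target"))
```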
As our second main contribution, we study the impact of domain differences on uncertainty detection, how this impact depends on the distance between the target and source data in terms of domain and genre, and how these differences can be reduced to produce accurate target domain models with limited annotation effort.
Because previously existing resources exhibited fundamental differences that made domain adaptation difficult,1 to our knowledge this is the first study to analyze domain differences and adaptability in the context of uncertainty detection in depth, and also the first study to report consistently positive results in cross-training.
The main contributions of the current paper can be summarized as follows:
We provide a uniform subcategorization of semantic uncertainty (with definitions, examples, and test batteries for annotation) and classify all major previous studies on uncertainty corpus annotation into the proposed categorization system, in order to reveal and analyze the differences.
We provide a harmonized, fine-grained reannotation of three corpora, according to the suggested subcategorization, to allow an in-depth analysis of the domain dependent aspects of uncertainty detection.
We compare the two state-of-the-art approaches to uncertainty cue detection (i.e., the one based on token classification and the one on sequence labeling models), using a shared feature set, in the context of the CoNLL-2010 shared task, to understand their strengths and weaknesses.2
We train an accurate semantic uncertainty detector that distinguishes four fine-grained categories of semantic uncertainty (epistemic, doxastic, investigation, and condition types) and is thus better suited to future applications in various domains than previous models. Our experiments reveal that, similar to the best model of the CoNLL-2010 shared task for biological texts but in a fine-grained context, shallow features provide good results in recognizing semantic uncertainty. We also show that this representation is less suited to detecting discourse-level uncertainty (which was part of the CoNLL task for Wikipedia texts).
We examine in detail the differences between domains and genres as regards the language used to express semantic uncertainty, and learn how the domain or genre distance affects uncertainty recognition in texts with unseen characteristics.
We apply domain adaptation techniques to fully exploit out-of-domain data and minimize annotation costs to adapt to a new domain, and we report successful results for various text domains and genres.
The rest of the paper is structured as follows. In Section 2, our classification of uncertainty phenomena is presented in detail and it is compared with the concept of uncertainty used in existing corpora. A framework for detecting semantic uncertainty is then presented in Section 3. Related work on cue detection is summarized in Section 4, which is followed by a description of our cue recognition system and a presentation of our experimental set-up using various source and target genre and domain pairs for cross-domain learning and domain adaptation in Section 5. Our results are elaborated on in Section 6 with a focus on the effect of domain similarities and on the annotation effort needed to cover a new domain. We then conclude with a summary of our results and make some suggestions for future research.
2. The Phenomenon of Uncertainty
In order to be able to introduce and discuss our data sets, experiments, and findings, we have to clarify our understanding of the term uncertainty. Uncertainty—in its most general sense—can be interpreted as lack of information: The receiver of the information (i.e., the hearer or the reader) cannot be certain about some pieces of information. In this respect, uncertainty differs from both factuality and negation; as regards the former, the hearer/reader is sure that the information is true and as for the latter, he is sure that the information is not true. From the viewpoint of computer science, uncertainty emerges due to partial observability, nondeterminism, or both (Russell and Norvig 2010). Linguistic theories usually associate the notion of modality with uncertainty: Epistemic modality encodes how much certainty or evidence a speaker has for the proposition expressed by his utterance (Palmer 1986) or it refers to a possible state of the world in which the given proposition holds (Kiefer 2005). The common point in these approaches is that in the case of uncertainty, the truth value/reliability of the proposition cannot be decided because some other piece of information is missing. Thus, uncertain propositions are those in our understanding whose truth value or reliability cannot be determined due to lack of information.
In the following, we focus on semantic uncertainty and we suggest a tentative classification of several types of semantic uncertainty. Our classification is grounded on the knowledge of existing corpora and uncertainty recognition tools and our chief goal here is to provide a computational linguistics-oriented classification. With this in mind, our subclasses are intended to be well-defined and easily identifiable by automatic tools. Moreover, this classification allows different applications to choose the subset of phenomena to be recognized in accordance with their main task (i.e., we tried to avoid an overly coarse or fine-grained categorization).
2.1 Classification of Uncertainty Types
Several corpora annotated for uncertainty have been published in different domains such as biology (Medlock and Briscoe 2007; Kim, Ohta, and Tsujii 2008; Settles, Craven, and Friedland 2008; Shatkay et al. 2008; Vincze et al. 2008; Nawaz, Thompson, and Ananiadou 2010), medicine (Uzuner, Zhang, and Sibanda 2009), news media (Rubin, Liddy, and Kando 2005; Wilson 2008; Saurí and Pustejovsky 2009; Rubin 2010), and encyclopedia (Farkas et al. 2010). As can be seen from publicly available annotation guidelines, there are many overlaps but differences as well in the understanding of uncertainty, which is sometimes connected to domain- and genre-specific features of the texts. Here we introduce a domain- and genre-independent classification of several types of semantic uncertainty, which was inspired by both theoretical and computational linguistic considerations.
2.1.1 A Tentative Classification
Based on corpus data and annotation principles, the expression uncertainty can be used as an umbrella term covering phenomena at the semantic and discourse levels.3 Our classification of semantic uncertainty is assumed to be language-independent, but the examples presented here come from English, to keep matters simple.
Semantically uncertain propositions can be defined in terms of truth conditional semantics. They cannot be assigned a truth value (i.e., it cannot be stated for sure whether they are true or false) given the speaker's current mental state.
Semantic level uncertainty can be subcategorized into epistemic and hypothetical (see Figure 1). The main difference between epistemic and hypothetical uncertainty is that whereas instances of hypothetical uncertainty can be true, false or uncertain, epistemically uncertain propositions are definitely uncertain—in terms of possible worlds, hypothetical propositions allow that the proposition can be false in the actual world, but in the case of epistemic uncertainty the factuality of the proposition is not known.
In the case of epistemic uncertainty, it is known that the proposition is neither true nor false: It describes a possible world where the proposition holds but this possible world does not coincide with the speaker's actual world. In other words, it is certain that the proposition is uncertain. Epistemic uncertainty is related to epistemic modality: a sentence is epistemically uncertain if on the basis of our world knowledge we cannot decide at the moment whether it is true or false (hence the name) (Kiefer 2005). The source of an epistemically uncertain proposition cannot claim the uncertain proposition and be sure about its opposite at the same time.
- (3)
Epistemic: It may be raining.
As for hypothetical uncertainty, the truth value of the propositions cannot be determined either and nothing can be said about the probability of their happening. Propositions under investigation are an example of such statements: Until further analysis, the truth value of the proposition in question cannot be stated. Conditionals can also be classified as instances of hypotheses. What these two types of uncertain propositions also have in common is that the speaker can utter them while it is certain (for others or even for him) that their opposite holds; hence they can be called instances of paradoxical uncertainty.
Hypothetical uncertainty is connected with non-epistemic types of modality as well. Doxastic modality expresses the speaker's beliefs—which may be known to be true or false by others in the current state of the world. Necessity (duties, obligation, orders) is the main objective of deontic modality; dispositional modality is determined by the dispositions (i.e., physical abilities) of the person involved; and circumstantial modality is defined by external circumstances. Buletic modality is related to wishes, intentions, plans, and desires. An umbrella term for deontic, dispositional, circumstantial, and buletic modality is dynamic modality (Kiefer 2005).
Hypothetical:
- (4)
Dynamic: I have to go.
- (5)
Doxastic: He believes that the Earth is flat.
- (6)
Investigation: We examined the role of NF-kappa B in protein activation.
- (7)
Condition: If it rains, we'll stay in.
Conditions and instances of dynamic modality are related to the future: In the future, they may happen but at the moment it is not clear whether they will take place or not / whether they are true, false, or uncertain.
2.1.2 Comparison with other Classifications
The feasibility of the classification proposed in this study can be justified by mapping the annotation schemes used in other existing corpora to our subcategorizations of uncertainty. This systematic comparison also highlights the major differences between existing works and partly explains why examples of successful cross-domain application of existing resources and models are hard to find in the literature.
Most of the annotations found in biomedical corpora (Medlock and Briscoe 2007; Settles, Craven, and Friedland 2008; Shatkay et al. 2008; Thompson et al. 2008; Nawaz, Thompson, and Ananiadou 2010) fall into the epistemic uncertainty class. BioScope (Vincze et al. 2008) annotations mostly belong to the epistemic uncertainty category, with the exception of clausal hypotheses (i.e., hypotheses that are expressed by a clause headed by if or whether), which are instances of the investigation class. The probable class of Genia Event (Kim, Ohta, and Tsujii 2008) is of the epistemically uncertain type and the doubtful class belongs to the investigation class. Rubin, Liddy, and Kando (2005) consider uncertainty as a phenomenon belonging to epistemic modality: The high, moderate, and low levels of certainty coincide with our epistemic uncertainty category. The speculation annotations of the MPQA corpus also belong to the epistemic uncertainty class, with four levels (Wilson 2008). The probable and possible classes found in FactBank (Saurí and Pustejovsky 2009) are of the epistemically uncertain type, events with a generic source belong to discourse-level uncertainty, whereas underspecified events are classified as hypothetical uncertainty in our system as, by definition, their truth value cannot be determined. WikiWeasel (Farkas et al. 2010) contains annotation for epistemic uncertainty, but discourse-level uncertainty is also annotated in the corpus (see Figure 1 for an overview). The categories used for the machine reading task described in Morante and Daelemans (2011) also overlap with our fine-grained classes: Uncertain events in their system fall into our epistemic uncertainty class. Their modal events expressing purpose, need, obligation, or desire are instances of dynamic modality, whereas their conditions are understood in a similar way to our condition class. The modality types listed in Baker et al. (2010) can be classified as types of dynamic modality, except for their belief category. Instances of the latter category are either certain (It is certain that he met the president) or epistemic or doxastic modality in our system.
2.2 Types of Semantic Uncertainty Cues
We assume that the nature of the lexical unit determines the type of uncertainty it represents, that is, semantic uncertainty is highly lexical in nature. The part of speech of the uncertainty cue candidates serves as the basis for categorization, similar to the categorizations found in Hyland (1994, 1996, 1998) and Rizomilioti (2006). In English, modality is often associated with modal auxiliaries (Palmer 1979), but, as Table 1 shows, there are many other parts of speech that can express uncertainty. It should be added that for some cues it is the context, rather than the lexical item itself, that determines which subclass of uncertainty the cue refers to: For example, may can denote epistemic modality (It may rain...) or dynamic modality (Now you may open the door). These categories are listed in Table 1.
Table 1: Types of semantic uncertainty cues by part of speech.

| Part of speech | Examples | Uncertainty type |
|---|---|---|
| Adjectives / adverbs | probable, likely, possible, unsure, possibly, perhaps, etc. | epistemic |
| Auxiliaries | may, might, can, would, should, could, etc. | semantic |
| Verbs (speculative) | suggest, question, seem, appear, favor, etc. | epistemic |
| Verbs (psych) | think, believe, etc. | doxastic |
| Verbs (analytic) | investigate, analyze, examine, etc. | investigation |
| Verbs (prospective) | plan, want, order, allow, etc. | dynamic |
| Conjunctions | if, whether, etc. | investigation |
| Nouns derived from uncertain verbs | speculation, proposal, consideration, etc. | same as the verb |
| Other uncertain nouns | rumor, idea, etc. | doxastic |
3. A Framework for Detecting Semantic Uncertainty
In our model, uncertainty detection is a standalone task that is largely independent of the underlying application. In this section, we briefly discuss how uncertainty detection can be incorporated into an information extraction task, which is probably the most relevant application area (see Kim et al. [2009] for more details). In the information extraction context, the key steps of recognizing uncertain propositions are locating the cues, disambiguating them (as not all occurrences of the cues indicate uncertainty; recall the example of evaluate), and finally linking them with the textual representation of the propositions in question. We note here that marking the textual representations of important propositions (often referred to as events in information extraction) is actually the main goal of an information extraction system, hence we will not focus on their identification and just assume that they are already marked in texts.
The following is an example that demonstrates the process of uncertainty detection:
- (8)
In this study we hypothesizedCUE that the phosphorylation of TRAF2 inhibitsEVENT binding to the CD40 cytoplasmic domain.
3.1 Uncertainty Cue Detection and Disambiguation
The cue detection and disambiguation problem can be essentially regarded as a token labeling problem. Here the task is to assign a label to each of the tokens of a sentence in question according to whether it is the starting token of an uncertainty cue (B-CUE_TYPE), an inside token of a cue (I-CUE_TYPE), or it is not part of any cue (O). Most previous studies assume a binary classification task, namely, each token is either part of an uncertainty cue, or it is not a cue. For fine-grained uncertainty detection, a different label has to be used for each uncertainty type to be distinguished. This way, the label sequence of a sentence naturally identifies all uncertainty cues (with their types) in the sentence, and disambiguation is solved jointly with recognition.
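As an illustration, the following is a minimal sketch of this fine-grained BIO encoding; the sentence, span indices, and label names are illustrative assumptions and are not taken verbatim from the annotated corpora.

```python
def bio_encode(tokens, cues):
    """tokens: list of words; cues: list of (start, end, type) token spans, end exclusive."""
    labels = ["O"] * len(tokens)
    for start, end, cue_type in cues:
        labels[start] = "B-" + cue_type
        for i in range(start + 1, end):
            labels[i] = "I-" + cue_type
    return labels

tokens = "We examined the role of NF-kappa B in protein activation .".split()
# 'examined' (token 1) is assumed to be an investigation-type cue here.
print(bio_encode(tokens, [(1, 2, "INVESTIGATION")]))
# ['O', 'B-INVESTIGATION', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
```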
Because the uncertainty cue vocabulary and the distribution of certain and uncertain senses of cues vary in different domains and genres, uncertainty cue detection and disambiguation are the main focus of the current study.
3.2 Linking Uncertainty Cues to Propositions
The task of linking the detected uncertainty cues to propositions can be formulated as a binary classification task over uncertainty cue and event marker pairs. The relation holds and is considered true if the cue modifies the truth value (confidence) of the event; it does not hold and is considered false if the cue does not have any impact on the interpretation of the event. That is, the pair (hypothesized, inhibits) in Example (8) is an instance of positive relation.
The linking of uncertainty cues and event markers can be established by using dependency grammar rules (i.e., the problem is mainly syntax driven). As the grammatical properties of the language are similar in various domains and genres, this task is largely domain-independent, as opposed to the recognition and disambiguation task. Because of this, we sketch the most important matching patterns, but do not address the linking task in great detail here.
The following are the characteristic rules that can be used to link uncertainty cues to event markers (a code sketch of these rules follows the list). For practical implementations of heuristic cue/event matching, see Chapman, Chu, and Dowling (2007) and Kilicoglu and Bergler (2009).
If the event marker has an uncertain verb, noun, preposition, or auxiliary as a (not necessarily direct) parent in the dependency graph of the sentence, the event is regarded as uncertain.
If the event marker has an uncertain adverb or adjective as its child, it is treated as uncertain.
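The following is a minimal sketch of the two linking rules above, assuming a well-formed dependency tree given as child-to-parent indices and coarse, Universal-style part-of-speech tags for the cue; the data representation and tag names are illustrative assumptions rather than the representation used in our experiments.

```python
def is_event_uncertain(event_idx, cue_idx, cue_pos, heads):
    """heads[i] is the index of token i's parent in the dependency tree (-1 for the root)."""
    if cue_pos in {"VERB", "NOUN", "ADP", "AUX"}:
        # Rule 1: the cue is a (not necessarily direct) ancestor of the event marker.
        node = heads[event_idx]
        while node != -1:
            if node == cue_idx:
                return True
            node = heads[node]
    if cue_pos in {"ADV", "ADJ"}:
        # Rule 2: the cue is a direct child (modifier) of the event marker.
        return heads[cue_idx] == event_idx
    return False

# Toy tree in the spirit of Example (8):
# 0:we  1:hypothesized (root)  2:that  3:phosphorylation  4:inhibits  5:binding
heads = [1, -1, 4, 4, 1, 4]
print(is_event_uncertain(event_idx=4, cue_idx=1, cue_pos="VERB", heads=heads))  # True
```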
4. Related Work on Uncertainty Cue Detection
Here we review the published works related to uncertainty cue detection. Earlier studies focused either on in-domain cue recognition for a single domain or on cue lexicon extraction from large corpora. The latter approach is applicable to multiple domains, but does not address the disambiguation of uncertain and other meanings of the extracted cue words. We are also aware of several studies that discussed the differences of cue distributions in various domains, without developing a cue detector. To the best of our knowledge, our study is the first to address the genre- and domain-adaptability of uncertainty cue recognition systems and thus uncertainty detection in a general context.
We should add that there are plenty of studies on end-application oriented uncertainty detection, that is, how to utilize the recognized cues (see, for instance, Kilicoglu and Bergler [2008], Uzuner, Zhang, and Sibanda [2009] and Saurí [2008] for information extraction or Farkas and Szarvas [2008] for document labeling applications), and a recent pilot task sought to exploit negation and hedge cue detectors in machine reading (Morante and Daelemans 2011). As the focus of our paper is cue recognition, however, we omit their detailed description here.
4.1 In-Domain Cue Detection
In-domain uncertainty detectors have been developed since the mid 1990s. Most of these systems use hand-crafted lexicons for cue recognition and they treat each occurrence of the lexicon items as a cue—that is, they do not address the problem of disambiguating cues (Friedman et al. 1994; Light, Qiu, and Srinivasan 2004; Farkas and Szarvas 2008; Saurí 2008; Conway, Doan, and Collier 2009; Van Landeghem et al. 2009). ConText (Chapman, Chu, and Dowling 2007) uses regular expressions to define cues and “pseudo-triggers”. A pseudo-trigger is a superstring of a cue and it is basically used for recognizing contexts where a cue does not imply uncertainty (i.e., it can be regarded as a hand-crafted cue disambiguation module). MacKinlay, Martinez, and Baldwin (2009) introduced a system which also used non-consecutive tokens as cues (like not+as+yet).
Utilizing manually labeled corpora, machine learning–based uncertainty cue detectors have also been developed (to the best of our knowledge each of them uses an in-domain training data set). They use token classification (Morante and Daelemans 2009; Clausen 2010; Fernandes, Crestana, and Milidiú 2010; Sánchez, Li, and Vogel 2010) or sequence labeling approaches (Li et al. 2010; Rei and Briscoe 2010; Tang et al. 2010; Zhang et al. 2010). In both cases the tokens are labeled according to whether they are part of a cue. The latter assigns a label sequence to a sentence (a sequence of tokens), and thus it naturally deals with the context of a particular word. On the other hand, context information for a token is built into the feature space of the token classification approaches. Özgür and Radev (2009) and Velldal (2010) match cues from a lexicon and then apply a binary classifier based on features describing the context of the cue candidate.
Each of these approaches uses a rich feature representation for tokens, which usually includes surface-level, part-of-speech, and chunk-level features. A few systems have also used dependency relation types originating at the cue (Rei and Briscoe 2010; Sánchez, Li, and Vogel 2010; Velldal, Øvrelid, and Oepen 2010; Zhang et al. 2010); the CoNLL-2010 Shared Task final ranking suggests that it has only a limited impact on the performance of an entire system (Farkas et al. 2010), however. Özgür and Radev (2009) further extended the feature set with the other cues that occur in the same sentence as the cue, and positional features such as the section header of the article in which the cue occurs (the latter is only defined for scientific publications). Velldal (2010) argues that the dimensionality of the uncertainty cue detection feature space is too high and reports improvements by using the sparse random indexing technique.
Ganter and Strube (2009) proposed a rather different approach for (weasel) cue detection—exploiting weasel tags4 in Wikipedia articles given by editors. They used syntax-based patterns to recognize the internal structure of the cues, which has proved useful as discourse-level uncertainty cues are usually long and have a complex internal structure (as opposed to semantic uncertainty cues).
As can be seen, uncertainty cue detectors have mostly been developed in the biological and medical domains. All of these studies, however, focus on only one domain, that is, in-domain cue detection is carried out, which assumes the availability of a training data set of sufficient size. The only exception we are aware of is the CoNLL-2010 Shared Task (Farkas et al. 2010), where participants had the chance to use Wikipedia data for the biomedical domain and vice versa. Probably due to the differences in the annotated uncertainty types and the stylistic and topical characteristics of the texts, very few participants performed cross-domain experiments, and they reported only limited success (see Section 5.3.2 for more on this).
Overall, the findings of these studies indicate that disambiguating cue candidates is an important aspect of uncertainty detection and that the domain specificity of disambiguation models and domain adaptation in general are largely unexplored problems in uncertainty detection.
4.2 Weakly Supervised Extraction of Cue Lexicon
Similar to our approach, several studies have addressed the problem of developing an uncertainty detector for a new domain using as little annotation effort as possible. The aim of these studies is to identify uncertain sentences; this is carried out by semi-automatic construction of cue lexicons. The weakly supervised approaches start with very small seed sets of annotated certain and uncertain sentences, and use bootstrapping to induce a suitable training corpus in an automatic way. Such approaches collect potentially certain and uncertain sentences from a large unlabeled pool based on their similarity to the instances in the seed sets (Medlock and Briscoe 2007), or based on the known errors of an information extraction system that is itself sensitive to uncertain texts (Szarvas 2008). Further instances are then collected (in an iterative fashion) on the basis of their similarity to the current training instances. Based on the observation that uncertain sentences tend to contain more than one uncertainty cue, these models successfully extend the seed sets with automatically labeled sentences, and can produce an uncertainty classifier with a sentence-level F-score of 60–80% for the uncertain class, given that the texts of the seed examples, the unlabeled pool, and the actual evaluation data share very similar properties.
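The following is a minimal sketch of such a bootstrapping loop, in the spirit of the approaches cited above rather than a reimplementation of any particular system; the classifier, the bag-of-words similarity model, and the selection threshold are assumptions, and only the uncertain seed set is grown here for brevity.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def bootstrap(seed_uncertain, seed_certain, unlabeled_pool, iterations=5, top_k=100):
    """All arguments are lists of plain sentence strings."""
    pos, neg, pool = list(seed_uncertain), list(seed_certain), list(unlabeled_pool)
    for _ in range(iterations):
        vectorizer = TfidfVectorizer()
        X = vectorizer.fit_transform(pos + neg)
        y = [1] * len(pos) + [0] * len(neg)
        clf = LogisticRegression(max_iter=1000).fit(X, y)
        if not pool:
            break
        # Rank the unlabeled sentences by their predicted probability of being uncertain.
        probs = clf.predict_proba(vectorizer.transform(pool))[:, 1]
        ranked = sorted(range(len(pool)), key=lambda i: probs[i], reverse=True)
        selected = set(ranked[:top_k])
        # Move the most confidently uncertain sentences into the training set.
        pos.extend(pool[i] for i in selected)
        pool = [s for j, s in enumerate(pool) if j not in selected]
    return clf, vectorizer
```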
Szarvas (2008) showed that these models essentially learn the uncertainty lexicon (set of cues) of the given domain, but are otherwise unable to disambiguate the potential cue words—that is, to distinguish between the uncertain and certain uses of the previously seen cues. This deficiency of the derived models is inherent to the bootstrapping process, which considers all occurrences of the cue candidates as good candidates for positive examples (as opposed to unlabeled sentences without any previously seen cue words).
Kilicoglu and Bergler (2008) proposed a semi-automatic method to expand a seed cue lexicon. Their linguistically motivated approach is also based on the weakly supervised induction of a corpus of uncertain sentences. It exploits the syntactic patterns of uncertain sentences to identify new cue candidates.
The previous studies on weakly supervised approaches to uncertainty detection did not tackle the problem of disambiguating the certain and uncertain uses of cue candidates, which is a major drawback from a practical point of view.
4.3 Cue Distribution Analyses
Besides automatic uncertainty recognition, several studies investigated the distribution of hedge cues in scientific papers from different domains (Hyland 1998; Falahati 2006; Rizomilioti 2006). The effect of different domains on the frequency of uncertain expressions was examined in Rizomilioti (2006). Based on a previously defined dictionary of hedge cues, she analyzed the linguistic tools expressing epistemic modality in research papers from three domains, namely, archeology, literary criticism, and biology. Her results indicated that archaeological papers tend to contain the most uncertainty cues (which she calls downtoners) and the fewest uncertainty cues can be found in literary criticism papers. Different academic disciplines were contrasted in Hyland (1998) from the viewpoint of hedging: Papers belonging to the humanities contain significantly more hedging devices than papers in sciences. It is interesting to note, however, that in both studies, biological papers are situated in the middle as far as the percentage rate of uncertainty cues is concerned. Falahati (2006) examined hedges in research articles in medicine, chemistry, and psychology and concluded that it is psychology articles that contain the most hedges.
Overall, these studies demonstrate that there are substantial differences in the way different technical/scientific domains and different genres express uncertainty in general, and in the use of semantic uncertainty in particular. Differences are found not just in the use of different vocabulary for expressing uncertainty, but also in the frequency of certain and uncertain usage of particular uncertainty cues. These findings underpin the practical importance of domain portability and domain adaptation of uncertainty detectors.
5. Uncertainty Cue Recognition
In this section, we present our uncertainty cue detector and the results of the cross-genre and cross-domain experiments we carried out. Before describing our model and discussing the results of the experiments, a short overview of the texts used as training and test data sets will be given, along with an empirical analysis of the sense distributions of the most frequent cues.
5.1 Data Sets
In our investigations, we selected three corpora (i.e., BioScope, WikiWeasel, and FactBank) from different domains (biomedical, encyclopedia, and newswire, respectively). Genres also vary in the corpora (in the scientific genre, there are papers and abstracts whereas the other corpora contain pieces of news and encyclopedia articles). We preferred corpora on which earlier experiments had been carried out because this allowed us to compare our results with those of previous studies. This selection makes it possible to investigate domain and genre differences because each domain has its characteristic language use (which might result in differences in cue distribution) and different genres also require different writing strategies (e.g., in abstracts, implications of experimental results are often emphasized, which usually involves the use of uncertain language).
The BioScope corpus (Vincze et al. 2008) contains clinical texts as well as biological texts from full papers and scientific abstracts; the texts were manually annotated for hedge cues and their scopes. In our experiments, 15 other papers annotated for the CoNLL-2010 Shared Task (Farkas et al. 2010) were also added to the set of BioScope papers. The WikiWeasel corpus (Farkas et al. 2010) was also used in the CoNLL-2010 Shared Task and it was manually annotated for weasel cues and semantic uncertainty in randomly selected paragraphs taken from Wikipedia articles. The FactBank corpus contains texts from the newswire domain (Saurí and Pustejovsky 2009). Events are annotated in the data set and they are evaluated on the basis of their factuality from the viewpoint of their sources.
Table 2 provides statistical data on the three corpora. Because in our experimental set-up, texts belonging to different genres also play an important role, data on abstracts and papers are included separately.
Table 2: Statistics on the corpora: number of sentences and of uncertainty cues per class.

| Data Set | #sent. | #epist. | #dox. | #inv. | #cond. | Total |
|---|---|---|---|---|---|---|
| BioScope papers | 7676 | 1373 | 220 | 295 | 187 | 2075 |
| BioScope abstracts | 11797 | 2478 | 200 | 784 | 24 | 3486 |
| BioScope total | 19473 | 3851 | 420 | 1079 | 211 | 5561 |
| WikiWeasel | 20756 | 1171 | 909 | 94 | 491 | 3265 |
| FactBank | 3123 | 305 | 201 | 36 | 178 | 720 |
| Total | 43352 | 5927 | 1530 | 1209 | 880 | 9546 |
5.1.1 Genres and Domains
Texts found in the three corpora to be investigated can be categorized into three genres, which can be further divided to subgenres at a finer level of distinction. Figure 2 depicts this classification.
The majority of BioScope texts (papers and abstracts) belong to the scientific discourse genre. FactBank texts can be divided into broadcast and written news, and Wikipedia texts belong to the encyclopedia genre.
As for the domain of the texts, there are three broad domains, namely, biology, news, and encyclopedia. Once again, these domains can be further divided into narrower topics at a fine-grained level, which is shown in Figure 3. All abstracts and five papers in BioScope are related to the MeSH terms human, blood cell, and transcription factor (hbc in Figure 3). Nine BMC Bioinformatics papers come from the bioinformatics domain (bmc in Figure 3), and ten papers describe some experimental results on the Drosophila species (fly). FactBank news can be classified as stock news, political news, and criminal news. Encyclopedia articles cover a broad range of topics, hence no detailed classification is given here.
5.1.2 The Normalization of the Corpora
In order to uniformly evaluate our methods in each domain and genre (and each corpus), the evaluation data sets were normalized. This meant that cues had to be annotated in each data set and differentiated for types of semantic uncertainty. This resulted in the reannotation of BioScope, WikiWeasel, and FactBank.5 In BioScope, the originally annotated cues were separated into epistemic cues and subtypes of hypothetical cues and instances of hypothetical uncertainty not yet marked were also annotated. In FactBank, epistemic and hypothetical cues were annotated: Uncertain events were matched with their uncertainty cues and instances of hypothetical uncertainty that were originally not annotated were also marked in the corpus. In the case of WikiWeasel, these two types of cues were separated from discourse-level cues.
One class of hypothetical uncertainty (i.e., dynamic modality) was not annotated in any of the corpora. Although dynamic modality seems to play a role in the news domain, it is less important and less represented in the other two domains we investigated here. The other subclasses are of more general interest for applications. For example, one of our training corpora comes from the scientific domain, where it is more important to distinguish facts from hypotheses and propositions under investigation (which can be later confirmed or rejected; compare the meta-knowledge annotation scheme developed for biological events [Nawaz, Thompson, and Ananiadou 2010]), and from propositions that depend on each other (conditions).
5.1.3 Uncertainty Cues in the Corpora
An analysis of the cue distributions reveals some interesting trends that can be exploited in uncertainty detection across domains and genres. The most frequent cue stems in the (sub)corpora used in our study can be seen in Table 3 and they are responsible for about 74% of epistemic cue occurrences, 55% of doxastic cue occurrences, 70% of investigation cue occurrences, and 91% of condition cue occurrences.
Table 3: The most frequent cue stems per uncertainty class in each (sub)corpus, with their frequencies.

| Class | Global cue | freq. | Abstracts cue | freq. | Full papers cue | freq. | BioScope cue | freq. | FactBank cue | freq. | WikiWeasel cue | freq. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Epist. | may | 1508 | suggest | 616 | may | 228 | suggest | 810 | may | 43 | may | 721 |
| | suggest | 928 | may | 516 | suggest | 194 | may | 744 | could | 29 | probable | 112 |
| | indicate | 421 | indicate | 301 | indicate | 103 | indicate | 404 | possible | 26 | suggest | 108 |
| | possible | 304 | appear | 143 | possible | 84 | appear | 213 | likely | 24 | possible | 93 |
| | appear | 260 | or | 119 | might | 83 | or | 197 | might | 23 | likely | 80 |
| | might | 256 | possible | 101 | or | 78 | possible | 185 | appear | 15 | might | 78 |
| | likely | 221 | might | 72 | can | 73 | might | 155 | seem | 11 | seem | 67 |
| | or | 198 | potential | 72 | appear | 70 | can | 117 | potential | 10 | could | 55 |
| | could | 196 | likely | 60 | likely | 57 | likely | 117 | probable | 10 | perhaps | 51 |
| | probable | 157 | could | 56 | could | 56 | could | 112 | suggest | 10 | appear | 32 |
| Dox. | consider | 276 | putative | 43 | putative | 37 | putative | 80 | expect | 75 | consider | 250 |
| | believe | 222 | think | 43 | hypothesis | 33 | hypothesis | 77 | believe | 25 | believe | 173 |
| | expect | 136 | hypothesis | 43 | assume | 24 | think | 66 | think | 24 | allege | 81 |
| | think | 131 | believe | 14 | think | 24 | assume | 32 | allege | 8 | think | 61 |
| | putative | 83 | consider | 10 | expect | 22 | predict | 26 | accuse | 7 | regard | 58 |
| Invest. | whether | 247 | investigate | 177 | whether | 73 | investigate | 221 | whether | 26 | whether | 52 |
| | investigate | 222 | examine | 160 | investigate | 44 | examine | 183 | if | 3 | if | 20 |
| | examine | 183 | whether | 96 | test | 25 | whether | 169 | remain to be seen | 2 | whether or not | 7 |
| | study | 102 | study | 88 | examine | 23 | study | 101 | question | 1 | assess | 3 |
| | determine | 90 | determine | 67 | determine | 20 | determine | 87 | determine | 1 | evaluate | 3 |
| Cond. | if | 418 | if | 14 | if | 85 | if | 99 | if | 65 | if | 254 |
| | would | 238 | would | 6 | would | 46 | would | 52 | would | 50 | would | 136 |
| | will | 80 | until | 2 | will | 20 | will | 20 | will | 21 | will | 39 |
| | until | 40 | could | 1 | should | 11 | should | 11 | until | 16 | until | 15 |
| | could | 30 | unless | 1 | could | 9 | could | 10 | could | 9 | unless | 14 |
As can be seen, one of the most frequent epistemic cues in each corpus is may. If, possible, might, and suggest also occur frequently in our data set.
The distribution of the uncertainty cues was also analyzed from the perspective of uncertainty classes in each corpus, which is presented in Figure 4. In most of the corpora, epistemic cues are the most frequent (except for FactBank) and they vary the most: Out of the 300 cue stems occurring in the corpora, 206 are epistemic cues. Comparing the domains, it can readily be seen that doxastic uncertainty is not frequent in biological texts, which is especially true for abstracts, whereas in FactBank and WikiWeasel doxastic cues cover about 27% of the data. The most frequent doxastic keywords exhibit some domain-specific differences, however: In BioScope, the most frequent ones include putative and hypothesis, which rarely occur in FactBank and WikiWeasel. In contrast, cues belonging to the investigation class can be found almost exclusively in scientific texts (89% of them are in BioScope), which can be expected because the aim of scientific publications is to examine whether a hypothesized phenomenon occurs. Among the most frequent cues, investigate, examine, and study belong to this group. These data reveal that the frequency of doxastic and investigation cues is strongly domain-dependent, and this explains the fact that the investigation vocabulary is very limited in FactBank and WikiWeasel: Only about 10 cue stems belong to this uncertainty class in these corpora. The set of condition cue stems, however, is very small in each corpus; altogether 18 condition cue stems can be found in the data, although if and would are responsible for almost 75% of condition cue occurrences. It should also be mentioned that the percentage of condition cues is higher in FactBank than in the other corpora.
Another interesting trend was observed when word forms were considered instead of stemmed forms: Certain verbs in third person singular (e.g., expects or believes) occur mostly in FactBank and WikiWeasel. The reason for this may be that when speaking about someone else's opinion in scientific discourse, the source of the opinion is usually provided in the form of references or citations—usually at the end of the sentence—and due to this, the verb is often used in the passive form, as in Example (9).
- (9)
It is currently believed that both RAG1 and RAG2 proteins were originally encoded by the same transposon recruited in a common ancestor of jawed vertebrates [3,12,13,16].
A genre-related difference between scientific abstracts and full papers is that condition cues can rarely be found in abstracts, although they occur more frequently in papers (with the non-cue usage still being much more frequent). Another difference is the percentage of cues of the investigation type, which may be related to the structure of abstracts. Biological abstracts usually present the problem they examine and describe methods they use. This entails the application of predicates belonging to the investigation class of uncertainty. It can be argued, however, that scientific papers also have these characteristics but abstracts are much shorter than papers (generally, they contain about 10–12 sentences). Hence, investigation cues are responsible for a greater percentage of cues.
There are some lexical differences among the corpora that are related to domain or genre specificity. For instance, due to their semantics, the words charge, accuse, allege, fear, worry, and rumor are highly unlikely to occur in scientific publications, but they occur relatively often in news texts and in Wikipedia articles. As for lexical divergences between abstracts and papers, many of them are related to verbs of investigation and their different usage. In the corpora, verbs of investigation were marked only if it was not clear whether the event/phenomenon would take place or not. If it has already happened (The police are investigating the crime) or the existence of the thing under investigation can be stated with certainty, independently of the investigation (The top ten organisms were examined), then they are not instances of hypotheses, so they were not annotated. As the data sets make clear, there were some candidates of investigation verbs that occurred in the investigation sense mostly in abstracts but in another sense in papers, especially in the bmc data set (e.g., assess or examine). Evaluate also had a special mathematical sense in bmc papers, which did not occur in abstracts.
It can also be seen that some of the very frequent cues in papers do not occur (or occur only relatively rarely) in abstracts. This is especially true for the bmc data set, where can, if, would, could, and will are among the 15 most frequent cues and represent 23.21% of cue occurrences, but only 3.85% in abstracts. It is also apparent that the rate of epistemic cues is lower in bmc papers than in abstracts or other types of papers.
Genre-dependent characteristics can be analyzed if BioScope abstracts and hbc papers are compared because their fine-grained domain is the same. Thus, it may be assumed that differences between their cues are related to the genre. The sets of cues used are similar, but the sense distributions may differ for certain ambiguous cues. For instance, indicate mostly appears in the ‘suggest’ sense in abstracts, whereas in papers it is used in the ‘signal’ sense. Another difference is that the percentage rate of doxastic cues is almost twice as high in papers as in abstracts (10.6% and 5.7%, respectively). Besides these differences, the two data sets are quite similar.
Domain-related differences can be analyzed when the three subdomains of biological papers are contrasted. As stressed earlier, bmc papers contain fewer instances of epistemic uncertainty, but condition cues occur more frequently in them. Nevertheless, fly and hbc papers are rather similar in these respects but hbc papers contain more investigation cues than the other two subcorpora. As regards lexical issues, the non-cue usage of possible in comparative constructions is more frequent in the bmc data set than in the other papers and many occurrences of if in bmc are related to definitions, which were not annotated as uncertain. On the basis of this information, the fly and the hbc domains seem to be more similar to each other than to the BMC data set from a linguistic point of view.
From the perspective of genre and domain adaptation, the following points should be highlighted concerning the distribution of uncertainty cues across corpora. Doxastic uncertainty is of primary importance in the news and encyclopedia domains whereas the investigation class is characteristic of the biological domain. Within the latter, there is a genre-related difference as well: It is the epistemic and investigation classes that are mainly present in abstracts whereas in papers cues belonging to other uncertainty classes can also be found. Thus, when applying techniques developed for biological texts or abstracts to news texts, for example, doxastic uncertainty cues deserve special attention as it might well be the case that there are insufficient training examples for this class of uncertainty cues. Conversely, the adaptation of an uncertainty cue detector constructed for encyclopedia texts requires the special treatment of investigation cues if, for instance, scientific discourse is the target genre, since such cues are underrepresented in the source genre.
5.2 Evaluation Metrics
As evaluation metrics, we used cue-level and sentence-level Fβ = 1 scores for the uncertain class (the standard evaluation metrics of Task 1 of the CoNLL-2010 shared task) and denote them by Fcue and Fsent, respectively. We report cue-level Fβ = 1 scores on the individual subcategories of uncertainty and the unlabeled (binary) Fβ = 1 scores as well. A sentence is treated as uncertain (in the gold standard and prediction) iff it contains at least one cue. Note that the cue-level metric is quite strict as it is based on recognized phrases—that is, only cues with perfect boundary matches are true positives. For the sentence-level evaluation we simply labeled those sentences as uncertain that contained at least one recognized cue.
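A minimal sketch of these two metrics is given below; cues are represented as (sentence_id, start, end, type) tuples, only exact span (and label) matches count as true positives, and the unlabeled variant is obtained by simply dropping the type field. The data representation is an assumption made for illustration.

```python
def f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def cue_level_f1(gold_cues, pred_cues):
    gold, pred = set(gold_cues), set(pred_cues)
    tp = len(gold & pred)                      # perfect boundary (and label) matches only
    return f1(tp, len(pred - gold), len(gold - pred))

def sentence_level_f1(gold_cues, pred_cues):
    gold_unc = {cue[0] for cue in gold_cues}   # sentences with at least one gold cue
    pred_unc = {cue[0] for cue in pred_cues}   # sentences with at least one predicted cue
    tp = len(gold_unc & pred_unc)
    return f1(tp, len(pred_unc - gold_unc), len(gold_unc - pred_unc))
```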
5.3 Cross-Domain Cue Recognition Model
In order to minimize the development cost of a labeled corpus and an uncertainty detector for a new genre/domain, we need to induce an accurate model from a minimal amount of labeled data, or take advantage of existing corpora for different genres and/or domains and use a domain adaptation approach. Experiments investigating the value and sufficiency of existing corpora (which are usually out-of-domain) and simple domain adaptation methods were carried out. For this purpose, we implemented a cue recognition model, which is described in this section.
To train our models, we applied surface level (e.g., capitalization) and shallow syntactic features (part-of-speech tags and chunks) and avoided the use of lexicon-based features listing potential cue words, in order to reduce the domain dependence of the learned models. Below we introduce our model, which is competitive with state-of-the-art systems, and focus on its domain adaptability. We also describe the implementation details of the learning model and the features employed. We should add, however, that the optimization of a cue detector was not the main focus of our study.
5.3.1 Feature Set
We extracted two types of features for each token to describe the token itself, together with its local context in a window of limited size (1, 2, or no window, depending on the feature).
The first group consists of features describing the surface form of the tokens. Here we provide the list of the surface features with the corresponding window sizes:
- Stems of the tokens by the Porter stemmer in a window of size 2 (the current token and two tokens to the left and right).
- Surface pattern of the tokens in a window of size 1 (the current token and one token to the left and right). These patterns are similar to the word shape feature described in Sun et al. (2007) and capture capitalization and other orthographic properties. A pattern represents each run of characters of the same type in a word with a single character. There are six pattern types, denoting capitalized and lowercased character sequences with "A" and "a", number sequences with "0", Greek letter sequences with "G" and "g", Roman numerals with "R" and "r", and non-alphanumerical characters with "!". A sketch of this pattern feature is given after this list.
- Prefixes and suffixes of word forms from three to five characters long.
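A minimal sketch of the surface pattern feature follows; the exact character classes and the treatment of Greek letters and Roman numerals in our implementation may differ slightly, and all names are illustrative.

```python
import re

GREEK = re.compile(r'[\u0370-\u03ff]')            # Greek letters
ROMAN = re.compile(r'^[IVXLCDM]+$|^[ivxlcdm]+$')  # crude Roman numeral test

def char_class(c):
    """Map a character to one of the pattern symbols A/a/0/G/g/!."""
    if GREEK.match(c):
        return 'G' if c.isupper() else 'g'
    if c.isdigit():
        return '0'
    if c.isalpha():
        return 'A' if c.isupper() else 'a'
    return '!'

def surface_pattern(token):
    """Collapse runs of identically classified characters to one symbol,
    e.g. 'IL-2' -> 'A!0', 'Usually' -> 'Aa', 'p53' -> 'a0'."""
    if ROMAN.match(token):
        return 'R' if token.isupper() else 'r'
    pattern = []
    for c in token:
        cls = char_class(c)
        if not pattern or pattern[-1] != cls:
            pattern.append(cls)
    return ''.join(pattern)
```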
The second group of features describes the syntactic properties of the token and its local context. The list of the syntactic features with the corresponding window sizes is the following:
- Part-of-speech (POS) tags of the tokens, assigned by the C&C POS-tagger, in a window of size 2.
- Syntactic chunk of the tokens, as given by the C&C chunker,6 and the chunk code of the tokens in a window of size 2.
- Concatenated stem, POS, and chunk labels, similar to the features used by Tang et al. (2010). These feature strings combined the stem and the chunk code of the current token, the stem of the current token with the POS codes of the tokens to the left and right, and the chunk code of the current token with the stems of the neighboring tokens (see the sketch below).
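The sketch below illustrates how such token-level feature maps might be assembled (windowed stems, POS tags, chunk codes, and the conjunction features); the helper names are illustrative and the exact feature templates in our system may differ slightly.

```python
def token_features(sent, i):
    """sent: list of (stem, pos, chunk) triples for one sentence; i: token index.
    Returns a feature dict for token i (string-valued features for a CRF/Maxent toolkit)."""
    feats = {}

    def get(j, field):
        # Return the requested field of token j, or a padding symbol outside the sentence.
        return sent[j][field] if 0 <= j < len(sent) else '<PAD>'

    # Windowed stems, POS tags, and chunk codes (window of size 2).
    for offset in range(-2, 3):
        feats[f'stem[{offset}]'] = get(i + offset, 0)
        feats[f'pos[{offset}]'] = get(i + offset, 1)
        feats[f'chunk[{offset}]'] = get(i + offset, 2)

    # Conjunction features in the spirit of Tang et al. (2010).
    feats['stem+chunk'] = get(i, 0) + '|' + get(i, 2)
    feats['pos[-1]+stem+pos[+1]'] = get(i - 1, 1) + '|' + get(i, 0) + '|' + get(i + 1, 1)
    feats['stem[-1]+chunk+stem[+1]'] = get(i - 1, 0) + '|' + get(i, 2) + '|' + get(i + 1, 0)
    return feats
```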
5.3.2 CoNLL-2010 Experiments.
The CoNLL-2010 shared task Learning to detect hedges and their scope in natural language text focused on uncertainty detection. Two subtasks were defined: The first sought to recognize sentences that contain some uncertain language in two different domains, and the second sought to recognize lexical cues together with their linguistic scope in biological texts (i.e., the text span, in terms of constituency grammar, that covers the part of the sentence modified by the cue). The lexical cue recognition subproblem of the second task7 is identical to the problem setting used in this study, the only major difference being the types of uncertainty addressed: In the CoNLL-2010 task, biological texts contained only the epistemic, doxastic, and investigation types of uncertainty. Apart from these differences, the CoNLL-2010 shared task offers an excellent testbed for comparing our uncertainty detection model with other state-of-the-art approaches and for contrasting different classification approaches. Here we present our detailed experiments on the CoNLL data sets, analyze the performance of our models, and select the most suitable models for further experiments.
CoNLL systems. The uncertainty detection systems submitted to the CoNLL shared task can be classified into three major types. The first set of systems treats the problem as a sentence classification task, that is, deciding whether a sentence contains any uncertain element or not. These models operate at the sentence level and are unsuitable for cue detection. The second group handles the problem as a token classification task and classifies each token independently as uncertain (or not); contextual information is only included in the form of feature functions. The third group handles the task as a sequential token labeling problem, that is, determining the most likely label sequence of a sentence in one step, taking information about neighboring labels into account. Sequence labeling and token classification approaches performed best for biological texts, whereas sentence-level models and token classification approaches gave the best results for Wikipedia texts (see Table 6 in Farkas et al. [2010]). Here we compare a state-of-the-art token classification and a sequence labeling approach using a shared feature representation to decide which model to use in further experiments.
Classifier models. We used a first-order linear-chain conditional random field (CRF) model as a sequence labeler and a Maximum Entropy (Maxent) classifier as a token classifier, both implemented in the Mallet package (McCallum 2002), for training the uncertainty cue detectors. This choice was motivated by the fact that these were the most popular classification approaches among the CoNLL-2010 participants, and that CRF models are known to provide high accuracy for detecting phrases with accurate boundaries (e.g., in named entity recognition). In each experimental set-up, we trained the models with their default settings in Mallet: the CRF for 200 iterations or until convergence, and the Maxent model until convergence.
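We trained our models with Mallet's Java implementations; purely as an illustration, a roughly equivalent set-up with the sklearn-crfsuite Python package (a stand-in for the toolkit we actually used, with toy data) might look as follows:

```python
import sklearn_crfsuite

# Toy data: each sentence is a list of per-token feature dicts (e.g., produced by
# token_features above); labels are BIO-style cue tags. Real experiments use the
# corpora described in Section 5.1.
X_train = [[{'stem[0]': 'may', 'pos[0]': 'MD'}, {'stem[0]': 'be', 'pos[0]': 'VB'}]]
y_train = [['B-epistemic', 'O']]

crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',             # first-order linear-chain CRF trained with L-BFGS
    max_iterations=200,            # mirrors the 200-iteration cap we used in Mallet
    all_possible_transitions=True,
)
crf.fit(X_train, y_train)
print(crf.predict(X_train))        # predicted label sequences
```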
As a baseline, we applied a simple dictionary-based approach that classifies as uncertain every uni- and bigram that is tagged as uncertain in over 50% of its occurrences in the training data. It is thus similar to the system presented by Tjong Kim Sang (2010), without tuning the decision threshold for predicting uncertainty.
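A minimal sketch of this baseline under the stated 50% rule (illustrative helper names; cue spans are assumed to be one or two tokens long) is the following: collect every annotated cue uni- and bigram, keep those tagged as uncertain in more than half of their occurrences, and mark every match at prediction time.

```python
from collections import Counter

def train_baseline(sentences, cue_spans):
    """sentences: lists of tokens; cue_spans: per-sentence (start, end) cue offsets
    (end exclusive). Returns the set of n-grams kept as cues."""
    cue_counts, total_counts = Counter(), Counter()
    for tokens, spans in zip(sentences, cue_spans):
        for s, e in spans:
            if e - s <= 2:                          # uni- and bigram cues only
                cue_counts[tuple(tokens[s:e])] += 1
    for tokens in sentences:
        for n in (1, 2):
            for i in range(len(tokens) - n + 1):
                total_counts[tuple(tokens[i:i + n])] += 1
    return {ng for ng, c in cue_counts.items() if c / total_counts[ng] > 0.5}

def predict_baseline(cue_ngrams, tokens):
    """Mark every occurrence of a known cue n-gram as a predicted cue span."""
    spans = []
    for n in (1, 2):
        for i in range(len(tokens) - n + 1):
            if tuple(tokens[i:i + n]) in cue_ngrams:
                spans.append((i, i + n))
    return spans
```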
CoNLL results. An overview of the results achieved on the CoNLL-2010 data sets can be found in Table 4. A comparison of our models with the CoNLL systems reveals that our uncertainty detection model is very competitive on the biological data set. Our CRF model trained on the official training data set of the shared task achieved a cue-level F-score of 81.4 and a sentence-level F-score of 87.0 on the biological evaluation data set. These results would have ranked first in the shared task, with a marginal difference compared to the top-performing participant. In contrast, our model is less competitive on the Wikipedia data set: The Maxent model achieved a cue-level F-score of 22.3 and a sentence-level F-score of 58.1 on the Wikipedia evaluation data set, whereas our CRF model was not competitive with the best participating systems. The observation that sequence labeling models perform worse than token-based approaches on Wikipedia, especially on the sentence-level evaluation measure, coincides with the findings of the shared task: The discourse-level uncertainty cues in the Wikipedia data set are rather long and heterogeneous, and sequence labeling models often revert to not annotating any token in a sentence when the phrase boundaries are hard to detect. Still, sequence labeling models have an advantage in terms of cue-level accuracy, which is not surprising because the CRF is a state-of-the-art model for chunking/sequence labeling tasks.
Table 4
Cue-level and sentence-level F-scores on the CoNLL-2010 evaluation data sets.

| System | BIOLOGICAL Fcue | BIOLOGICAL Fsent | WIKIPEDIA Fcue | WIKIPEDIA Fsent |
| --- | --- | --- | --- | --- |
| BASELINE | 74.5 | 81.4 | 19.5 | 58.6 |
| TOKEN/MAXENT | 79.7 | 85.8 | 22.3 | 58.1 |
| SEQUENCE/CRF | 81.4 | 87.0 | 32.7 | 47.0 |
| BEST/SEQ (Tang et al. 2010) | 81.3 | 86.4 | 36.5 | 55.0 |
| BEST/TOK BIO (Velldal, Øvrelid, and Oepen 2010) | 78.7 | 85.2 | – | – |
| BEST/TOK WIKI (Morante, Van Asch, and Daelemans 2010) | 76.7 | 81.7 | 11.3 | 57.3 |
| BEST/SENT BIO (Täckström et al. 2010) | – | 85.2 | – | 55.4 |
| BEST/SENT WIKI (Georgescul 2010) | – | 78.5 | – | 60.2 |
We conclude from Table 4 that our model is competitive with state-of-the-art systems for detecting semantic uncertainty (which is closer to the biological subtask), but it is less suited to recognizing discourse-level uncertainty. In the subsequent experiments we therefore used our CRF model, which performed best in detecting uncertainty cues in natural language sentences.
5.3.3 Domain Adaptation Model.
In supervised machine learning, the task is to learn how to make predictions on previously unseen examples based on a statistical model learned from a collection of labeled training examples (i.e., a set of examples coupled with the desired output for them). The classification setting assumes a set of labels L, a set of features X, and a probability distribution p(X) describing the examples in terms of their features. The training examples are then assumed to be given in the form of {x_i, l_i} pairs, and the goal of classification is to estimate the label distribution p(L|X), which can be used later on to predict the labels of unseen examples.
Domain adaptation focuses on the problem where the same (or a closely related) learning task has to be solved in multiple domains that have different characteristics in terms of their features: The set of features X may be different, or the probability distributions p(X) describing the inputs may be different. When the target tasks are treated as different (but related), the label distribution p(L|X) is dependent on the domain. That is, given a domain d, the problem can be formalized as modeling p(L|X)_d based on X_d, p(X)_d, and a set of examples {x_{i,d}, l_i}.8 In the context of domain adaptation, there is a target domain t and a source domain s, with labeled data available for both, and the goal is to induce a more accurate target domain model p(L|X)_t from {x_{i,t}, l_i} ∪ {x_{i,s}, l_i} than the one learned from {x_{i,t}, l_i} only. In practical scenarios, the goal is to exploit the source data to acquire an accurate model from limited target data that alone would be insufficient to train an accurate in-domain model, and thus to port the model to a new domain with moderate annotation costs. The problem is difficult because it is nontrivial for a learning method to account for the different data (and label) distributions of target and source; ignoring this difference causes a marked drop in model accuracy when the model is applied to examples taken from the target domain.
In our experimental context, both topic- and genre-related differences between texts pose an adaptation problem, as these factors have an impact on both the vocabulary (p(X)) and the sense distributions of the cues (p(L|X)) found in different texts. There is some confusion in the literature regarding the terminology describing the various domain mismatches in the learning problem. For example, Daumé III (2007) describes a domain adaptation method under the assumption that the label distribution is unchanged (we note that this assumption is not exploited in the method, and that the label distribution does change in our problem), whereas Pan and Yang (2010) use the term inductive transfer learning to refer to our scenario (in their paper, domain adaptation refers to a different setting).9 In this study we always use the term domain adaptation to refer to our problem setting, that is, where both p(X) and p(L|X) are assumed to change.
In our experiments, we used various data sets taken from multiple genres and domains (see Section 5.1.1 for an overview) and applied a simple but effective domain adaptation model (Daumé III 2007) for training our classifiers. In this model, domain adaptation is carried out by defining each feature over the target and source data sets twice: once for target domain instances only, and once for both target and source domain instances. Formally, given a target domain t, a source domain s, and n features {f_1, f_2, ..., f_n}, for each f_i we have a target-only version f_{i,t} and a shared version f_{i,t+s}. Each target domain example is described by 2n features, {f_{1,t}, ..., f_{n,t}, f_{1,t+s}, ..., f_{n,t+s}}, whereas source domain examples are described by only the n shared features {f_{1,t+s}, ..., f_{n,t+s}}. Using the union of the source and target training data sets {x_{i,t}, l_i} ∪ {x_{i,s}, l_i} and this feature representation, any standard supervised machine learning technique can be applied, and the algorithm can learn target-dependent and shared patterns at the same time, handling the changes in the underlying distributions. This easy domain adaptation technique has been found to work well in many NLP tasks. We used the CRF models introduced above, and in this way we were able to exploit feature–label correspondences across domains (for features that behave consistently across domains) and also to learn patterns specific to the target domain.
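The augmentation itself amounts to a few lines of code; a minimal sketch for dict-valued feature maps (feature and prefix names illustrative) is:

```python
def augment(features, domain):
    """Daumé III (2007) 'frustratingly easy' feature augmentation.
    features: dict mapping feature name -> value for one instance;
    domain: 'target' or 'source'. Source instances get only the shared
    copy of each feature; target instances get a shared and a target-specific copy."""
    augmented = {f'shared::{name}': value for name, value in features.items()}
    if domain == 'target':
        augmented.update({f'target::{name}': value for name, value in features.items()})
    return augmented

# The same surface feature ends up in different feature spaces depending on
# which domain the instance was extracted from.
print(augment({'stem[0]': 'may'}, 'source'))  # {'shared::stem[0]': 'may'}
print(augment({'stem[0]': 'may'}, 'target'))  # shared copy plus target-specific copy
```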
5.4 Cross-Domain and Genre Experiments
We defined several settings (target and source pairs) with varying domain and genre distances and target data set sizes. These experiments allowed us to study the potential of transferring knowledge across existing corpora for the accurate detection of uncertain language in a wide variety of text types. In our experiments, we used all the combinations of genres and domains that we found plausible. News texts (and their subdomains) were not used as source data because FactBank is significantly smaller than the other corpora (WikiWeasel or the scientific texts); as the source data set is typically larger than the target data set in practical scenarios, news texts were only used as target data. Abstracts were only used as source data because information extraction typically addresses full texts, whereas abstracts just provide annotated data for development purposes. Apart from these restrictions, we experimented with all possible target and source pairs.
We used four different machine learning settings for each target–source pair. In the purely cross-domain (CROSS) setting, the model was trained on the source domain and evaluated on the target (i.e., no labeled target domain data were used for training). In the purely in-domain setting (TARGET), we performed 10-fold cross-validation on the target data (i.e., no source domain data were used). In the two domain adaptation settings, we again performed 10-fold cross-validation on the target data but also exploited the source data set (as described in Section 5.3); here, we either used every sentence of the source data set (DA/ALL) or only those sentences that contained a cue observed in the target training data set (DA/CUE).
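The DA/CUE filtering step can be sketched as follows (hypothetical helper; target_cues is the set of cue word forms observed as annotated cues in the target training folds):

```python
def filter_source_sentences(source_sentences, source_cue_spans, target_cues):
    """Keep only those source sentences that contain at least one cue
    also observed (as an annotated cue) in the target training data."""
    kept_sentences, kept_spans = [], []
    for tokens, spans in zip(source_sentences, source_cue_spans):
        cue_forms = {' '.join(tokens[s:e]).lower() for s, e in spans}
        if cue_forms & target_cues:
            kept_sentences.append(tokens)
            kept_spans.append(spans)
    return kept_sentences, kept_spans
```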
Table 5 lists the results obtained on various target and source domains in various machine learning settings and Table 6 contains the absolute differences between a particular result and the in-domain (TARGET) results.
Table 5
Cue-level and sentence-level F-scores for each target–source pair in the CROSS, TARGET, DA/ALL, and DA/CUE settings.

| TARGET | SOURCE |  | DIST | CROSS Fcue | CROSS Fsent | TARGET Fcue | TARGET Fsent | DA/ALL Fcue | DA/ALL Fsent | DA/CUE Fcue | DA/CUE Fsent |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| enc | sci_paper+_abs | 0.9 | ++/++ | 68.0 | 74.2 | 82.4 | 87.4 | 82.6 | 87.6 | 82.6 | 87.6 |
| news | sci_paper+_abs | 6.2 | ++/++ | 64.4 | 70.5 | 68.7 | 77.1 | 72.7 | 79.5 | 73.8 | 81.0 |
| news | enc | 6.6 | ++/++ | 68.2 | 74.8 | 68.7 | 77.1 | 73.7 | 81.2 | 73.1 | 80.0 |
| sci_paper | enc | 2.7 | ++/++ | 67.8 | 75.1 | 78.8 | 84.4 | 80.0 | 85.9 | 79.8 | 85.4 |
| sci_paper_bmc | sci_abs_hbc | 4.3 | +/+ | 58.2 | 70.5 | 64.0 | 74.5 | 68.1 | 76.7 | 69.3 | 77.8 |
| sci_paper_fly | sci_abs_hbc | 3.4 | +/+ | 70.5 | 79.1 | 80.0 | 85.1 | 83.3 | 88.2 | 82.9 | 87.8 |
| sci_paper_hbc | sci_abs_hbc | 8.2 | −/+ | 76.5 | 82.9 | 74.2 | 80.2 | 84.2 | 88.6 | 83.0 | 88.9 |
| sci_paper_bmc | sci_paper_fly+_hbc | 1.8 | +/− | 69.8 | 77.6 | 64.0 | 74.5 | 70.0 | 78.2 | 69.4 | 78.1 |
| sci_paper_fly | sci_paper_bmc+_hbc | 1.2 | +/− | 78.4 | 83.5 | 80.0 | 85.1 | 82.6 | 87.0 | 82.9 | 87.0 |
| sci_paper_hbc | sci_paper_bmc+_fly | 4.4 | +/− | 81.7 | 85.9 | 74.2 | 80.2 | 80.7 | 86.9 | 80.7 | 85.9 |
| AVERAGE |  |  |  | 70.4 | 77.4 | 73.5 | 80.6 | 77.8 | 84.0 | 77.8 | 84.0 |
Table 6
Absolute differences between each setting and the in-domain (TARGET) results; the TARGET columns repeat the absolute scores.

| TARGET | SOURCE |  | DIST | CROSS ΔFcue | CROSS ΔFsent | TARGET Fcue | TARGET Fsent | DA/ALL ΔFcue | DA/ALL ΔFsent | DA/CUE ΔFcue | DA/CUE ΔFsent |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| enc | sci_paper+_abs | 0.9 | ++/++ | −14.4 | −13.2 | 82.4 | 87.4 | 0.2 | 0.2 | 0.2 | 0.2 |
| news | sci_paper+_abs | 6.2 | ++/++ | −4.3 | −6.6 | 68.7 | 77.1 | 4.0 | 2.4 | 5.1 | 3.9 |
| news | enc | 6.6 | ++/++ | −0.5 | −2.3 | 68.7 | 77.1 | 5.0 | 4.1 | 4.4 | 2.9 |
| sci_paper | enc | 2.7 | ++/++ | −11.0 | −9.3 | 78.8 | 84.4 | 1.2 | 1.5 | 1.0 | 1.0 |
| sci_paper_bmc | sci_abs_hbc | 4.3 | +/+ | −5.8 | −4.0 | 64.0 | 74.5 | 4.1 | 2.2 | 5.3 | 3.3 |
| sci_paper_fly | sci_abs_hbc | 3.4 | +/+ | −9.5 | −6.0 | 80.0 | 85.1 | 3.3 | 3.1 | 2.9 | 2.7 |
| sci_paper_hbc | sci_abs_hbc | 8.2 | −/+ | 2.3 | 2.7 | 74.2 | 80.2 | 10.0 | 8.4 | 8.8 | 8.7 |
| sci_paper_bmc | sci_paper_fly+_hbc | 1.8 | +/− | 5.8 | 3.1 | 64.0 | 74.5 | 6.0 | 3.7 | 5.4 | 3.6 |
| sci_paper_fly | sci_paper_bmc+_hbc | 1.2 | +/− | −1.6 | −1.6 | 80.0 | 85.1 | 2.6 | 1.9 | 2.9 | 1.9 |
| sci_paper_hbc | sci_paper_bmc+_fly | 4.4 | +/− | 7.5 | 5.7 | 74.2 | 80.2 | 6.5 | 6.7 | 6.5 | 5.7 |
| AVERAGE |  |  |  | −3.1 | −3.2 | 73.5 | 80.6 | 4.3 | 3.4 | 4.3 | 3.4 |
Fine-grained semantic uncertainty classification results are summarized in Tables 7 and 8. Table 7 contrasts the coarse-grained Fcue with the unlabeled (binary) Fcue of the fine-grained experiments, and therefore quantifies the loss in accuracy due to the more difficult classification setting and the increased sparseness of the task. Table 8 shows the per-class Fcue scores, that is, how accurately our model recognizes the individual uncertainty types.
Table 7
Coarse-grained Fcue (Fbin) contrasted with the unlabeled Fcue of the fine-grained experiments (Funl).

| TARGET | SOURCE |  | DIST | CROSS Fbin | CROSS Funl | TARGET Fbin | TARGET Funl | DA/ALL Fbin | DA/ALL Funl | DA/CUE Fbin | DA/CUE Funl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| enc | sci_paper+_abs | 0.9 | ++/++ | 68.0 | 67.4 | 82.4 | 82.4 | 82.6 | 81.9 | 82.6 | 81.7 |
| news | sci_paper+_abs | 6.2 | ++/++ | 64.4 | 59.9 | 68.7 | 66.4 | 72.7 | 71.5 | 73.8 | 71.8 |
| news | enc | 6.6 | ++/++ | 68.2 | 67.0 | 68.7 | 66.4 | 73.7 | 73.6 | 73.1 | 73.4 |
| sci_paper | enc | 2.7 | ++/++ | 67.8 | 67.2 | 78.8 | 78.3 | 80.0 | 80.2 | 79.8 | 79.5 |
| sci_paper_bmc | sci_abs_hbc | 4.3 | +/+ | 58.2 | 66.3 | 64.0 | 61.9 | 68.1 | 68.5 | 69.3 | 67.9 |
| sci_paper_fly | sci_abs_hbc | 3.4 | +/+ | 70.5 | 78.7 | 80.0 | 79.2 | 83.3 | 83.4 | 82.9 | 83.2 |
| sci_paper_hbc | sci_abs_hbc | 8.2 | −/+ | 76.5 | 83.6 | 74.2 | 69.3 | 84.2 | 83.1 | 83.0 | 83.4 |
| sci_paper_bmc | sci_paper_fly+_hbc | 1.8 | +/− | 69.8 | 69.7 | 64.0 | 61.9 | 70.0 | 69.5 | 69.4 | 65.9 |
| sci_paper_fly | sci_paper_bmc+_hbc | 1.2 | +/− | 78.4 | 77.7 | 80.0 | 79.2 | 82.6 | 82.1 | 82.9 | 82.5 |
| sci_paper_hbc | sci_paper_bmc+_fly | 4.4 | +/− | 81.7 | 81.9 | 74.2 | 69.3 | 80.7 | 81.3 | 80.7 | 81.2 |
| AVERAGE |  |  |  | 70.4 | 71.9 | 73.5 | 71.4 | 77.8 | 77.5 | 77.8 | 77.0 |
Table 8
Per-class cue-level F-scores for the four uncertainty types (Fcrs: CROSS, Ftgt: TARGET, Fda: domain adaptation).

| TARGET | SOURCE | EPISTEMIC Fcrs | EPISTEMIC Ftgt | EPISTEMIC Fda | INVESTIGATION Fcrs | INVESTIGATION Ftgt | INVESTIGATION Fda | DOXASTIC Fcrs | DOXASTIC Ftgt | DOXASTIC Fda | CONDITION Fcrs | CONDITION Ftgt | CONDITION Fda |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| enc | sci_paper+_abs | 75.9 | 83.4 | 82.8 | 67.3 | 67.5 | 70.4 | 48.8 | 89.2 | 88.1 | 54.4 | 62.6 | 61.2 |
| news | sci_paper+_abs | 70.9 | 65.4 | 75.2 | 79.5 | 75.9 | 83.1 | 39.1 | 68.9 | 71.3 | 47.2 | 57.1 | 57.5 |
| news | enc | 65.4 | 65.4 | 74.5 | 74.6 | 75.9 | 87.5 | 76.3 | 68.9 | 78.0 | 50.6 | 57.1 | 56.7 |
| sci_paper | enc | 72.9 | 81.2 | 81.9 | 36.5 | 72.9 | 72.4 | 63.6 | 74.9 | 79.8 | 57.0 | 58.9 | 59.7 |
| sci_paper_bmc | sci_abs_hbc | 71.5 | 68.3 | 72.6 | 56.1 | 37.7 | 58.1 | 68.1 | 61.9 | 69.4 | 45.5 | 45.0 | 49.5 |
| sci_paper_fly | sci_abs_hbc | 82.9 | 82.1 | 85.3 | 69.0 | 68.6 | 76.6 | 75.1 | 71.7 | 75.4 | 28.6 | 63.4 | 64.1 |
| sci_paper_hbc | sci_abs_hbc | 87.5 | 77.7 | 86.4 | 76.5 | 53.5 | 77.5 | 80.6 | 39.0 | 76.7 | 26.1 | 10.0 | 33.3 |
| sci_paper_bmc | sci_paper_fly+_hbc | 74.4 | 68.3 | 69.2 | 55.9 | 37.7 | 57.4 | 63.7 | 61.9 | 64.7 | 57.3 | 45.0 | 50.7 |
| sci_paper_fly | sci_paper_bmc+_hbc | 80.3 | 82.1 | 84.3 | 66.7 | 68.6 | 75.8 | 77.7 | 71.7 | 77.3 | 53.5 | 63.4 | 68.0 |
| sci_paper_hbc | sci_paper_bmc+_fly | 85.2 | 77.7 | 86.0 | 74.0 | 53.5 | 70.3 | 75.9 | 39.0 | 70.2 | 58.1 | 10.0 | 41.4 |
| AVERAGE |  | 76.7 | 75.2 | 79.8 | 65.6 | 61.2 | 72.9 | 66.9 | 64.7 | 75.1 | 47.8 | 47.3 | 54.2 |
The size of the target training data set proved to be an important factor in these investigations. Hence, we performed experiments with different target data set sizes. We used the DA/ALL model (which is more robust for extremely small target data sizes, e.g., 100–400 sentences) and performed the same 10-fold cross-validation on the target data set as in Tables 5–8. For each fold of the cross-validation, however, we used only n sentences (the x axis of the figures) from the target training data set and a fixed set of 4,000 source sentences, to alleviate the effect of varying data set sizes. Figure 5 depicts the learning curves for two target/source data set pairs.
6. Discussion
As Table 5 shows, incorporating labeled data from different genres and/or domains consistently improves the performance. The successful applicability of domain adaptation tells us that the problem of detecting uncertainty has similar characteristics across genres and domains. The uncertainty cue lexicons of different domains and genres indeed share a core vocabulary and despite the differences in sense distributions, labeled data from a different source improves uncertainty classification in a new genre and domain if the different data sets are annotated consistently. This justifies our aim to create a consistent representation of uncertainty that can be applied to multiple domains.
6.1 Domain Adaptation Results
The size of the target and source data sets largely determines to what extent external data can improve the results. The only case where domain adaptation had a negligible effect (an F-score gain of less than 1%) is where the target data set is itself very large. This is expected, as the more target data one has, the less crucial it is to incorporate additional data with some undesirable characteristics (differences in style, domain, certain/uncertain sense distribution, etc.).
The performance scores for the CROSS setting clearly reflect the domain/genre distance of the data sets: The more distant the domain and genre of the source and target data sets, the more the CROSS performance (where no labeled target data are used) degrades compared with the TARGET model. In general, when the distance between both the domain and the genre of the texts is substantial (++/++ and +/+ rows in Tables 5 and 6), this accounts for a 6–10% decrease in both the sentence- and cue-level F-scores. An exception is the case of the encyclopedic source and news target domains, where the performance is very close to the target domain performance. This indicates that these settings are not as different from each other as it might seem at first glance: The encyclopedic and news genres share many commonalities (compare the cue distributions in Figure 4, for instance). We verified this observation by using a knowledge-poor quantitative estimator of similarity between domains (Van Asch and Daelemans 2010): Using cosine as the similarity measure, the newswire and encyclopedia texts are found to be the second most similar domain pair in our experiments, with a score comparable to those obtained for the pairs of scientific article types bmc, hbc, and fly.
When there is a domain or genre match between source and target (−/+ and +/− rows in Tables 5 and 6), however, and the distance regarding the other factor is only moderate, the cross-training performance is close to, or even better than, the target-only results. That is, the larger amount of source training data balances the differences between the domains. These results indicate that the learned uncertainty classifiers can be directly applied to slightly different data sets. This suitability is due to the learned disambiguation models, which generalize well in similar settings. This is contrary to the findings of earlier studies, which built uncertainty detectors using seed examples and bootstrapping: Those models were not designed to learn any disambiguation model for the cue words found, and their performance degraded even for slightly different data (Szarvas 2008).
Comparing the two domain adaptation procedures DA/CUE and DA/ALL, adaptation via transferring only the source sentences that contain a target domain cue is, on average, comparable to transferring all the data from the source domain. In other words, when a small but sufficient amount of target data is available, it is enough to account for the source data corresponding to the uncertainty cues observed in the limited target data set. This observation has several consequences, namely:
- The source-only cues, or to be more precise, their disambiguation models, are not helpful for the target domains as they cannot be adapted. This is due to the differences between the source and target disambiguation models.
- Similarly, domain adaptation improves the disambiguation models for the observed target cues, rather than introducing new vocabulary into the target domain. This mechanism coincides with our initial goal of using domain adaptation to learn better semantic models. This effect is the opposite of how bootstrapping-based weakly supervised approaches improve the performance in an underresourced domain. This observation suggests a promising future direction of combining the two approaches to maximize the gains while minimizing the annotation costs.
In a general context, we can effectively extend the data for a given domain if we have robust knowledge of the potential uncertainty vocabulary for that domain. Given the wide variety of the domains and genres of our data sets, it is reasonable to suppose that they represent uncertain language in general quite well, and that the joint vocabularies provide a good starting point for targeted data development in further domains.
As regards the fine-grained classification results, Table 7 demonstrates that the fine-grained distinction results in only a small, or no, loss in performance. The coarse-grained model is slightly more accurate than the fine-grained model (counting correctly recognized but misclassified cues as true positives) in most settings. The largest difference is observed for the target-only settings, where no out-of-domain data are used for training and the data sets are accordingly smaller. A noticeable exception is when scientific abstracts are used for cross-training: In those settings the coarse-grained model performs poorly due to its lower recall, which we attribute to overfitting the special characteristics of abstracts. The fact that in fine-grained classification the CROSS results consistently outperform the TARGET models (see Table 8), even for distant domain pairs, also underlines that the increased sparseness caused by differentiating the various subtypes of uncertainty is an important factor only for smaller data sets. The improvement from domain adaptation is clearly more prominent in fine-grained than in coarse-grained classification, however: The individual cue types gain 5–10 percentage points in F-score from out-of-domain data and domain adaptation. Moreover, as Table 8 shows, for the domain pairs and fine-grained classes where a sufficient number of positive examples is at hand, the per-class Fcue scores are also around 80% and above. This means that it is possible to accurately identify the individual subtypes of semantic uncertainty, which also demonstrates the feasibility of the subcategorization and annotation scheme proposed in this study (Section 2). Other important observations are that domain adaptation is even more beneficial in the more difficult fine-grained classification setting, and that the condition class represents a challenge for our model. The performance for the condition class is lower than that for the other classes, which can only in part be attributed to the fact that this is the least represented subtype in our data sets: As opposed to other cue types, condition cues are typically used in many different contexts and they may belong to other uncertainty classes as well.
6.2 The Required Amount of Annotation
Based on our experiments, we may conclude that a manually annotated training data set consisting of 3,000–5,000 sentences is sufficient for training an accurate cue detector for a new genre/domain. The results of our learning curve experiments (Figure 5) illustrate the situations where only a limited amount of annotated data (fewer than 3,000 sentences) is available for the target domain. The feasibility of decreasing annotation efforts and the real added value of domain adaptation are more prominent in this range; with more target data, the TARGET results approach the DA results.
Figure 5 shows that the target training data set size at which the supervised TARGET setting overtakes the CROSS model (trained on 4,000 source sentences) is around 1,000 sentences. As mentioned earlier, even distant domain data can improve the cue recognition model in the absence of a sufficient target data set; Figure 5 supports this observation, as the CROSS and DA settings outperform the TARGET setting on each source–target data set pair. It can also be observed that the doxastic type is more domain-dependent than the others, and its results consistently improve as the size of the target domain annotation increases (which coincides with the cue frequency investigations of Section 5.1.3). In the news target domain, however, the investigation and epistemic classes benefit considerably from a small amount of annotated target data, but their performance scores increase only slightly after that. This indicates that most of the important domain-dependent (probably lexical) knowledge can be gathered from 100–400 sentences. From the biological experiments, we may conclude that the investigation class is already covered by the source domain (intuitively, investigation cues are well represented in the abstracts) and its results are not improved significantly by using more target data. The condition class is underrepresented in both the source and target data sets, and hence no reliable observations can be made regarding this subclass (see Table 2).
Overall, if we would like to have an uncertainty cue detector for a new genre/domain: (i) we can achieve a performance of around 60–70% by using cross-training, depending on the difference between the domains (i.e., without any annotation effort); (ii) by annotating around 3,000 sentences, we can reach a performance of 70–80%, depending on the level of difficulty of the texts; and (iii) we can get the same 70–80% results by annotating just 1,000 sentences and using domain adaptation.
6.3 Interesting Examples and Error Analysis
As might be expected, most of the erroneous cue predictions were due to vocabulary differences; for example, fear and accuse occurred only in news texts, which is why they were not recognized by models trained on biological or encyclopedia texts. Another example is the case of or, which is a frequent cue in biological texts but is rarely used as a cue in other domains: Without domain adaptation, the model trained on biological texts marks quite a few occurrences of or as cues in the news and encyclopedia domains. Many of these anomalies were eliminated by applying domain adaptation techniques, however.
Many errors were related to multi-class cues. These cues are especially hard to disambiguate because not only can they refer to several classes of uncertainty, but they typically have non-cue usage as well. For instance, the case of would is rather complicated because it can fulfill several functions:
- (10)
Epistemic usage (‘it is highly probable’): Further biochemical studies on the mechanism of action of purified kinesin-5 from multiple systems would obviously be fruitful. (Corpus: fly)
- (11)
Conditional: “If religion was a thing that money could buy,/The rich would live and the poor would die.” (Corpus: WikiWeasel)
- (12)
Future in the past: This Aarup can trace its history back to 1500, but it would be 1860's before it would become a town. (Corpus: WikiWeasel)
- (13)
Repeated action in the past (‘used to’): 'Becker' was the next T.V. Series for Paramount that Farrell would co-star in. (Corpus: WikiWeasel)
- (14)
Dynamic modality: Individuals would first have a small lesion at the site of the insect bite, which would eventually leave a small scar. (Corpus: WikiWeasel)
- (15)
Pragmatic usage: Although some would dispute the fact, the joke related to a peculiar smell that follows his person. (Corpus: WikiWeasel)
The last two uses of would are not typically described in grammars of English and seem to be characteristic primarily of the news and encyclopedia domains. Thus it is advisable to explore such cases and treat them with special consideration when adapting an algorithm trained and tested in a specific domain to another domain.
Another interesting example is may in its non-cue usage. Being (one of) the most frequent cues in each subcorpus, its non-cue usage is rather limited but can occasionally be found in FactBank and WikiWeasel. The following instance of may in FactBank was correctly marked as a non-cue by the cue detector when trained on Wikipedia texts. On the other hand, it was marked as a cue when trained on biological texts since, in this case, there were insufficient training examples of may not being a cue:
- (16)
“Well may we say ‘God save the Queen,’ for nothing will save the republic,” outraged monarchist delegate David Mitchell said. (Corpus: FactBank)
A final example to be discussed is concern. This word also has several uses:
- (17)
Noun meaning ‘company’: The insurance concern said all conversion rights on the stock will terminate on Nov. 30. (Corpus: FactBank)
- (18)
Noun meaning ‘worry’: Concern about declines in other markets, especially New York, caused selling pressure. (Corpus: FactBank)
- (19)
Preposition: The company also said it continues to explore all options concerning the possible sale of National Aluminum's 54.5% stake in an aluminum smelter in Hawesville, Ky. (Corpus: FactBank)
- (20)
Verb: Many of the predictions in these two data sets concern protein pairs and proteins that are not present in other data sets. (Corpus: bmc)
We think that an analysis of similar examples can further support domain adaptation and cue detection across genres and domains.
7. Conclusions and Future Work
In this article, we introduced an uncertainty cue detection model that can perform well across different domains and genres. Even though several types of uncertainty exist, available corpora and resources focus only on some of the possible types and thereby only cover particular aspects of the phenomenon. This means that uncertainty models found in the literature are heterogeneous, and the results of experiments on different corpora are hardly comparable. These facts motivated us to offer a unified model of semantic uncertainty enhanced by linguistic and computer science considerations. In accordance with this classification, we reannotated three corpora from several domains and genres using our uniform annotation guidelines.
Our results suggest that simple cross-training can be employed and achieves a reasonable performance (60–70% cue-level F-score) when no annotated data are at hand for a new domain. When some annotated data are available (here some means fewer than 3,000 annotated sentences for the target domain), domain adaptation techniques are the best choice: (i) they lead to a significant improvement compared to simple cross-training, and (ii) they can provide a reasonable performance with significantly less annotation. In our experiments, annotating 3,000 sentences and training only on these is roughly equivalent to annotating 1,000 sentences and using external data with domain adaptation. If the size of the training data set is sufficiently large (more than 5,000 sentences), the effect of incorporating additional data (which has some undesirable characteristics) is not crucial.
Comparing different domain adaptation techniques, we found that similar results could be attained when the source domain was filtered for sentences that contained cues in the target domain. This tells us that models learn to better disambiguate the cues seen in the target domain instead of finding new, unseen cues. In this sense, this approach can be regarded as a complementary method to weakly supervised techniques for lexicon extraction. A promising way to further minimize annotation costs while maximizing performance would be the integration of the two approaches, which we plan to investigate in the near future.
In our study, we did not address dynamic modality (due to the lack of annotated resources), but the detection of such phenomena is also desirable. For instance, dynamically modal events cannot be treated as certain; that is, the event of buying cannot be assigned the same truth value in They agreed to buy the company and They bought the company. Whereas the second sentence expresses a fact, the first one informs us about the intention of buying the company, which will certainly be carried out in a world where moral or business laws are observed, but at the moment it cannot be stated that the transaction will take place (i.e., that it is certain). Hence, in the future, we also intend to integrate the identification of dynamically modal cues into our uncertainty cue detector.
Acknowledgements
This work was supported in part by the NIH grants (project codenames MASZEKER and BELAMI) of the Hungarian government, by the German Ministry of Education and Research under grant SiDiM (grant no. 01IS10054G), and by the Volkswagen Foundation as part of the Lichtenberg-Professorship Program (grant no. I/82806). Richárd Farkas was funded by the Deutsche Forschungsgemeinschaft grant SFB 732.
Notes
Only 3 out of the more than 20 participants of the related CoNLL-2010 shared task (Farkas et al. 2010) managed to exploit out-of-domain data to improve their results, and only by a negligible margin.
The most successful CoNLL systems were based on these approaches, but different feature representations make direct comparisons difficult.
The entire typology of semantic uncertainty phenomena and a test battery for their classification are described in a supplementary file. Together with the corpora and the experimental software, they are available at http://www.inf.u-szeged.hu/rgai/uncertainty.
The corpora are available at http://www.inf.u-szeged.hu/rgai/uncertainty.
POS-tagging and chunking were performed on all corpora using the C&C Tools (Curran, Clark, and Bos 2007).
As an intermediate level, participants of the first task could submit the lexical cues found in sentences for evaluation, without their scope, which gave some insight into the nature of cue detection on the Wikipedia corpus (where scope annotation does not exist) as well.
The literature also describes the case when the set of labels depends on the domain, but we omit this case to simplify our notation and discussion. For details, see Pan and Yang (2010).
More on this can be found in Pan and Yang (2010) and at http://nlpers.blogspot.com/2007/11/domain-adaptation-vs-transfer-learning.html.
Author notes
Technische Universität Darmstadt, Ubiquitous Knowledge Processing (UKP) Lab, TU Darmstadt - FB 20 Hochschulstrasse 10, D-64289 Darmstadt, Germany. E-mail: {szarvas,gurevych}@tk.informatik.tu-darmstadt.de.
Hungarian Academy of Sciences, Research Group on Artificial Intelligence, Tisza Lajos krt. 103, 6720 Szeged, Hungary. E-mail: [email protected].
Universität Stuttgart, Institut für Maschinelle Sprachverarbeitung. Azenbergstrasse 12, D-70174 Stuttgart, Germany. E-mail: [email protected].
University of Szeged, Department of Informatics, Árpád tér 2, 6720 Szeged, Hungary. E-mail: [email protected].