Abstract
Philosophy of science has pointed out a problem of theoretical terms in empirical sciences. This problem arises if all known measuring procedures for a quantity of a theory presuppose the validity of this very theory, because then statements containing theoretical terms are circular. We argue that a similar circularity can happen in empirical computational linguistics, especially in cases where data are manually annotated by experts. We define a criterion of T-non-theoretical grounding as guidance to avoid such circularities, and exemplify how this criterion can be met by crowdsourcing, by task-related data annotation, or by data in the wild. We argue that this criterion should be considered as a necessary condition for an empirical science, in addition to measures for reliability of data annotation.
1. Introduction
The recent history of computational linguistics (CL) shows a trend towards encoding natural language processing (NLP) problems as machine learning tasks, with the goal of applying task-specific learning machines to solve the encoded NLP problems. In the following we will refer to such approaches as empirical CL approaches.
Machine learning tools and statistical learning theory play an important enabling and guiding role for research in empirical CL. A recent discussion in the machine learning community claims an even stronger and more general role of machine learning. We allude here to a discussion concerning the relation of machine learning and philosophy of science. For example, Corfield, Schölkopf, and Vapnik (2009) compare Popper's ideas of falsifiability of a scientific theory with “similar notions” from statistical learning theory regarding Vapnik-Chervonenkis theory. A recent NIPS workshop on “Philosophy and Machine Learning”1 presented a collection of papers investigating similar problems and concepts in the two fields. Korb (2004) sums up the essence of the discussion by directly advertising “Machine Learning as Philosophy of Science.”
In this article we argue that adopting machine learning theory as philosophy of science for empirical CL has to be done with great care. A problem arises in the application of machine learning methods to natural language data under the assumption that input–output pairs are given and do not have to be questioned. In contrast to machine learning, in empirical CL neither a representation of instances nor an association of instances and labels is always “given.” We show that especially in cases where data are manually annotated by expert coders, a problem of circularity arises if one and the same theory of measurement is used in data annotation and in feature construction. In this article, we use insights from philosophy of science to understand this problem. We particularly point to the “problem of theoretical terms,” introduced by Sneed (1971), that shows how circularities can make empirical statements in sciences such as physics impossible.
In the following, we will explain the problem of theoretical terms with the help of a miniature physical theory used in philosophy of science (Section 2). We will then exemplify this concept on examples from empirical CL (Section 3). We also make an attempt at proposing solutions to this problem by using crowdsourcing techniques, task-related annotation, or data in the wild (Section 4).
2. The Problem of Theoretical Terms in Philosophy of Science
In order to characterize the logical structure of empirical science, philosophy of science has extensively discussed the notions of “theoretical” and “observational” language. Sneed (1971)2 was the first to suggest a distinction between “theoretical” and “non-theoretical” terms of a given theory by means of the roles they play in that theory. Balzer (1996, page 140) gives a general definition that states that a term is “theoretical in theory T iff every determination of (a realization of) that term presupposes that T has been successfully applied beforehand.” Because there are no theory-independent terms in this view, an explicit reference to a theory T is always carried along when characterizing terms as theoretical with respect to T (T-theoretical) or non-theoretical with respect to T (T-non-theoretical). Stegmüller (1979) makes the notions of “determination” or “realization” more concrete by referring to procedures for measuring values of quantities or functions in empirical science:
What does it mean to say that a quantity (function) f of a physical theory T is T-theoretical?… In order to perform an empirical test of an empirical claim containing the T-theoretical quantity f, we have to measure values of the function f. But all known measuring procedures (or, if you like, all known theories of measurement of f-values) presuppose the validity of this very theory T. (page 17)
The problem can be illustrated by a miniature theory AS of beam balance scales (Archimedean statics): an entity consists of a finite set A of objects placed on a balance, a function d assigning to each object its (signed) distance from the pivot, and a function g assigning to each object its weight. Formally, x is an AS iff there are A, d, g such that:
1. x = 〈A, d, g〉,
2. A = {a1, …, an},
3. d : A → ℝ,
4. g : A → ℝ,
5. ∀a ∈ A : g(a) > 0,
6. Σa∈A d(a) · g(a) = 0 (the balance is in equilibrium).
An empirical claim then takes the form (2) of a statement that a certain concrete entity, for instance a particular beam balance with objects placed on it, is a model of the theory AS.
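To make the definition concrete, the following minimal Python sketch (with hypothetical names such as is_model_of_AS; signed distances encode the side of the pivot, and axiom (6) follows the equilibrium reading given above) checks whether a candidate entity 〈A, d, g〉 is a model of AS:

```python
# Minimal sketch: checking whether a candidate entity <A, d, g> satisfies
# axioms (2)-(6) of the miniature theory AS. Names and representation
# (dicts for d and g, signed distances) are illustrative assumptions.

def is_model_of_AS(A, d, g, tol=1e-9):
    if not A:                                   # (2) A is a finite, non-empty set of objects
        return False
    if not all(a in d and a in g for a in A):   # (3), (4) d and g are defined on all of A
        return False
    if not all(g[a] > 0 for a in A):            # (5) all weights are positive
        return False
    return abs(sum(d[a] * g[a] for a in A)) < tol  # (6) the balance is in equilibrium

# Toy entity: two objects balancing each other on a beam.
A = {"a1", "a2"}
d = {"a1": -2.0, "a2": 1.0}   # signed distances from the pivot
g = {"a1": 1.0, "a2": 2.0}    # weights, as measured by some procedure
print(is_model_of_AS(A, d, g))  # True: -2.0 * 1.0 + 1.0 * 2.0 = 0
```

Note that actually filling in the values of g requires some measuring procedure, which is where the problem discussed next arises.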
The crux of the problem of theoretical terms for the miniature theory AS is the measuring procedure for the function g that presupposes the validity of the theory AS. The term g is thus AS-theoretical. There are two solutions to this problem:
1. In order to avoid the use of AS-theoretic terms such as g, we could discard the assumption that our weight-measuring procedure uses beam balance scales. Instead we could use AS-non-theoretic measuring procedures such as spring scales. The miniature theory AS would no longer contain AS-theoretic terms. Thus we would be able to make empirical statements of the form (2), that is, statements about certain entities being models of the theory AS.
2. In complex physical theories such as particle mechanics there are no simplified assumptions on measuring procedures that can be dropped easily. Sneed (1971) proposed the so-called Ramsey solution3 that in essence avoids AS-theoretical terms by existentially quantifying over them.
Solution (1), where T-theoretical terms are measured by applications of a theory T′, thus is the standard case in empirical sciences. Solution (2) is a special case where we need theory T in order to measure some terms in theory T. Gadenne (1985) argues that this case can be understood as a tentative assumption of theory T that still makes empirical testing possible.4
The important point for our discussion is that in both solutions to the problem of theoretical terms, whether we refer to another theory T′ (solution (1)) or whether we tentatively assume theory T (solution (2)), we require an explicit dichotomy between T-theoretical and T-non-theoretical terms. This insight is crucial in the following analysis of possible circularities in the methodology of empirical CL.
3. The Problem of Theoretical Terms in Empirical CL
The problem of theoretical terms arises in empirical CL in cases where a single theoretical tier is used both in manual data annotation (i.e., in the assignment of labels y to patterns x via the encoding of data pairs (x, y)), and in feature construction (i.e., in the association of labels y to patterns x via features φ(x, y)).
This problem can be illustrated by looking at automatic methods for data annotation. For example, information retrieval (IR) in the patent domain uses citations of patents in other patents to automatically create relevance judgments for ranking (Graf and Azzopardi 2008). Learning-to-rank models such as that of Guo and Gomes (2009) define domain knowledge features on patent pairs (e.g., same patent class in the International Patent Classification [IPC], same inventor, same assignee company) and IR score features (e.g., tf-idf, cosine similarity) to represent data in a structured prediction framework. Clearly, one could have just as well used IPC classes to create automatic relevance judgments, and patent citations as features in the structured prediction model. It should also be evident that using the same criterion to automatically create relevance labels and as feature representation would be circular. In terms of the philosophical concepts introduced earlier, the theory of measurement of relevance used in data labeling cannot be the same as the theory expressed by the features of the structured prediction model; otherwise we exhibit the problem of theoretical terms.
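The contrast can be made explicit in a small sketch. The following Python fragment (with invented field names such as cites, ipc_classes, and assignee, not taken from any of the cited systems) shows a grounded set-up, where citations yield the relevance labels and IPC overlap yields a feature, next to a circular one, where the labeling criterion reappears among the features:

```python
# Sketch: label creation vs. feature construction for patent retrieval.
# Field names (id, cites, ipc_classes, assignee) are invented for illustration.

def relevance_label(query_patent, candidate_patent):
    """Automatic relevance judgment: the candidate counts as relevant
    if it is cited by the query patent."""
    return 1 if candidate_patent["id"] in query_patent["cites"] else 0

def grounded_features(q, c):
    """Features that do not use the citation criterion underlying the labels."""
    return {
        "same_ipc_class": int(bool(set(q["ipc_classes"]) & set(c["ipc_classes"]))),
        "same_assignee": int(q["assignee"] == c["assignee"]),
    }

def circular_features(q, c):
    """Adding the labeling criterion itself as a feature makes the set-up
    circular: the learner can simply read off the information that was
    used to create the labels."""
    feats = grounded_features(q, c)
    feats["cited_by_query"] = relevance_label(q, c)
    return feats
```

In the circular variant, high ranking accuracy would merely show that the model can recover the criterion used to generate the labels.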
This problem can also arise in scenarios of manual data annotation. One example is data annotation by expert coders: The expert coder's decisions about which labels to assign to which types of patterns may be guided by implicit or tacit knowledge that is shared among the community of experts. These experts may apply the very same knowledge to design features for their machine learning models. For example, in attempts to construct semantic annotations for machine learning purposes, the same criteria, such as negation tests, might be used to distinguish presupposition from entailment both in the labeling of data and in the construction of feature functions for a classifier to be trained and tested on these data. As in the example of automatic data annotation in patent retrieval, we exhibit the problem of theoretical terms in manual data annotation by experts in that the theory of measurement used in data annotation and feature construction is the same. The problem is exacerbated when a single expert annotator codes the data and later assumes the role of a feature designer using the “given” data. For example, in constructing a treebank for the purpose of learning a statistical disambiguation model for parsing with a hand-written grammar, the same person might act in different roles as grammar writer, as manual annotator using the grammar's analyses as candidate annotations, and as feature designer for the statistical disambiguation model.
The sketched scenarios are inherently circular in the sense of the problem of theoretical terms described previously. Thus in all cases, we are prevented from making empirical statements. High prediction accuracy of machine learning in such scenarios indicates high consistency in the application of implicit knowledge in different roles of a single expert or of groups of experts, but not more.
This problem of circularity in expert coding is related to the problem of reliability in data annotation, a solution to which is sought by methods for measuring and enhancing inter-annotator agreement. A seminal paper by Carletta (1996) and a follow-up survey paper by Artstein and Poesio (2008) have discussed this issue at length. Both papers refer to Krippendorff (2004, 1980a, page 428) who recommends that reliability data “have to be generated by coders that are widely available, follow explicit and communicable instructions (a data language), and work independently of each other….[T]he more coders participate in the process and the more common they are, the more likely they can ensure the reliability of data.” Ironically, it seems as if the best inter-annotator agreement is achieved by techniques that are in conflict with these recommendations, namely, by using experts (Kilgarriff 1999) or intensively trained coders (Hovy et al. 2006) for data annotation. Artstein and Poesio (2008) state explicitly that
experts as coders, particularly long-term collaborators, […] may agree not because they are carefully following written instructions, but because they know the purpose of the research very well–which makes it virtually impossible for others to reproduce the results on the basis of the same coding scheme …. Practices which violate the third requirement (independence) include asking the coders to discuss their judgments with each other and reach their decisions by majority vote, or to consult with each other when problems not foreseen in the coding instructions arise. Any of these practices make the resulting data unusable for measuring reproducibility. (page 575)
Reidsma and Carletta (2007) and Beigman Klebanov and Beigman (2009) reach the conclusion that high inter-annotator agreement is neither sufficient nor necessary to achieve high reliability in data annotation. The problem lies in the implicit or tacit knowledge that is shared among the community of experts. This implicit knowledge is responsible for the high inter-annotator agreement, but hinders reproducibility. In a similar way, implicit knowledge of expert coders can lead to a circularity in data annotation and feature modeling.
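For concreteness, one such agreement measure is Cohen's κ, which corrects the raw agreement between two coders for agreement expected by chance; the following sketch is our own minimal implementation, not the weighted or multi-coder variants discussed by Artstein and Poesio (2008). As argued above, a high value of such a coefficient cannot by itself rule out the circularity at issue here.

```python
from collections import Counter

def cohens_kappa(labels1, labels2):
    """Chance-corrected agreement between two coders on the same items."""
    assert len(labels1) == len(labels2) and labels1
    n = len(labels1)
    observed = sum(a == b for a, b in zip(labels1, labels2)) / n
    freq1, freq2 = Counter(labels1), Counter(labels2)
    expected = sum((freq1[c] / n) * (freq2[c] / n) for c in set(freq1) | set(freq2))
    return (observed - expected) / (1 - expected)

# Two coders labeling six items as presupposition ("pre") or entailment ("ent").
coder_a = ["ent", "pre", "ent", "ent", "pre", "ent"]
coder_b = ["ent", "pre", "pre", "ent", "pre", "ent"]
print(round(cohens_kappa(coder_a, coder_b), 2))  # 0.67: high but imperfect agreement
```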
4. Breaking the Circularity
Finke (1979), in attempting to establish criteria for an empirical theory of linguistics, demands that the use of a single theoretical strategy to identify and describe the entities of interest be excluded from empirical analyses. He recommends making the possibility of using T-non-theoretic strategies to identify observations the defining criterion for empirical sciences. That is, in order to make an empirical statement, the two tiers of a T-theoretical and a T-non-theoretical level are necessary, because a single theoretical tier prevents us from distinguishing empirical statements from non-empirical ones.
Let us call Finke's requirement the criterion of T-non-theoretical grounding.6 Moulines (see Balzer 1996, page 141) gives a pragmatic condition for T-non-theoreticity that can be used as a guideline: a term “is T-non-theoretical if there exists an acknowledged method of determination of [it] in some theory T′ different from T plus some link from T′ to T which permits the transfer of realizations of [it] from T′ into T.”
Balzer (1996) discusses a variety of more formal characterizations of the notion of T-(non-)theoretical terms. Although the pragmatic definition cited here is rather informal, it is sufficient as a guideline for discussing concrete examples and strategies to break the circularity in the methodology of empirical CL. In the following, we will exemplify how this criterion can be met by manual data annotation using naive coders, by embedding data annotation into a task extrinsic to the theory to be tested, or by using independently created language data that are available in the wild.
4.1 T-non-theoretical Grounding by Naive Coders and Crowdsourcing
Now that we have defined the criterion of T-non-theoretical grounding, we see that Krippendorff's (2004) request for “coders that are widely available, follow explicit and communicable instructions (a data language), and work independently of each other” can be regarded as a concrete strategy to satisfy our criterion. The key is the requirement for coders to be “widely available” and to work on the basis of “explicit and communicable instructions.” The need to communicate the annotation task to non-experts serves two purposes: On the one hand, the goal of reproducibility is supported by having to communicate the annotation task explicitly in written form. On the other hand, the “naive” nature of the annotators requires a verbalization in words comprehensible to non-experts, without the option of relying on implicit or tacit knowledge that is shared among expert annotators. The researcher is thus forced to describe the annotation task without using technical terms that are common among experts but unknown to naive coders.
Annotation by naive coders can be achieved by using crowdsourcing services such as Amazon's Mechanical Turk,7 or alternatively, by creating games with a purpose (von Ahn and Dabbish 2004; Poesio et al. 2013).8 Non-expert annotations created by crowdsourcing have been shown to provide expert-level quality if certain recommendations on experiment design and quality control are met (Snow et al. 2008). Successful examples of the use of crowdsourcing techniques for data annotation and system evaluation can be found throughout all areas of NLP (see Callison-Burch and Dredze [2010] for a recent overview). The main advantage of these techniques lies in the ability to achieve high-quality annotations at a fraction of the time and the expense of expert annotation. However, a less apparent advantage is the need for researchers to provide succinct and comprehensible descriptions of Human Intelligence Tasks, and the need to break complex annotation tasks down to simpler basic units of work for annotators. Receiving high-quality annotations with sufficient inter-worker agreement from crowdsourcing can be seen as a possible litmus test for a successful T-non-theoretical grounding of complex annotation tasks. Circularity issues will vanish because T-theoretical terms cannot be communicated directly to naive coders.
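A minimal quality-control sketch in this spirit (our own illustration, loosely following the general recommendations of Snow et al. 2008 rather than any specific protocol): collect redundant judgments per item from independent workers, aggregate by majority vote, and flag items with low inter-worker agreement as candidates for simpler or clearer task descriptions.

```python
from collections import Counter

def aggregate_judgments(judgments, min_agreement=0.7):
    """judgments maps an item id to the labels given by independent workers.
    Returns majority-vote labels plus a list of items whose inter-worker
    agreement falls below the threshold (e.g. candidates for rewording
    the task instructions)."""
    majority, flagged = {}, []
    for item, labels in judgments.items():
        label, count = Counter(labels).most_common(1)[0]
        majority[item] = label
        if count / len(labels) < min_agreement:
            flagged.append(item)
    return majority, flagged

judgments = {
    "query1-doc3": ["relevant", "relevant", "not_relevant"],
    "query1-doc7": ["relevant", "relevant", "relevant"],
}
labels, flagged = aggregate_judgments(judgments)
print(labels)   # {'query1-doc3': 'relevant', 'query1-doc7': 'relevant'}
print(flagged)  # ['query1-doc3']: only 2 of 3 workers agree (< 0.7)
```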
4.2 Grounding by Extrinsic Evaluation and Task-Related Annotation
Another way to achieve T-non-theoretical grounding is extrinsic evaluation of NLP systems. This type of evaluation assesses “the effect of a system on something that is external to it, for example, the effect on human performance at a given task or the value added to an application” (Belz 2009) and has been demanded for at least 20 years (Spärck Jones 1994). Extrinsic evaluation is advertised as a remedy against “closed problem” approaches (Spärck Jones 1994) or against “closed circles” in intrinsic evaluation where system rankings produced by automatic measures are compared with human rankings which are themselves unfalsifiable (Belz 2009).
An example of an extrinsic evaluation in NLP is the evaluation of the effect of syntactic parsers on retrieval quality in a biomedical IR task (Miyao et al. 2008). Interestingly, the extrinsic set-up revealed a different system ranking than the standard intrinsic evaluation based on F-scores on the Penn WSJ corpus. Another example is the area of clustering. Deficiencies in current intrinsic clustering evaluation methods have led von Luxburg, Williamson, and Guyon (2012) to pose the question “Clustering: Science or Art?”. They recommend measuring the usefulness of a clustering method for the particular task under consideration, that is, always studying clustering in the context of its end use.
Extrinsic scenarios are not only useful for the purpose of evaluation. Rather, every extrinsic evaluation creates data that can be used as training data for another learning task (e.g., rankings of system outputs with respect to an extrinsic task can be used to train discriminative (re)ranking models). For example, Kim and Mooney (2013) use the successful completion of navigation tasks to create training data for reranking in grounded language learning. Nikoulina et al. (2012) use retrieval performance of translated queries to create data for reranking in statistical machine translation. Clarke et al. (2010) use the correct response for a query to a database of geographical facts to select data for structured learning of a semantic parser. Thus the extrinsic set-up can be seen as a general technique for T-non-theoretical grounding in training as well as in testing scenarios. Circularity issues will not arise in extrinsic set-ups because the extrinsic task is by definition external to the system outputs to be tested or ranked.
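The general pattern can be sketched as follows (our own illustration, not the set-up of any of the cited systems; task_success stands in for whatever extrinsic signal is available, such as successful task completion or retrieval performance): an extrinsic score over the k-best outputs for an input is converted into pairwise preferences on which a discriminative reranker can be trained.

```python
def pairwise_preferences(kbest, task_success):
    """kbest: candidate system outputs for one input.
    task_success: function mapping a candidate to an extrinsic score.
    Returns (preferred, dispreferred) pairs as reranker training data."""
    scored = [(task_success(c), c) for c in kbest]
    return [(c_hi, c_lo)
            for s_hi, c_hi in scored
            for s_lo, c_lo in scored
            if s_hi > s_lo]

# Toy usage with a hypothetical extrinsic success signal for navigation instructions.
kbest = ["go two blocks north", "go north two blocks", "turn around"]
success = {"go two blocks north": 1.0, "go north two blocks": 1.0, "turn around": 0.0}
for preferred, dispreferred in pairwise_preferences(kbest, success.get):
    print(preferred, ">", dispreferred)
```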
4.3 Grounded Data in the Wild
Halevy, Norvig, and Pereira (2009, page 8) mention statistical speech recognition and statistical machine translation as “the biggest successes in natural-language-related machine learning.” This success is due to the fact that “a large training set of the input–output behavior that we seek to automate is available to us in the wild.” Although they emphasize the large size of the training set, we think that the fact that the training data arise from a “natural task routinely done every day for a real human need” (Halevy, Norvig, and Pereira 2009) is just as important as their size. This is because a real-world task that is extrinsic and independent of any scientific theory avoids any methodological circularity in data annotation and enforces an application-based evaluation.
Speech and translation are not the only fortunate areas where data are available in the wild. Other data sets that have been “found” by NLP researchers include IMDb movie reviews (exploited for sentiment analysis by Pang, Lee, and Vaithyanathan [2002]), Amazon product reviews (used for multi-domain sentiment analysis by Blitzer, Dredze, and Pereira [2007]), Yahoo! Answers (used for answer ranking by Surdeanu, Ciaramita, and Zaragoza [2008]), reading comprehension tests (used for automated reading comprehension by Hirschman et al. [1999]), and Wikipedia (with too many uses to cite). Most of these data were created by community-based efforts, which means that the data sets are freely available and keep growing naturally.
The extrinsic and independent aspect of data in the wild can also be created in crowdsourcing approaches that enforce a distinction between data annotation tasks and scientific modeling. For example, Denkowski, Al-Haj, and Lavie (2010) used Amazon's Mechanical Turk to create reference translations for statistical machine translation by monolingual phrase substitutions on existing references. “Translations” created by workers who paraphrase given references without knowing the source can never lead to the circularity that data annotation by experts is susceptible to. In a scenario of monolingual paraphrasing for reference translations, even inter-annotator agreement is no longer an issue. Data created by single annotators (e.g., monolingual meaning equivalents created for bilingual purposes [Dreyer and Marcu 2012]) can be treated as “given” data for machine learning purposes, even if each network of meaning equivalences is created by a single annotator.
5. Conclusion
In this article, we have argued that the problem of theoretical terms as identified for theoretical physics can occur in empirical CL in cases where data are not “given” as commonly assumed in machine learning. We illustrated this problem with the example of manual data annotation by experts, where the task of relating instances to labels in manual data annotation and the task of relating instances to labels via modeling feature functions are intertwined. Inspired by the structuralist theory of science, we have defined a criterion of T-non-theoretical grounding and exemplified how this criterion can be met by manual data annotation using naive coders, by embedding data annotation into a task extrinsic to the theory to be tested, or by using independently created language data that are available in the wild.
Our suggestions for T-non-theoretical grounding are related to work on grounded language learning that is based on weak supervision in the form of sentences used in naturally occurring contexts. For example, the meaning of natural language expressions can be grounded in visual scenes (Roy 2002; Yu and Ballard 2004; Yu and Siskind 2013) or in actions in games or navigation tasks (Chen and Mooney 2008, 2011). Because of the ambiguous supervision, most such approaches work with latent representations and use unsupervised techniques in learning. Our suggestions for T-non-theoretical grounding can be used to avoid circularities in standard supervised learning. We think that this criterion should be considered a necessary condition for an empirical science, in addition to ensuring reliability of measurements. Our neglect of related issues such as validity of measurements (see Krippendorff 1980b) shows that there is a vast methodological area to be explored, perhaps with further opportunity for guidance by philosophy of science.
Acknowledgments
We are grateful for feedback on earlier versions of this work from Sebastian Padó, Artem Sokolov, and Katharina Wäschle. Furthermore, we would like to thank Paola Merlo for her suggestions and encouragement.
Notes
For the miniature theory AS, this is done by first stripping out statements (4)–(6) containing theoretical terms, yielding a partial potential model. Second, statements (4) and (5) are replaced by a so-called theoretical extension that existentially quantifies over measuring procedures for terms like g. The resulting Ramsey claim applies a theoretical extension to a partial potential model that also satisfies condition (6). Because such a statement does not contain theoretical terms, we can make empirical statements about entities being models of the theory AS.
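Under our reconstruction of the axioms in Section 2 (our notation, not Sneed's original formulation), the Ramsey claim for a partial potential model 〈A, d〉 can be sketched as the existentially quantified statement ∃g [ g : A → ℝ ∧ ∀a ∈ A : g(a) > 0 ∧ Σa∈A d(a) · g(a) = 0 ], which asserts that some positive weight function exists under which the balance is in equilibrium, without committing to any particular AS-theoretical measuring procedure for g.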
Critics of the structuralist theory of science have remarked that both solutions are instances of a more general problem, the so-called Duhem-Quine problem, so the focus of the structuralist program on solution (2) seems to be an exaggeration of the actual problem (von Kutschera 1982; Gadenne 1985). The Duhem-Quine thesis states that theoretical assumptions cannot be tested in isolation; rather, whole systems of theoretical and auxiliary assumptions are subjected to empirical testing. That is, if our predictions are not in accordance with our theory, we can only conclude that one of our many theoretical assumptions must be wrong, but we cannot know which one, and we can always modify our system of assumptions, leading to various ways of immunizing theories against refutation (Stegmüller 1986). This problem arises in solution (1) as well as in solution (2).
In this article, we concentrate on supervised machine learning. Semisupervised, transductive, active, or unsupervised learning deal with learning from incomplete or missing labelings, where the general assumption of i.i.d. data is not questioned. See Dundar et al. (2007) for an approach to machine learning from non-i.i.d. data.
Note that our criterion of T-non-theoretical grounding is related to the more specific concept of operationalization in social sciences (Friedrichs 1973). Operationalization refers to the process of developing indicators of the form “X is an a if Y is a b (at time t)” to connect T-theoretical and T-non-theoretical levels. We will stick with the more general criterion in the rest of this article.
See Fort, Adda, and Cohen (2011) for a discussion of the ethical dimensions of crowdsourcing services and their alternatives.
References
Author notes
Department of Computational Linguistics, Heidelberg University, Im Neuenheimer Feld 325, 69120 Heidelberg, Germany. E-mail: [email protected].