Abstract
In this article, we present a novel approach for parsing argumentation structures. We identify argument components using sequence labeling at the token level and apply a new joint model for detecting argumentation structures. The proposed model globally optimizes argument component types and argumentative relations using Integer Linear Programming. We show that our model significantly outperforms challenging heuristic baselines on two different types of discourse. Moreover, we introduce a novel corpus of persuasive essays annotated with argumentation structures. We show that our annotation scheme and annotation guidelines successfully guide human annotators to substantial agreement.
1. Introduction
Argumentation aims at increasing or decreasing the acceptability of a controversial standpoint (van Eemeren, Grootendorst, and Snoeck Henkemans, 1996, page 5). It is a routine that is omnipresent in our daily verbal communication and thinking. Well-reasoned arguments are not only important for decision making and learning but also play a crucial role in drawing widely accepted conclusions.
Computational argumentation is a recent research field in computational linguistics that focuses on the analysis of arguments in natural language texts. Novel methods have broad application potential in various areas such as legal decision support (Mochales-Palau and Moens 2009), information retrieval (Carstens and Toni 2015), policy making (Sardianos et al. 2015), and debating technologies (Levy et al. 2014; Rinott et al. 2015). Recently, computational argumentation has been receiving increased attention in computer-assisted writing (Song et al. 2014; Stab et al. 2014) because it allows the creation of writing support systems that provide feedback about written arguments.
Argumentation structures are closely related to discourse structures such as those defined by Rhetorical Structure Theory (RST) (Mann and Thompson 1987), the Penn Discourse Treebank (PDTB) (Prasad et al. 2008), or Segmented Discourse Representation Theory (SDRT) (Asher and Lascarides 2003). The internal structure of an argument consists of several argument components. It includes a claim and one or more premises (Govier 2010). The claim is a controversial statement and the central component of an argument, and premises are reasons for justifying (or refuting) the claim. Moreover, arguments have directed argumentative relations, describing the relationships one component has with another. Each such relation indicates that the source component is either a justification for or a refutation of the target component.
The identification of argumentation structures involves several subtasks like separating argumentative from non-argumentative text units (Moens et al. 2007; Florou et al. 2013), classifying argument components into claims and premises (Mochales-Palau and Moens 2011; Rooney, Wang, and Browne 2012; Stab and Gurevych 2014b), and identifying argumentative relations (Mochales-Palau and Moens 2009; Peldszus 2014; Stab and Gurevych 2014b). However, an approach that covers all subtasks is still missing. Furthermore, most approaches operate locally and do not optimize the global argumentation structure. Recently, Peldszus and Stede (2015) proposed an approach based on Minimum Spanning Trees, which jointly models argumentation structures. However, it links all argument components in a single tree structure. Consequently, it is not capable of splitting a text containing more than one argument. In addition to the lack of end-to-end approaches for parsing argumentation structures, there are relatively few corpora annotated with argumentation structures at the discourse-level. Apart from our previous corpus (Stab and Gurevych 2014a), the few existing corpora lack non-argumentative text units (Peldszus 2014), are not annotated with claims and premises (Kirschner, Eckle-Kohler, and Gurevych 2015), or the reliability is unknown (Reed et al. 2008).
Our primary motivation for this work is to create argument analysis methods for argumentative writing support systems and to achieve a better understanding of argumentation structures. Therefore, our first research question is whether human annotators can reliably identify argumentation structures in persuasive essays and whether it is possible to create annotated data of high quality. The second research question addresses the automatic recognition of argumentation structure. We investigate if, and how accurately, argumentation structures can be identified by computational techniques. The contributions of this article are the following:
- An annotation scheme for modeling argumentation structures derived from argumentation theory. Our annotation scheme models the argumentation structure of a document as a connected tree.
- A novel corpus of 402 persuasive essays annotated with discourse-level argumentation structures. We show that human annotators can apply our annotation scheme to persuasive essays with substantial agreement. This corpus and the annotation guidelines are freely available.1
- An end-to-end argumentation structure parser that identifies argument components at the token level and globally optimizes component types and argumentative relations.
The remainder of this article is structured as follows: In Section 2, we review related work in computational argumentation and discuss the difference to traditional discourse analysis. In Section 3, we derive our annotation scheme from argumentation theory. Section 4 presents the results of an annotation study and the corpus creation. In Section 5, we introduce the argumentation structure parser. We show that our model significantly outperforms challenging heuristic baselines on two different types of discourse. We discuss our results in Section 6, and provide our conclusions in Section 7.
2. Related Work
Existing work in computational argumentation addresses a variety of different tasks. These include, for example, approaches for identifying reasoning type (Feng and Hirst 2011), argumentation style (Oraby et al. 2015), the stance of the author (Hasan and Ng 2014; Somasundaran and Wiebe 2009), the acceptability of arguments (Cabrio and Villata 2012), and appropriate support types (Park and Cardie 2014). Most relevant to our work, however, are approaches on argument mining that focus on the identification of argumentation structures in natural language texts. We categorize related approaches into the following three subtasks:
- Component identification focuses on the separation of argumentative from non-argumentative text units and the identification of argument component boundaries.
- Component classification addresses the function of argument components. It aims at classifying argument components into different types such as claims and premises.
- Structure identification focuses on linking arguments or argument components. Its objective is to recognize different types of argumentative relations such as support or attack relations.
2.1 Component Identification
Moens et al. (2007) identified argumentative sentences in various types of text such as newspapers, parliamentary records, and online discussions. They experimented with various different features and achieved an accuracy of 0.738 with word pairs, text statistics, verbs, and keyword features. Florou et al. (2013) classified text segments as argumentative or non-argumentative using discourse markers and several features extracted from the tense and mood of verbs. They report an F1 score of 0.764. Levy et al. (2014) proposed a pipeline including three consecutive steps for identifying context-dependent claims in Wikipedia articles. Their first component detects topic-relevant sentences including a claim. The second component detects the boundaries of each claim. The third component ranks the identified claims for identifying the most relevant claims for the given topic. They report a mean precision of 0.09 and a mean recall of 0.73 averaged over 32 topics for retrieving 200 claims. Goudas et al. (2014) presented a two-step approach for identifying argument components and their boundaries in social media texts. First, they classified each sentence as argumentative or non-argumentative and achieved 0.774 accuracy. Second, they segmented each argumentative sentence using a Conditional Random Field (CRF). Their best model achieved 0.424 accuracy.
2.2 Component Classification
The objective of the component classification task is to identify the type of argument components. Kwon et al. (2007) proposed two consecutive steps for identifying different types of claims in online comments. First, they classified sentences as claims and obtained an F1 score of 0.55 with a boosting algorithm. Second, they classified each claim as either support, oppose, or propose. Their best model achieved an F1 score of 0.67. Rooney, Wang, and Browne (2012) applied kernel methods for classifying text units as either claims, premises, or non-argumentative. They obtained an accuracy of 0.65. Mochales-Palau and Moens (2011) classified sentences in legal decisions as claim or premise. They achieved an F1 score of 0.741 for claims and 0.681 for premises using a Support Vector Machine (SVM) with domain-dependent key phrases, text statistics, verbs, and the tense of the sentence. In our previous work, we used a multiclass SVM for labeling text units of student essays as major claim, claim, premise, or non-argumentative (Stab and Gurevych 2014b). We obtained an F1 score of 0.726 using structural, lexical, syntactic, indicator, and contextual features. Recently, Nguyen and Litman (2015) found that argument and domain words from unlabeled data increase F1 score to 0.76 in the same experimental setup, and Lippi and Torroni (2015) achieved an F1 score of 0.714 for identifying sentences containing a claim in student essays using partial tree kernels.
2.3 Structure Identification
Approaches on structure identification can be divided into macro-level approaches and micro-level approaches. Macro-level approaches such as presented by Cabrio and Villata (2012), Ghosh et al. (2014), or Boltužić and Šnajder (2014) address relations between complete arguments and ignore the microstructure of arguments. More relevant to our work, however, are micro-level approaches, which focus on relations between argument components. Mochales-Palau and Moens (2009) introduced one of the first approaches for identifying the microstructure of arguments. Their approach is based on a manually created Context-Free Grammar and recognizes argument structures as trees. However, it is tailored to legal argumentation and does not recognize implicit argumentative relations (i.e., relations that are not indicated by discourse markers). In previous work, we considered the identification of argument structures as a binary classification task of ordered argument component pairs (Stab and Gurevych 2014b). We classified each pair as support or not-linked using an SVM with structural, lexical, syntactic, and indicator features. Our best model achieved an F1 score of 0.722. However, the approach recognizes argumentative relations locally and does not consider contextual information. Peldszus (2014) modeled the targets of argumentative relations along with additional information in a single tagset. His tagset includes, for instance, several labels denoting whether an argument component at position n is argumentatively related to preceding argument components n − 1, n − 2, and so forth, or following argument components n + 1, n + 2, and so on. Although his approach achieved a promising accuracy of 0.48, it is only applicable to short texts. Peldszus and Stede (2015) presented the first approach that globally optimizes argumentative relations. They jointly modeled several aspects of argumentation structures using a Minimum Spanning Tree model and achieved an F1 score of 0.720. They found that the function (support or attack) and the role (opponent and proponent) of argument components are the most useful dimensions for improving the identification of argumentative relations. However, the texts in their corpus were created artificially using a guideline that promotes having one opposing argument component in each text (cf. Section 2.4). Therefore, it is unclear whether the results can be reproduced with real data, which may exhibit arguments with fewer opposing argument components (Wolfe and Britt 2009). Moreover, their approach links all argument components in a single tree structure. Thus, it is not capable of separating several arguments and recognizing unlinked components.
2.4 Existing Corpora Annotated with Argumentation Structures
Existing corpora in computational argumentation cover numerous aspects of argumentation analysis. There are, for instance, corpora that address argumentation strength (Persing and Ng 2015), factual knowledge (Beigman Klebanov and Higgins 2012), various properties of arguments (Walker et al. 2012), argumentative relations between complete arguments at the macro-level (Cabrio and Villata 2014; Boltužić and Šnajder 2014), different types of argument components (Mochales-Palau and Ieven 2009; Kwon et al. 2007; Habernal and Gurevych 2017), and argumentation structures over several documents (Aharoni et al. 2014). However, corpora annotated with argumentation structures at the level of discourse are still rare.
One prominent resource is AraucariaDB (Reed et al. 2008). It includes heterogenous text types such as newspaper editorials, parliamentary records, judicial summaries, and online discussions. It also includes annotations describing the type of reasoning according to Walton's argumentation schemes (Walton, Reed, and Macagno 2008) and implicit argument components that were added by the annotators during the analysis. However, the reliability of the annotations is unknown. Furthermore, recent releases of AraucariaDB are not appropriate for training end-to-end argumentation structure parsers because they do not include non-argumentative text units.
Kirschner, Eckle-Kohler, and Gurevych (2015) annotated argumentation structures in Introduction and Discussion sections of 24 German scientific articles. Their annotation scheme includes four argumentative relations (support, attack, detail, and sequence). However, the corpus does not contain annotations for argument component types.
Peldszus and Stede (2015) created a small corpus of 112 German microtexts with controlled linguistic and rhetorical complexity. Each document contains a single argument and does not include more than five argument components. Their annotation scheme models supporting and attacking relations as well as additional information like proponent and opponent. They obtained an inter-annotator agreement (IAA) of κ = 0.832 with three expert annotators. Recently, they translated the corpus into English, resulting in the first parallel corpus for computational argumentation. However, the corpus does not include non-argumentative text units. Therefore, the corpus is only of limited use for training end-to-end argumentation structure parsers. Because of the writing guidelines used (Peldszus and Stede, 2013, page 197), it also exhibits an unusually high proportion of attack relations. In particular, 97 of the 112 arguments (86.6%) include at least one attack relation. This proportion is rather unnatural, since authors tend to support their standpoint instead of considering opposing views (Wolfe and Britt 2009).
In previous work, we created a corpus of 90 persuasive essays, which we selected randomly from essayforum.com (Stab and Gurevych 2014a). We annotated the corpus in two consecutive steps: First, we identified argument components at the clause level and obtained an agreement of αU = 0.72 between three annotators. Second, we annotated argumentative support and attack relations between argument components and achieved an agreement of κ = 0.8. Because the corpus also includes non-argumentative text units, it allows for training end-to-end argumentation structure parsers that separate argumentative from non-argumentative text units. Apart from this corpus, we are only aware of one additional study on argumentation structures in persuasive essays. Botley (2014) analyzed 10 essays using argument diagramming for studying differences in argumentation strategies. Unfortunately, the corpus is too small for computational purposes and the reliability of the annotations is unknown. Table 1 provides an overview of existing corpora annotated with argumentation structures at the discourse-level.
Existing corpora annotated with argumentation structures at the discourse-level (#Doc = number of documents; #Comp = number of argument components; NoArg = presence of non-argumentative text units).
| Source | Genre | #Doc | #Comp | NoArg | Granularity | IAA |
|---|---|---|---|---|---|---|
| Reed et al. (2008) | various | ∼700 | ∼2,000 | yes | clause | unknown |
| Stab and Gurevych (2014a) | student essays | 90 | 1,552 | yes | clause | αU = 0.72 |
| Peldszus and Stede (2015) | microtexts | 112 | 576 | no | clause | κ = 0.83 |
| Kirschner et al. (2015) | scientific articles | 24 | ∼2,700 | yes | sentence | κ = 0.43 |
2.5 Discourse Analysis
The identification of argumentation structures is closely related to discourse analysis. Similar to the identification of argumentation structures, discourse analysis aims at identifying elementary discourse units and discourse relations between them. Existing approaches on discourse analysis mainly differ in the discourse theory utilized. RST (Mann and Thompson 1987), for instance, models discourse structures as trees by iteratively linking adjacent discourse units (Feng and Hirst 2014; Hernault et al. 2010) whereas approaches based on PDTB (Prasad et al. 2008) identify more shallow structures by linking two adjacent sentences or clauses (Lin, Ng, and Kan 2014). RST and PDTB are limited to discourse relations between adjacent discourse units, but SDRT (Asher and Lascarides 2003) also allows long distance relations (Afantenos and Asher 2014; Afantenos et al. 2015). However, similar to argumentation structure parsing, the main challenge of discourse analysis is to identify implicit discourse relations (Braud and Denis, 2014, page 1694).
Marcu and Echihabi (2002) proposed one of the first approaches for identifying implicit discourse relations. In order to collect large amounts of training data, they exploited several discourse markers like “because” or “but”. After removing the discourse markers, they found that word pair features are useful for identifying implicit discourse relations. Pitler, Louis, and Nenkova (2009) proposed an approach for identifying four implicit types of discourse relations in the PDTB and achieved F1 scores between 0.22 and 0.76. They found that using features tailored to each individual relation leads to the best results. Lin, Kan, and Ng (2009) showed that production rules collected from parse trees yield good results and Louis et al. (2010) found that features based on named entities do not perform as well as lexical features.
Approaches to discourse analysis usually aim at identifying various different types of discourse relations. However, only a subset of these relations is relevant for argumentation structure parsing. For example, Peldszus and Stede (2013) proposed support, attack, and counter-attack relations for modeling argumentation structures, whereas our work focuses on support and attack relations. This difference is also illustrated by the work of Biran and Rambow (2011). They selected a subset of 12 relations from the RST Discourse Treebank (Carlson, Marcu, and Okurowski 2001) and argue that only a subset of RST relations is relevant for identifying justifications.
3. Argumentation: Theoretical Background
The study of argumentation is a comprehensive and interdisciplinary research field. It involves philosophy, communication science, logic, linguistics, psychology, and computer science. The first approaches to studying argumentation date back to the ancient Greek sophists and evolved in the 6th and 5th centuries BCE (van Eemeren, Grootendorst, and Snoeck Henkemans 1996). In particular, the influential works of Aristotle on traditional logic, rhetoric, and dialectics set an important milestone and are a cornerstone of modern argumentation theory. Because of the diversity of the field, there are numerous proposals for modeling argumentation. Bentahar, Moulin, and Bélanger (2010) categorize argumentation models into three types: (1) monological models, (2) dialogical models, and (3) rhetorical models. Monological models address the internal microstructure of arguments. They focus on the function of argument components, the links between them, and the reasoning type. Most monological models stem from the field of informal logic and focus on arguments as product (O'Keefe 1977; Johnson 2000). On the other hand, dialogical models focus on the process of argumentation and ignore the microstructure of arguments. They model the external macrostructure and address relations between arguments from several interlocutors. Finally, rhetorical models consider neither the micro- nor the macrostructure but rather the way arguments are used as a means of persuasion. They consider the audience's perception and aim at studying rhetorical schemes that are successful in practice. In this article, we focus on the monological perspective, which is well-suited for developing computational methods (Peldszus and Stede 2013).
3.1 Argument Diagramming
The laying out of argument structure is a widely used method in informal logic (Copi and Cohen 1990; Govier 2010). This technique, referred to as argument diagramming, aims at transferring natural language arguments into a structured representation for evaluating them in subsequent analysis steps (Henkemans, 2000, page 447). Although argumentation theorists consider argument diagramming a manual activity, the diagramming conventions also serve as a good foundation for developing novel argument mining models (Peldszus and Stede 2013). An argument diagram is a node-link diagram whereby each node represents an argument component (i.e., a statement represented in natural language) and each link represents a directed argumentative relation indicating that the source component is a justification (or refutation) of the target component. Figure 1 shows some common argument structures. A basic argument includes a claim supported by a single premise. It can be considered the minimal form that an argument can take. A convergent argument comprises two premises that support the claim individually; an argument is serial if it includes a reasoning chain and divergent if a premise supports several claims (Beardsley 1950). Complementarily, Thomas (1973) defined linked arguments (Figure 1e). Like convergent arguments, a linked argument includes two premises. However, neither of the two premises independently supports the claim. The premises are only relevant to the claim in conjunction. More complex arguments can combine any of the elementary structures illustrated in Figure 1.
Microstructures of arguments: Nodes are argument components and links represent argumentative relations. Nodes at the bottom are the claims of the arguments.
On closer inspection, however, there are several ambiguities when applying argument diagramming to real texts: First, the distinction between convergent and linked structures is often ambiguous in real argumentation structures (Henkemans 2000; Freeman 2011). Second, it is unclear if the argumentation structure is a graph or a tree. Third, the argumentative type of argument components is ambiguous in serial structures. We discuss each of these questions in the following sections.
3.1.1 Distinguishing between Linked and Convergent Arguments
The question of whether an argumentation model needs to distinguish between linked and convergent arguments is still debated in argumentation theory (Conway 1991; Yanal 1991; van Eemeren, Grootendorst, and Snoeck Henkemans 1996; Freeman 2011). From a perspective based on traditional logic, linked arguments indicate deductive reasoning and convergent arguments represent inductive reasoning (Henkemans, 2000, page 453). However, Freeman (2011, page 91ff.) showed that the traditional definition of linked arguments is frequently ambiguous in everyday discourse. Yanal (1991) argues that the distinction is equivalent to separating several arguments and Conway (1991) argues that linked structures can simply be omitted for modeling single arguments. From a computational perspective, the identification of linked arguments is equivalent to finding groups of premises or classifying the reasoning type of an argument as either deductive or inductive. Accordingly, it is not necessary to distinguish linked and convergent arguments during the identification of argumentation structures since this task can be solved in subsequent analysis steps.
3.1.2 Argumentation Structures as Trees
Defining argumentation structures as trees implies excluding divergent arguments, allowing only one target for each premise, and ruling out cycles. From a theoretical perspective, divergent structures are equivalent to several arguments (one for each claim) (Freeman, 2011, page 16). As a result of this treatment, many theoretical textbooks neglect divergent structures (Henkemans 2000; Reed and Rowe 2004), and most computational approaches also consider arguments as trees (Cohen 1987; Mochales-Palau and Moens 2009; Peldszus 2014). However, there is little empirical evidence regarding the structure of arguments. We are only aware of one study, which showed that 5.26% of the arguments in political speeches (which can be assumed to exhibit complex argumentation structures) are divergent.
Essay writing usually follows a claim-oriented procedure (Kemper and Sebranek 2004; Shiach 2009; Whitaker 2009; Perutz 2010). Starting with the formulation of the standpoint on the topic, authors collect claims in support (or opposition) of their view. Subsequently, they collect premises that support or attack their claims. The following example illustrates this procedure. A major claim on abortion, for instance, is “abortion should be illegal”; a supporting claim could be “abortion is ethically wrong” and the associated premises “unborn babies are considered human beings” and “killing human beings is wrong”. Because of this common writing procedure, divergent and circular structures are rather unlikely in persuasive essays. Therefore, we assume that modeling the argumentation structure of essays as a tree is a reasonable decision.
3.1.3 Argumentation Structures and Argument Component Types
Assigning argumentative types to the components of an argument is unambiguous if the argumentation structure is shallow. It is, for instance, obvious that an argument component c1 is a premise and argument component c2 is a claim, if c1 supports c2 in a basic argument (cf. Figure 1). However, if the tree structure is deeper (i.e., exhibits serial structures), assigning argumentative types becomes ambiguous. Essentially, there are three different approaches for assigning argumentative types to argument components. First, according to Beardsley (1950) a serial argument includes one argument component which is both a claim and a premise. Therefore, the inner argument component bears two different argumentative types (multi-label approach). Second, Govier (2010, page 24) distinguishes between “main claim” and “subclaim”. Similarly, Damer (2009, page 17) distinguishes between “premise” and “subpremise” for labeling argument components in serial structures. Both approaches define specific labels for each level in the argumentation structure (level approach). Third, Cohen (1987) considers only the root node of an argumentation tree as a claim and the following nodes in the structure as premises (one-claim approach). In order to define an argumentation model for persuasive essays, we propose a hybrid approach that combines the level approach and the one-claim approach.
3.2 Argumentation Structures in Persuasive Essays
We model the argumentation structure of persuasive essays as a connected tree structure. We use a level approach for modeling the first level of the tree and a one-claim approach for representing the structure of each individual argument. Accordingly, we model the first level of the tree with two different argument component types and the structure of individual arguments with argumentative relations.
The major claim is the root node of the argumentation structure and represents the author's standpoint on the topic. It is an opinionated statement that is usually stated in the introduction and restated in the conclusion of the essay. The individual body paragraphs of an essay include the actual arguments. They either support or attack the author's standpoint expressed in the major claim. Each argument consists of a claim and at least one premise. In order to differentiate between supporting and attacking arguments, each claim has a stance attribute that can take the values “for” or “against”.
We model the structure of each argument with a one-claim approach. The claim constitutes the central component of each argument. The premises are the reasons of the argument. The actual structure of an argument comprises directed argumentative support and attack relations, which link a premise either to a claim or to another premise (serial arguments). Each premise p has one outgoing relation (i.e., there is a relation that has p as source component) and none or several incoming relations (i.e., there can be a relation with p as target component). A claim can exhibit several incoming relations but no outgoing relation. The ambiguous function of inner premises in serial arguments is implicitly modeled by the structure of the argument. The inner premise exhibits one outgoing relation and at least one incoming relation. Finally, the stance of each premise is indicated by the type of its outgoing relation (support or attack).
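The constraints above can be made concrete with a small data model. The following Python sketch is purely illustrative (class and attribute names are ours, not a corpus format): claims carry a stance attribute, premises are linked by directed support or attack relations, and every premise has exactly one outgoing relation.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ArgumentComponent:
    text: str
    ctype: str                    # "MajorClaim", "Claim", or "Premise"
    stance: Optional[str] = None  # "for"/"against" for claims, None otherwise

@dataclass
class ArgumentativeRelation:
    source: ArgumentComponent     # always a premise
    target: ArgumentComponent     # a claim or another premise
    rtype: str                    # "support" or "attack"

@dataclass
class EssayAnnotation:
    major_claims: List[ArgumentComponent] = field(default_factory=list)
    claims: List[ArgumentComponent] = field(default_factory=list)
    premises: List[ArgumentComponent] = field(default_factory=list)
    relations: List[ArgumentativeRelation] = field(default_factory=list)

    def is_valid_tree(self) -> bool:
        """Check the structural constraints described above: every premise has
        exactly one outgoing relation and (major) claims have none."""
        outgoing = {}
        for rel in self.relations:
            outgoing[id(rel.source)] = outgoing.get(id(rel.source), 0) + 1
        if any(id(c) in outgoing for c in self.claims + self.major_claims):
            return False
        return all(outgoing.get(id(p), 0) == 1 for p in self.premises)
```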
The following example illustrates the argumentation structure of a persuasive essay.3 The introduction of an essay describes the controversial topic and usually includes the major claim:
Ever since researchers at the Roslin Institute in Edinburgh cloned an adult sheep, there has been an ongoing debate about whether cloning technology is morally and ethically right or not. Some people argue for and others against and there is still no agreement whether cloning technology should be permitted. However, as far as I'm concerned, [cloning is an important technology for humankind]MajorClaim1 since [it would be very useful for developing novel cures]Claim1.
The first two sentences introduce the topic and do not include argumentative content. The third sentence contains the major claim (boldfaced) and a claim that supports the major claim (underlined). The following body paragraphs of the essay include arguments that either support or attack the major claim. For example, the following body paragraph includes one argument that supports the positive standpoint of the author on cloning:
To sum up, although [permitting cloning might bear some risks like misuse for military purposes]Claim6, I strongly believe that [this technology is beneficial to humanity]MajorClaim2. It is likely that [this technology bears some important cures which will significantly improve life conditions]Claim7.
Argumentation structure of the example essay. Arrows indicate argumentative relations. Arrowheads denote argumentative support relations and circleheads attack relations. Dashed lines indicate relations that are encoded in the stance attributes of claims. "P" denotes premises.
4. Corpus Creation
The motivation for creating a new corpus is threefold: First, our previous corpus is relatively small. We believe that more data will improve the accuracy of our computational models. Second, we wanted to ensure the reproducibility of the annotation study and validate our previous results. Third, we improved our annotation guidelines. We added more precise rules for segmenting argument components and a detailed description of common essay structures. We expect that our novel annotation guidelines will guide annotators towards adequate agreement without collaborative training sessions. Our annotation guidelines comprise 31 pages and include the following three steps:
1. Topic and stance identification: We found in our previous annotation study that knowing the topic and stance of an essay improves inter-annotator agreement (Stab and Gurevych 2014a). For this reason, we ask the annotators to read the entire essay before starting with the annotation task.
2. Annotation of argument components: Annotators mark major claims, claims, and premises. They annotate the boundaries of argument components and determine the stance attribute of claims.
3. Linking premises with argumentative relations: The annotators identify the structure of arguments by linking each premise to a claim or another premise with argumentative support or attack relations.
4.1 Data
We randomly selected 402 English essays with a description of the writing prompt from essayforum.com. This online forum is an active community that provides correction and feedback about different texts such as research papers, essays, or poetry. For example, students post their essays in order to receive feedback about their writing skills while preparing for standardized language tests. The corpus includes 7,116 sentences with 147,271 tokens.
4.2 Inter-Annotator Agreement
All three annotators independently annotated a random subset of 80 essays. The remaining 322 essays were annotated by the expert annotator. We evaluate the inter-annotator agreement of the argument component annotations using two different strategies: First, we evaluate if the annotators agree on the presence of argument components in sentences using observed agreement and Fleiss' κ (Fleiss 1971). We consider each sentence as a markable and evaluate the presence of each argument component type t ∈{MajorClaim,Claim,Premise} in a sentence individually. Accordingly, the number of markables for each argument component type t corresponds to the number of sentences N = 1,441, the number of annotations per markable equals the number of annotators (n = 3), and the number of categories is k = 2 (t or not t). Evaluating the agreement at the sentence level is an approximation of the actual agreement since the boundaries of argument components can differ from sentence boundaries and a sentence can include several argument components.5 Therefore, for the second evaluation strategy, we use Krippendorff's αU (Krippendorff 2004). In contrast to common alpha coefficients, this coefficient allows us to evaluate the agreement of unitizing tasks by comparing the boundaries of the annotation units. We use the squared difference δ2 between any two annotators' sections as proposed by Krippendorff (2004, page 9) and consider each essay as a single continuum at the token level. Accordingly, the length L of each continuum is the number of tokens in an essay. The number of annotators m that unitize the continuum is 3. We report the average αU scores over 80 essays. For determining the inter-annotator agreement, we use DKPro Agreement, whose implementations of inter-annotator agreement measures are well-tested with various examples from the literature (Meyer et al. 2014).
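As an illustration of the sentence-level agreement computation, the following sketch shows how Fleiss' κ could be obtained for one component type. It assumes a binary presence judgment per sentence and annotator and uses statsmodels; the data shown are made up, and the actual scores reported below were computed with DKPro Agreement as stated above.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Toy judgments: one row per sentence, one column per annotator,
# 1 if the annotator marked the component type (e.g. "Claim") as present.
labels = np.array([
    [1, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
    [1, 1, 0],
])

# Convert per-rater labels into an N x k table of category counts (k = 2).
table, _ = aggregate_raters(labels)
print("Fleiss' kappa:", fleiss_kappa(table, method="fleiss"))
```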
Table 2 shows the inter-annotator agreement of each argument component type. The agreement is best for major claims. The observed agreement of 97.9% and κ = 0.877 indicate that annotators are able to reliably identify major claims in persuasive essays. In addition, the unitized alpha measure of αU = 0.810 shows that there are only few disagreements about the boundaries of major claims. The results also indicate good agreement for premises (κ = 0.833 and αU = 0.824). We obtain the lowest agreement of κ = 0.635 for claims, which shows that the identification of claims is more complex than identifying major claims and premises. The joint unitized measure for all argument components is αU = 0.767, and thus the agreement improved by 0.043 compared with our previous study (Stab and Gurevych 2014b). Therefore, we tentatively conclude that overall, human annotators agree on the argument components in persuasive essays.
Inter-annotator agreement of argument components.
| Component type | Observed agreement | Fleiss' κ | αU |
|---|---|---|---|
| MajorClaim | 97.9% | 0.877 | 0.810 |
| Claim | 88.9% | 0.635 | 0.524 |
| Premise | 91.6% | 0.833 | 0.824 |
For determining the agreement of the stance attribute, we follow the same methodology as for the sentence-level agreement described above, but we consider each sentence containing a claim as “for” or “against” according to its stance attribute, and all sentences without a claim as “none” (N = 1,441; n = 3; k = 3). Consequently, the agreement of claims constitutes the upper bound for the stance attribute. We obtain an agreement of 88.5% and κ = 0.623, which is slightly below the agreement scores of claims (cf. Table 2). Therefore, human annotators can reliably differentiate between supporting and attacking claims.
We determined the markables for evaluating the agreement of argumentative relations by pairing all argument components in the same paragraph. For each paragraph with argument components c1, …, cn, we consider each pair p = (ci,cj) with 1 ≤ i,j ≤ n and i≠j as markable. Thus, the set of all markables corresponds to all argument component pairs that can be annotated according to our guidelines. The number of argument component pairs is N = 4,922, the number of ratings per markable is n = 3, and the number of categories k = 2.
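The following minimal sketch illustrates how these relation markables can be enumerated; the paragraph-to-component mapping is toy data.

```python
from itertools import permutations

# All ordered pairs (c_i, c_j), i != j, of argument components
# within the same paragraph (toy paragraph-to-component mapping).
paragraphs = {
    "para1": ["c1", "c2", "c3"],
    "para2": ["c4", "c5"],
}

markables = [
    pair
    for components in paragraphs.values()
    for pair in permutations(components, 2)
]
print(len(markables), "component pairs")   # 3*2 + 2*1 = 8 for the toy data
```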
Table 3 shows the inter-annotator agreement of argumentative relations. We obtain kappa scores above 0.7 for both argumentative support and attack relations, which allows tentative conclusions (Krippendorff 2004). On average, the annotators marked only 0.9% of the 4,922 pairs as argumentative attack relations and 18.4% as argumentative support relations. Although the agreement is usually much lower if a category is rare (Artstein and Poesio, 2008, page 573), the annotators agree more on argumentative attack relations. This indicates that the identification of argumentative attack relations is a simpler task than the identification of argumentative support relations. The agreement scores for argumentative relations are approximately 0.10 lower compared with our previous study. This difference can be attributed to the fact that we did not explicitly annotate relations between claims and major claims, which are easy to annotate because claims are always linked to major claims (cf. Section 3.2).
4.3 Analysis of Human Disagreement
For analyzing the disagreements between the annotators, we determined Confusion Probability Matrices (CPMs) (Cinková, Holub, and Kríž 2012). Compared with traditional confusion matrices, a CPM also allows us to analyze confusion if more than two annotators are involved in an annotation study. A CPM includes conditional probabilities that an annotator assigns a category in the column given that another annotator selected the category in the row. Table 4 shows the CPM of argument component annotations. It shows that the highest confusion is between claims and premises. We observed that one annotator frequently did not split sentences including a claim. For instance, the annotator labeled the entire sentence as a claim although it includes an additional premise. This type of error also explains the lower unitized alpha score compared with the sentence-level agreements in Table 2. Furthermore, we found that concessions before claims were frequently not annotated as an attacking premise. For example, annotators often did not split sentences similarly to the following example:
Confusion probability matrix of argument component annotations (“NoArg” indicates sentences without argumentative content).
| | MajorClaim | Claim | Premise | NoArg |
|---|---|---|---|---|
| MajorClaim | 0.771 | 0.077 | 0.010 | 0.142 |
| Claim | 0.036 | 0.517 | 0.307 | 0.141 |
| Premise | 0.002 | 0.131 | 0.841 | 0.026 |
| NoArg | 0.059 | 0.126 | 0.054 | 0.761 |
The distinction between major claims and claims exhibits less confusion. This may be because major claims are relatively easy to locate in essays: they usually occur in the introduction or conclusion, whereas claims can occur anywhere in the essay.
Table 5 shows the CPM of argumentative relations. There is little confusion between argumentative support and attack relations. The CPM also shows that the highest confusion is between argumentative relations (support and attack) and unlinked pairs. This can be attributed to the identification of the correct targets of premises. In particular, we observed that agreement on the targets decreases if a paragraph includes several claims or serial argument structures.
Confusion probability matrix of argumentative relation annotations (“Not-Linked” indicates argument component pairs that are not argumentatively related).
| | Support | Attack | Not-Linked |
|---|---|---|---|
| Support | 0.605 | 0.006 | 0.389 |
| Attack | 0.107 | 0.587 | 0.307 |
| Not-Linked | 0.086 | 0.004 | 0.910 |
4.4 Creation of the Final Corpus
We created a partial gold standard of the essays annotated by all annotators. We use this partial gold standard of 80 essays as our test data (20%) and the remaining 322 essays annotated by the expert annotator as our training data (80%). The creation of our gold standard test data consists of the following two steps: First, we merge the annotation of all argument components. Thus, each annotator annotates argumentative relations based on the same argument components. Second, we merge the argumentative relations to compile our final gold standard test data. Because the argument component types are strongly related—the selection of the premises, for instance, depends on the selected claim(s) in a paragraph—we did not merge the annotations using majority voting as in our previous study. Instead, we discussed the disagreements in several meetings with all annotators for resolving the disagreements.
4.5 Corpus Statistics
Table 6 gives an overview of the size of the corpus. It contains 6,089 argument components: 751 major claims, 1,506 claims, and 3,832 premises. Such a large proportion of premises compared with claims is common in argumentative texts because writers tend to provide several reasons for ensuring a robust standpoint (Mochales-Palau and Moens 2011).
Statistics of the final corpus.
| | all | avg. per essay | standard deviation |
|---|---|---|---|
| Sentences | 7,116 | 18 | 4.2 |
| Tokens | 147,271 | 366 | 62.9 |
| Paragraphs | 1,833 | 5 | 0.6 |
| Arg. components | 6,089 | 15 | 3.9 |
| MajorClaims | 751 | 2 | 0.5 |
| Claims | 1,506 | 4 | 1.2 |
| Premises | 3,832 | 10 | 3.4 |
| Claims (for) | 1,228 | 3 | 1.3 |
| Claims (against) | 278 | 1 | 0.8 |
| Support relations | 3,613 | 9 | 3.3 |
| Attack relations | 219 | 1 | 0.9 |
The proportion of non-argumentative text amounts to 47,474 tokens (32.2%) and 1,631 sentences (22.9%). The number of sentences with several argument components is 583, of which 302 include several components with different types (e.g., a claim followed by a premise). Therefore, the identification of argument components requires the separation of argumentative from non-argumentative text units and the recognition of component boundaries at the token level. The corpus contains 421 paragraphs (23%) with unlinked argument components (e.g., unsupported claims without incoming relations). Thus, methods that link all argument components in a paragraph are only of limited use for identifying the argumentation structures in our corpus.
In total, the corpus includes 1,130 arguments (i.e., claims supported by at least one premise). Only 140 of them have an attack relation. Thus, the proportion of arguments with attack relations is considerably lower than in the microtext corpus from Peldszus and Stede (2015). Most of the arguments are convergent—that is, the depth of the argument is 1. The number of arguments with serial structure is 236 (20.9%).
5. Parsing Argumentation Structure
Our approach for parsing argumentation structures consists of five consecutive subtasks, depicted in Figure 3. The identification model separates argumentative from non-argumentative text units and recognizes the boundaries of argument components. The next three models constitute a joint model for recognizing the argumentation structure. We train two base classifiers. The argument component classification model labels each argument component as major claim, claim, or premise, and the argumentative relation identification model recognizes if two argument components are argumentatively linked or not. The tree generation model globally optimizes the results of the two base classifiers for finding a tree (or several ones) in each paragraph. Finally, the stance recognition model differentiates between support and attack relations.
For preprocessing, we use several models from the DKPro Framework (Eckart de Castilho and Gurevych 2014). We identify tokens and sentence boundaries using the LanguageTool segmenter6 and identify paragraphs by checking for line breaks. We lemmatize each token using the Mate Tools lemmatizer (Bohnet et al. 2013) and apply the Stanford part-of-speech (POS) tagger (Toutanova et al. 2003), constituent and dependency parsers (Klein and Manning 2003), and sentiment analyzer (Socher et al. 2013). We use a discourse parser from Lin, Ng, and Kan (2014) for recognizing PDTB-style discourse relations. We use the DKPro TC text classification framework (Daxenberger et al. 2014) for feature extraction and experimentation.
In the following sections, we describe each model in detail. For finding the best-performing models, we conduct model selection on our training data using 5-fold cross-validation. Then, we conduct model assessment on our test data. We determine the evaluation scores of each cross-validation experiment by accumulating the confusion matrices of each fold into one confusion matrix, which has been shown to be the least biased method for evaluating cross-validation experiments (Forman and Scholz 2010). We use macro-averaging as described by Sokolova and Lapalme (2009) and report macro precision (P), macro recall (R), and macro F1 scores (F1). We use a two-sided Wilcoxon signed-rank test with p = 0.01 for significance testing. Because most evaluation measures for comparing system outputs are not normally distributed (Søgaard 2013), this non-parametric test is preferable to parametric tests, which make stronger assumptions about the underlying distribution of the random variables. We apply this test to all reported evaluation scores obtained for each of the 80 essays in our test set.
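The following sketch illustrates this evaluation protocol under simplifying assumptions (toy confusion matrices and invented per-essay scores): fold-wise confusion matrices are pooled before computing macro scores, and a two-sided Wilcoxon signed-rank test compares paired per-essay results.

```python
import numpy as np
from scipy.stats import wilcoxon

def macro_prf(confusion):
    """Macro precision/recall/F1 from a pooled confusion matrix
    (rows = gold classes, columns = predicted classes)."""
    tp = np.diag(confusion).astype(float)
    pred = confusion.sum(axis=0).astype(float)
    gold = confusion.sum(axis=1).astype(float)
    p = np.divide(tp, pred, out=np.zeros_like(tp), where=pred > 0)
    r = np.divide(tp, gold, out=np.zeros_like(tp), where=gold > 0)
    f = np.divide(2 * p * r, p + r, out=np.zeros_like(tp), where=(p + r) > 0)
    return p.mean(), r.mean(), f.mean()

# Accumulate the confusion matrices of all cross-validation folds (toy data).
rng = np.random.default_rng(0)
fold_matrices = [rng.integers(0, 20, size=(3, 3)) for _ in range(5)]
pooled = np.sum(fold_matrices, axis=0)
print("macro P/R/F1:", macro_prf(pooled))

# Two-sided Wilcoxon signed-rank test over paired per-essay scores (toy data).
scores_a = rng.random(80)
scores_b = scores_a + rng.normal(0.05, 0.02, size=80)
print(wilcoxon(scores_a, scores_b, alternative="two-sided"))
```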
The remainder of this section is structured as follows: In the following section, we introduce the baselines and the upper bound for each task. In Section 5.2, we present the identification model that detects argument components and their boundaries. In Section 5.3, we propose a new joint model for identifying argumentation structures. In Section 5.4, we introduce our stance recognition model. In Section 5.5, we report the results of the model assessment on our test data and on the microtext corpus from Peldszus and Stede (2015). We present the results of the error analysis in Section 5.6. We evaluate the identification model independently and use the gold standard argument components for evaluating the remaining models.
5.1 Baselines and Upper Bound
For evaluating our models, we use two different types of baselines: First, we use majority baselines that label each instance with the majority class. Table A.1 in Appendix A shows the class distribution in our training data and test data for each task.
Second, we use heuristic baselines, which are motivated by the common structure of persuasive essays (Whitaker 2009; Perutz 2010). The heuristic baseline of the identification task exploits sentence boundaries. It selects all sentences as argument components except the first two and the last sentence of an essay.7 The heuristic baseline of the classification task labels the first argument component in each body paragraph as claim, and all remaining components in body paragraphs as premise. The last argument component in the introduction and the first argument component in the conclusion are classified as major claim and all remaining argument components in the introduction and conclusion are labeled as claim. The heuristic baseline for the relation identification classifies an argument component pair as linked if the target is the first component of a body paragraph. We expect that this baseline will yield good results, because 62% of all body paragraphs in our corpus start with a claim. The heuristic baseline of the stance recognition classifies each argument component in the second to last paragraph as attack. The motivation for this baseline stems from essay writing guidelines, which recommend including opposing arguments in the second to last paragraph.
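For illustration, the identification and classification heuristics can be written down in a few lines; the sketch below uses a simplified data model (lists of sentences and of components per paragraph) that is our own assumption.

```python
def identification_baseline(sentences):
    """Select every sentence as an argument component except the first two
    and the last sentence of the essay."""
    return sentences[2:-1]

def classification_baseline(paragraphs):
    """Label components by position: the first component of each body
    paragraph is a claim and the rest are premises; the last component of the
    introduction and the first of the conclusion are major claims, the
    remaining ones in those paragraphs are claims."""
    intro, body, conclusion = paragraphs[0], paragraphs[1:-1], paragraphs[-1]
    labels = {}
    for paragraph in body:
        for i, comp in enumerate(paragraph):
            labels[comp] = "Claim" if i == 0 else "Premise"
    for i, comp in enumerate(intro):
        labels[comp] = "MajorClaim" if i == len(intro) - 1 else "Claim"
    for i, comp in enumerate(conclusion):
        labels[comp] = "MajorClaim" if i == 0 else "Claim"
    return labels

# Toy essay: one component id per entry, grouped into paragraphs.
essay_paragraphs = [["c1"], ["c2", "c3", "c4"], ["c5", "c6"], ["c7"]]
print(classification_baseline(essay_paragraphs))
```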
We determine the human upper bound for each task by averaging the evaluation scores of all three annotator pairs on our test data.
5.2 Identifying Argument Components
We consider the identification of argument components as a sequence labeling task at the token level. We encode the argument components using an IOB-tagset (Ramshaw and Marcus 1995) and consider an entire essay as a single sequence. Accordingly, we label the first token of each argument component as "Arg-B", each subsequent token covered by an argument component as "Arg-I", and non-argumentative tokens as "O". As a learner, we use a CRF (Lafferty, McCallum, and Pereira 2001) with the averaged perceptron training method (Collins 2002). Because a CRF considers contextual information, the model is particularly suited for sequence labeling tasks (Goudas et al., 2014, page 292). For each token, we extract the following features (Table 7):
Structural features capture the position of the token. We expect these features to be effective for filtering non-argumentative text units, since the introductions and conclusions of essays include little argumentatively relevant content. The punctuation features indicate whether the token is a punctuation mark and whether it is adjacent to one.
Syntactic features consist of the token's POS as well as features extracted from the Lowest Common Ancestor (LCA) of the current token ti and its adjacent tokens in the constituent parse tree. First, we define lcaPrec(ti) = |lcaPath(ti−1, ti)| / depth, where |lcaPath(u,v)| is the length of the path from u to the LCA of u and v, and depth is the depth of the constituent parse tree. Second, we define lcaFol(ti) = |lcaPath(ti, ti+1)| / depth, which considers the current token ti and its following token ti+1. Additionally, we add the constituent types of both lowest common ancestors to our feature set.
Lexico-syntactic features have been shown to be effective for segmenting elementary discourse units (Hernault et al. 2010). We adopt the features introduced by Soricut and Marcu (2003). We use lexical head projection rules (Collins 2003) implemented in the Stanford tool suite to lexicalize the constituent parse tree. For each token t, we extract its uppermost node n in the parse tree with the lexical head t and define a lexico-syntactic feature as the combination of t and the constituent type of n. We also consider the child node of n in the path to t and its right sibling, and combine their lexical heads and constituent types as described by Soricut and Marcu (2003).
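To make the sequence-labeling setup from the beginning of this section concrete, the following sketch converts gold component spans into the IOB labels described above; the token-span input format is our own assumption.

```python
def iob_encode(num_tokens, component_spans):
    """Produce one "Arg-B"/"Arg-I"/"O" label per token, given token spans
    of argument components (end index exclusive)."""
    labels = ["O"] * num_tokens
    for start, end in component_spans:
        labels[start] = "Arg-B"
        for i in range(start + 1, end):
            labels[i] = "Arg-I"
    return labels

tokens = "Cloning is useful because it may yield novel cures .".split()
print(iob_encode(len(tokens), [(0, 3), (4, 9)]))
```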
5.2.1 Results of Argument Component Identification
The results of model selection show that using all features performs best. Table C.1 in Appendix C provides the detailed results of the feature analysis. Table 8 shows the results of the model assessment on the test data. The heuristic baseline achieves a macro F1 score of 0.642. It achieves an F1 score of 0.677 for non-argumentative tokens (“O”) and 0.867 for argumentative tokens (“Arg-I”). Thus, the heuristic baseline effectively separates argumentative from non-argumentative text units. However, it achieves a low F1 score of 0.364 for identifying the beginning of argument components (“Arg-B”). Because it does not split sentences, it recognizes 145 fewer argument components than the number of gold standard components in the test data.
Model assessment of argument component identification († = significant improvement over baseline heuristic).
| | F1 | P | R | F1 Arg-B | F1 Arg-I | F1 O |
|---|---|---|---|---|---|---|
| Human upper bound | 0.886 | 0.887 | 0.885 | 0.821 | 0.941 | 0.892 |
| Baseline majority | 0.259 | 0.212 | 0.333 | 0 | 0.778 | 0 |
| Baseline heuristic | 0.642 | 0.664 | 0.621 | 0.364 | 0.867 | 0.677 |
| CRF all features | †0.867 | †0.873 | †0.861 | †0.809 | †0.934 | †0.857 |
The CRF model with all features significantly outperforms the macro F1 score of the heuristic baseline (p = 7.85 ×10−15). Compared with the heuristic baseline, it performs significantly better in identifying the beginning of argument components (p = 1.65 ×10−14). It also performs better for separating argumentative from non-argumentative tokens (p = 4.06 ×10−14). In addition, the number of identified argument components differs only slightly from the number of gold standard components in our test data. It identifies 1,272 argument components, whereas the number of gold standard components in our test data amounts to 1,266. The human upper bound yields a macro F1 score of 0.886 for identifying argument components. The macro F1 score of our model is only 0.019 less. Therefore, our model achieves 97.9% of human performance.
5.2.2 Error Analysis
For identifying the most frequent errors of our model, we manually investigated the predicted argument components. The most frequent errors are false positives of "Arg-I". The model classifies 1,548 out of 9,403 non-argumentative tokens ("O") as argumentative ("Arg-I"). The reason for these errors is threefold: First, the model frequently labels non-argumentative sentences in the conclusion of an essay as argumentative. These sentences are, for instance, non-argumentative recommendations for future actions or summaries of the essay topic. Second, the model does not correctly recognize non-argumentative sentences in body paragraphs. It wrongly identifies argument components in 13 out of the 15 non-argumentative body paragraph sentences in our test data. The reason for these errors may be attributed to the high class imbalance in our training data. Third, the model tends to include lengthy non-argumentative text preceding the actual argument component. For instance, it labels subordinate clauses preceding the actual argument component as argumentative in sentences similar to "In addition to the reasons mentioned above, [actual "Arg-B"] …" (underlined text units represent the annotations of our model).
The second most frequent cause of errors is misclassified beginnings of argument components: the model classifies 137 of the 1,266 beginning tokens as “Arg-I”. For instance, it fails to identify the correct beginning in sentences like “Hence, from this case we are capable of stating that [actual “Arg-B”] …” or “Apart from the reason I mentioned above, another equally important aspect is that [actual “Arg-B”] …”. These examples also explain the false negatives of the non-argumentative class, that is, non-argumentative tokens that are wrongly classified as “Arg-B”.
5.3 Recognizing Argumentation Structures
The identification of argumentation structures involves the classification of argument component types and the identification of argumentative relations. Argument component types and argumentative relations share information (Stab and Gurevych, 2014b, page 54). For instance, if an argument component is classified as claim, it is less likely to exhibit outgoing relations and more likely to have incoming relations. Conversely, an argument component with an outgoing relation and few incoming relations is more likely to be a premise. Therefore, we propose a joint model that combines both types of information for finding the optimal structure. We train two local base classifiers: one recognizes the type of argument components (Section 5.3.1), and the other identifies argumentative relations between argument components (Section 5.3.2). For both models, we use an SVM (Cortes and Vapnik 1995) with a polynomial kernel implemented in the Weka machine learning framework (Hall et al. 2009). The motivation for selecting this learner stems from the results of our previous work, in which we found that SVMs outperform several other learners in both tasks (Stab and Gurevych, 2014b, page 51). We globally optimize the outcomes of both classifiers using Integer Linear Programming in order to find the optimal argumentation structure (Section 5.3.3). In the following three sections, we first introduce the features of the two base classifiers before describing the Integer Linear Programming model.
5.3.1 Classifying Argument Components
We consider the classification of argument component types as multiclass classification and label each argument component as “major claim,” “claim,” or “premise.” We experiment with the following feature groups:
Lexical features consist of binary lemmatized unigrams and the 2k most frequent dependency word pairs. We extract the unigrams from the component and its preceding tokens to ensure that discourse markers are included in the features.
Structural features capture the position of the component in the document and token statistics (Table 9). Because major claims occur frequently in introductions or conclusions, we expect that these features are valuable for differentiating component types.
Features of the argument component classification model (*indicates genre-dependent features).
Indicator features are based on four categories of lexical indicators that we manually extracted from 30 additional essays. Forward indicators such as “therefore”, “thus”, or “consequently” signal that the component following the indicator is a result of preceding argument components. Backward indicators indicate that the component following the indicator supports a preceding component. Examples of this category are “in addition”, “because”, or “additionally”. Thesis indicators such as “in my opinion” or “I believe that” indicate major claims. Rebuttal indicators signal attacking premises or contra arguments. Examples are “although”, “admittedly”, or “but”. The complete lists of all four categories are provided in Table B.1 in Appendix B. We define for each category a binary feature that indicates if an indicator of a category is present in the component or its preceding tokens. An additional binary feature indicates if first-person indicators are present in the argument component or its preceding tokens (Table 9). We assume that first-person indicators are informative for identifying major claims.
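To make this representation more concrete, the following minimal sketch (Python) illustrates how such binary indicator features could be computed. The indicator lists and function names are illustrative placeholders rather than the full lists from Table B.1, and simple substring matching is used for brevity; it is not the actual implementation used in our experiments.

    # Sketch of the binary indicator features (Section 5.3.1).
    # The indicator lists below are short illustrative samples only.
    FORWARD = {"therefore", "thus", "consequently"}
    BACKWARD = {"in addition", "because", "additionally"}
    THESIS = {"in my opinion", "i believe that"}
    REBUTTAL = {"although", "admittedly", "but"}
    FIRST_PERSON = {"i", "me", "my", "mine", "myself"}

    def indicator_features(component_text, preceding_text):
        """Return one binary feature per indicator category."""
        text = (preceding_text + " " + component_text).lower()
        tokens = set(text.split())
        return {
            "forward_indicator": any(ind in text for ind in FORWARD),
            "backward_indicator": any(ind in text for ind in BACKWARD),
            "thesis_indicator": any(ind in text for ind in THESIS),
            "rebuttal_indicator": any(ind in text for ind in REBUTTAL),
            "first_person_indicator": bool(tokens & FIRST_PERSON),
        }

    print(indicator_features("we should ban smoking in public places",
                             "In my opinion,"))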
Contextual features capture the context of an argument component. We define eight binary features set to true if a forward, backward, rebuttal, or thesis indicator precedes or follows the current component in its covering paragraph. Additionally, we count the number of noun and verb phrases of the argument component that are also present in the introduction or conclusion of the essay. These features are motivated by the observation that claims frequently restate entities or phrases of the essay topic. Furthermore, we add four binary features indicating if the current component shares a noun or verb phrase with the introduction or conclusion.
Syntactic features consist of the POS distribution of the argument component, the number of subclauses in the covering sentence, the depth of the constituent parse tree of the covering sentence, the tense of the main verb of the component, and a binary feature that indicates whether a modal verb is present in the component.
The probability features are the conditional probabilities of the current component being assigned the type t ∈ {MajorClaim, Claim, Premise} given the sequence of tokens p directly preceding the component. To estimate P(t|p), we use maximum likelihood estimation on our training data.
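As an illustration of the maximum likelihood estimate, the following sketch counts (preceding sequence, component type) pairs; the training pairs, function, and variable names are hypothetical and serve only to show the computation of P(t|p) as relative frequencies.

    from collections import Counter, defaultdict

    # Sketch of the probability feature P(t | p): the conditional probability
    # of a component type t given the token sequence p directly preceding the
    # component, estimated by maximum likelihood from (p, t) training pairs.
    def estimate_type_given_preceding(training_pairs):
        counts = defaultdict(Counter)   # counts[preceding][component type]
        for preceding, comp_type in training_pairs:
            counts[preceding][comp_type] += 1
        def prob(comp_type, preceding):
            total = sum(counts[preceding].values())
            return counts[preceding][comp_type] / total if total else 0.0
        return prob

    # Hypothetical training data for illustration only.
    pairs = [("therefore ,", "Claim"), ("therefore ,", "Premise"),
             ("because", "Premise"), ("in my opinion ,", "MajorClaim")]
    p = estimate_type_given_preceding(pairs)
    print(p("Claim", "therefore ,"))   # 0.5 under these toy counts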
Discourse features are based on the output of the PDTB-style discourse parser from Lin, Ng, and Kan (2014). Each binary feature is a triple combining the following information: (1) the type of the relation that overlaps with the current argument component, (2) whether the current argument component overlaps with the first or second elementary discourse unit of a relation, and (3) if the discourse relation is implicit or explicit. For instance, the feature Contrast_imp_Arg1 indicates that the current component overlaps with the first discourse unit of an implicit contrast relation. The use of these features is motivated by the findings of Cabrio, Tonelli, and Villata (2013). By analyzing several example arguments, they hypothesized that general discourse relations could be informative for identifying argument components.
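The following sketch shows how such feature triples could be assembled from parser output, assuming a hypothetical tuple representation of PDTB relations (relation type, first and second argument spans, explicit flag); it only approximates the actual feature extraction.

    # Sketch of the discourse feature triples. Each PDTB relation is assumed
    # to be given as (relation_type, arg1_span, arg2_span, is_explicit),
    # where spans are (start, end) token offsets.
    def discourse_features(component_span, pdtb_relations):
        def overlaps(a, b):
            return a[0] < b[1] and b[0] < a[1]
        features = set()
        for rel_type, arg1, arg2, explicit in pdtb_relations:
            kind = "exp" if explicit else "imp"
            if overlaps(component_span, arg1):
                features.add(f"{rel_type}_{kind}_Arg1")
            if overlaps(component_span, arg2):
                features.add(f"{rel_type}_{kind}_Arg2")
        return features

    # Hypothetical relation: an implicit Contrast whose first discourse unit
    # overlaps with a component spanning tokens 10-25.
    print(discourse_features((10, 25), [("Contrast", (8, 20), (26, 40), False)]))
    # {'Contrast_imp_Arg1'}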
Embedding features are based on word embeddings trained on a part of the Google news data set (Mikolov et al. 2013). We sum the vectors of each word of an argument component and its preceding tokens and add it to our feature set. In contrast to common bag-of-words representations, embedding features have a continuous feature space that helped to achieve better results in several NLP tasks (Socher et al. 2013).
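A minimal sketch of this vector summation, assuming the pretrained Google News vectors are available locally via gensim (the file name and helper name are assumptions, not part of our pipeline):

    import numpy as np
    from gensim.models import KeyedVectors

    # Sketch of the embedding features: the vectors of all tokens of an
    # argument component (and its preceding tokens) are summed into one
    # 300-dimensional feature vector.
    vectors = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin", binary=True)

    def embedding_feature(tokens, dim=300):
        vec = np.zeros(dim)
        for tok in tokens:
            if tok in vectors:          # skip out-of-vocabulary tokens
                vec += vectors[tok]
        return vec

    feat = embedding_feature(["therefore", "smoking", "should", "be", "banned"])
    print(feat.shape)  # (300,)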
By experimenting with individual features and several feature combinations, we found that a combination of all features yields the best results. The results of the model selection can be found in Table C.2 in Appendix C.
5.3.2 Identifying Argumentative Relations
The relation identification model classifies ordered pairs of argument components as “linked” or “not-linked.” In this analysis step, we consider both argumentative support and attack relations as “linked.” For each paragraph with argument components c1, …, cn, we consider p = (ci,cj) with i≠j and 1 ≤ i,j ≤ n as an argument component pair. An argument component pair is “linked” if our corpus contains an argumentative relation with ci as source component and cj as target component. The class distribution is skewed towards “not-linked” pairs (Table A.1). We experiment with the following features:
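The following sketch illustrates how the ordered component pairs and their labels could be generated per paragraph; the component identifiers and gold relations are hypothetical and serve only to show the instance construction.

    from itertools import permutations

    # Sketch of building training instances for relation identification:
    # every ordered pair of distinct components within one paragraph becomes
    # an instance, labeled "linked" if the gold standard contains a relation
    # from the source to the target component.
    def component_pairs(paragraph_components, gold_relations):
        """gold_relations: set of (source_id, target_id) tuples."""
        instances = []
        for source, target in permutations(paragraph_components, 2):
            label = "linked" if (source, target) in gold_relations else "not-linked"
            instances.append(((source, target), label))
        return instances

    # Hypothetical paragraph with three components and one gold relation.
    print(component_pairs(["c1", "c2", "c3"], {("c2", "c1")}))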
Lexical features are binary lemmatized unigrams of the source and target component and their preceding tokens. We limit the number of unigrams for both source and target component to the 500 most frequent words in our training data.
Syntactic features include binary POS features of the source and target component and the 500 most frequent production rules extracted from the parse tree of the source and target component as described in our previous work (Stab and Gurevych 2014b).
Structural features consist of the number of tokens in the source and target component, statistics on the components of the covering paragraph of the current pair, and position features (Table 10).
Features used for argumentative relation identification (*indicates genre-dependent features).
Indicator features are based on the forward, backward, thesis, and rebuttal indicators introduced in Section 5.3.1. We extract binary features from the source and target component and the context of the current pair (Table 10). We assume that these features are helpful for modeling the direction of argumentative relations and the context of the current component pair.
Discourse features are extracted from the source and target component of each component pair as described in Section 5.3.1. Although PDTB-style discourse relations are limited to adjacent relations, we expect that the types of general discourse relations can be helpful for identifying argumentative relations. We also experimented with features capturing PDTB relations between the target and source component. However, those were not effective for capturing argumentative relations.
PMI features are based on the assumption that particular words indicate incoming or outgoing relations. For instance, tokens like “therefore”, “thus”, or “hence” can signal incoming relations, whereas tokens such as “because”, “since”, or “furthermore” may indicate outgoing relations. To capture this information, we use Pointwise Mutual Information (PMI), which has been successfully used for measuring word associations (Turney 2002; Church and Hanks 1990). However, instead of determining the PMI of two words, we estimate the PMI between a lemmatized token t and the direction of a relation d ∈ {incoming, outgoing} as PMI(t,d) = log( p(t,d) / (p(t) p(d)) ). Here, p(t,d) is the probability that token t occurs in an argument component with either incoming or outgoing relations. The ratio between p(t,d) and p(t) p(d) indicates the dependence between a token and the direction of a relation. We estimate PMI(t,d) for each token in our training data. For both source and target components, we extract the ratio of tokens positively and negatively associated with incoming or outgoing relations. Additionally, we extract four binary features, which indicate if any token of the components has a positive or negative association with either incoming or outgoing relations.
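A small sketch of how PMI(t,d) can be estimated from (token, direction) observations; the toy data and function names are illustrative only.

    import math
    from collections import Counter

    # Sketch of the PMI features: PMI(t, d) = log( p(t, d) / (p(t) * p(d)) ),
    # where d is "incoming" or "outgoing". Observations are assumed to be
    # (token, direction) pairs collected from components with relations.
    def pmi_table(observations):
        token_counts = Counter(t for t, _ in observations)
        dir_counts = Counter(d for _, d in observations)
        joint_counts = Counter(observations)
        n = len(observations)
        table = {}
        for (t, d), c in joint_counts.items():
            p_joint = c / n
            p_token = token_counts[t] / n
            p_dir = dir_counts[d] / n
            table[(t, d)] = math.log(p_joint / (p_token * p_dir))
        return table

    # Hypothetical toy observations for illustration.
    obs = [("therefore", "incoming"), ("therefore", "incoming"),
           ("because", "outgoing"), ("since", "outgoing")]
    print(pmi_table(obs)[("therefore", "incoming")])  # > 0: positive association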
Shared noun features (shNo) indicate if the source and target components share a noun. We also add the number of shared nouns to our feature set. These features are motivated by the observation that claims and premises often share the same subject.
For selecting the best performing model, we conducted feature ablation tests and experimented with individual features. The results show that none of the feature groups is informative when used individually. We achieved the best performance by removing lexical features from our feature set (detailed results of the model selection can be found in Table C.3 in Appendix C).
5.3.3 Jointly Modeling Argumentative Relations and Argument Component Types
Having defined the ILP model, we consolidate the results of the two base classifiers. We approach this task by determining the weight matrix W ∈ ℝn×n that contains the coefficients wij ∈ W of our objective function. The weight matrix W can be considered an adjacency matrix: the greater the weight of a particular relation, the higher the likelihood that the relation appears in the optimal structure found by the ILP-solver.
First, we incorporate the results of the relation identification model. Its result can be considered as an adjacency matrix R ∈{0,1}n×n. For each pair of argument components (i,j) with 1 ≤ i,j ≤ n, each rij ∈ R is 1 if the relation identification model predicts an argumentative relation from argument component i (source) to argument component j (target), or 0 if the model does not predict an argumentative relation.
Third, we incorporate the argument component types predicted by the classification model. We assign a higher score to the weight wij if the target component j is predicted as claim, because it is more likely that argumentative relations point to claims. Accordingly, we set cij = 1 if argument component j is labeled as claim and cij = 0 if argument component j is labeled as premise.
After applying the ILP model, we adapt the argumentative relations and argument types according to the results of the ILP-solver. We revise each relation according to the determined xij scores, set the type of all components without outgoing relation to “claim,” and set the type of all remaining components to “premise.”
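To illustrate the general shape of this optimization, the following sketch uses the PuLP solver with a hypothetical weight matrix W. The constraints shown (no self-relations, at most one outgoing relation per component) are a simplified stand-in for the full constraint set of our ILP model, and the post-processing step mirrors the relabeling described above; the code is a sketch under these assumptions, not our actual implementation.

    import pulp

    # Minimal sketch of the joint optimization step using PuLP. W is a
    # hypothetical weight matrix combining the outputs of the base
    # classifiers; the constraints are a simplified illustration only.
    def solve_structure(W):
        n = len(W)
        prob = pulp.LpProblem("argumentation_structure", pulp.LpMaximize)
        x = {(i, j): pulp.LpVariable(f"x_{i}_{j}", cat="Binary")
             for i in range(n) for j in range(n)}
        prob += pulp.lpSum(W[i][j] * x[(i, j)] for i in range(n) for j in range(n))
        for i in range(n):
            prob += x[(i, i)] == 0                                # no self-relations
            prob += pulp.lpSum(x[(i, j)] for j in range(n)) <= 1  # <= 1 outgoing relation
        prob.solve(pulp.PULP_CBC_CMD(msg=False))
        relations = {(i, j) for (i, j), var in x.items() if var.value() == 1}
        # Post-processing as described above: components without an outgoing
        # relation become claims, all remaining components become premises.
        types = ["claim" if all(src != i for src, _ in relations) else "premise"
                 for i in range(n)]
        return relations, types

    W = [[0, 0.2, 0.8], [0.1, 0, 0.9], [-0.1, -0.2, 0]]   # toy weights
    print(solve_structure(W))   # components 0 and 1 point to 2, which becomes the claim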
5.4 Classifying Support and Attack Relations
The stance recognition model differentiates between argumentative support and attack relations. We model this task as binary classification and classify each claim and premise as “support” or “attack.” The stance of each premise is encoded in the type of its outgoing relation, whereas the stance of each claim is encoded in its stance attribute. We use an SVM and the features listed in Table 11.
5.5 Evaluation
Table 12 shows the F1 scores of the classification, relation identification, and stance recognition tasks using our test data. The ILP joint model significantly outperforms the macro F1 score of the heuristic baselines for component classification (p = 1.49 ×10−4) and relation identification (p = 0.003). It also significantly outperforms the macro F1 score of the base classifier for component classification (p = 7.45 ×10−4). However, it does not yield a significant improvement over the macro F1 score of the base classifier for relation identification. The results show that the identification of claims and of linked component pairs benefits most from the joint model. Compared with the base classifiers, the ILP joint model improves the F1 score of claims by 0.071 (p = 1.84 ×10−4) and the F1 score of linked component pairs by 0.077 (p = 6.95 ×10−5). The stance recognition model significantly outperforms the heuristic baseline by 0.118 macro F1 score (p = 0.008). It yields an F1 score of 0.947 for supporting components and 0.413 for attacking components.
Model assessment on persuasive essays († = significant improvement over baseline heuristic; ‡ = significant improvement over base classifier).
The human upper bound yields macro F1 scores of 0.868 for component classification, 0.854 for relation identification, and 0.844 for stance recognition. The ILP joint model almost achieves human performance for classifying argument components; its F1 score is only 0.042 lower than the human upper bound. Regarding relation identification and stance recognition, the macro F1 scores of our model are 0.103 and 0.164 lower than human performance. Thus, our model achieves 95.2% of human performance for component identification, 87.9% for relation identification, and 80.5% for stance recognition.
In order to verify the effectiveness of our approach, we also evaluated the ILP joint model on the English microtext corpus (cf. Section 2.4). To ensure the comparability to previous results, we used the same repeated cross-validation set-up as described by Peldszus and Stede (2015). Because the microtext corpus does not include major claims, we removed the major claim label from our component classification model. Furthermore, it was necessary to adapt several features of the base classifiers, since the microtext corpus does not include non-argumentative text units. Therefore, we did not consider preceding tokens for lexical, indicator, and embedding features and removed the probability feature of the component classification model. Additionally, we removed all genre-dependent features of both base classifiers.
Table 13 shows the evaluation results of our model on the microtext corpus. Our ILP joint model significantly outperforms the macro F1 score of the heuristic baselines for component classification (p = 2.10 ×10−10) and relation identification (p = 1.5 ×10−8). The results also show that our model yields significantly better macro F1 scores compared to the two base classifiers (p = 0.002 for component classification and p = 7.52 ×10−7 for relation identification). The stance recognition model achieves a macro F1 score of 0.745 on the microtext corpus. It significantly improves the macro F1 score of the heuristic baseline by 0.203 (p = 7.55 ×10−10).
Model assessment on microtext corpus from Peldszus and Stede (2015) († = significant improvement over baseline heuristic; ‡ = significant improvement over base classifier).
The last two rows in Table 13 show the results reported by Peldszus and Stede (2015) on the English microtext corpus. The Best EG model is their best model for component classification, and MP+p is their best model for relation identification. Compared with our ILP joint model, the Best EG model achieves better macro F1 scores for component classification and relation identification. However, because the outcomes of their systems are not available to us, we cannot determine if this difference is significant. The MP+p model achieves a better macro F1 score for relation identification, but yields lower results for component classification and stance recognition compared to our ILP joint model. These differences can be attributed to the additional information about the function and role attribute incorporated in their joint models (cf. Section 2.3). They showed that both have a beneficial effect on the component classification and relation identification in their corpus (Peldszus and Stede, 2015, Figure 3). However, the role attribute is a unique feature of their corpus and the arguments in their corpus exhibit an unusually high proportion of attack relations. In particular, 86.6% of their arguments include attack relations, whereas the proportion of arguments with attack relations in our corpus amounts to only 12.4%. Therefore, we assume that incorporating function and role attributes will not be beneficial using our corpus.
Overall, the evaluation results show that our ILP joint model significantly outperforms challenging heuristic baselines and simultaneously improves the performance of component classification and relation identification on two different types of discourse.
5.6 Error Analysis
In order to analyze frequent errors of the ILP joint model, we investigated the predicted argumentation structures in our test data. The confusion matrix of the component classification task (Table 14) shows that the highest confusion is between claims and premises. The model classifies 74 actual premises as claims and 82 claims as premises. By manually investigating these errors, we found that the model tends to label inner premises in serial structures as claims and wrongly identifies claims in sentences containing two premises. Regarding the relation identification, we observed that the model tends to identify argumentation structures that are more shallow than the structures in our gold standard. The model correctly identifies only 34.7% of the 98 serial arguments in our test data. This can be attributed to the “claim-centered” weight calculation in our objective function. In particular, the predicted relations in matrix R are the only information about serial arguments, whereas the other two scores (cij and crij) assign higher weights to relations pointing to claims.
In order to determine if the ILP joint model correctly models the relationship between component types and argumentative relations, we artificially improved the predictions of both base classifiers as suggested by Peldszus and Stede (2015). The dashed lines in Figure 4 show the performance of the artificially improved base classifiers. Continuous lines show the resulting performance of the ILP joint model. Figures 4a and 4b show the effect of improving the component classification and relation identification. They show that correct predictions of one base classifier are not maintained after applying the ILP model if the other base classifier exhibits less accurate predictions. In particular, less accurate argumentative relations have a more detrimental effect on the component types (Figure 4a) than less accurate component types do on the outcomes of the relation identification (Figure 4b). Thus, it is more reasonable to focus on improving relation identification than component classification in future work.
Influence of improving the base classifiers (x-axis shows the proportion of improved predictions and y-axis the macro F1 score).
Figure 4c depicts the effect of improving both base classifiers simultaneously. It illustrates that the ILP joint model improves the component types more effectively than the argumentative relations, and that both tasks benefit if the base classifiers are improved. Therefore, we conclude that the ILP joint model successfully captures the natural relationship between argument component types and argumentative relations.
6. Discussion
Our argumentation structure parser is a pipeline consisting of several consecutive steps. Therefore, potential errors of the upstream models are propagated and negatively influence the results of the downstream models. For example, errors of the identification model can result in flawed argumentation structures if argumentatively relevant text units are not recognized or non-argumentative text units are identified as relevant. However, our identification model yields good accuracy and an αU of 0.958 for identifying argument components. Therefore, it is unlikely that identification errors will significantly influence the outcome of the downstream models when applied to persuasive essays. However, as demonstrated by Levy et al. (2014) and Goudas et al. (2014), the identification of argument components is more complex in other text genres than it is in persuasive essays. Another potential issue of the pipeline architecture is that wrongly classified major claims will decrease the accuracy of the model because they are not integrated in the joint modeling approach. For this reason, it is worthwhile to experiment in future work with structured machine learning methods that incorporate several tasks in one model (Moens 2013).
In this work, we presented an approach for recognizing argumentation structures in persuasive essays. Other text genres, however, may exhibit less explicit arguments. Habernal and Gurevych (2017, page 27), for instance, showed that 48% of the arguments in user-generated Web discourse do not include explicit claims. These incomplete arguments, so-called enthymemes, make both annotation and automatic analysis challenging. Although humans may be able to deduce the missing parts by interpreting the argument, existing argument mining methods fail at this task and may produce incomplete or even wrong argumentation structures. In particular, the presented approach is neither able to recognize gaps in reasoning (i.e., missing premises) nor to infer the missing components of implicit arguments. Inferring implicit argument components is challenging because it requires robust methods for capturing the semantics of natural language arguments and appropriate background knowledge for reconstructing the missing parts.
The presented argumentation structure parser is an important milestone for implementing argumentative writing support systems. For example, the recognized argumentation structures allow highlighting unwarranted claims and missing major claims, and enable quantitative analyses of the number of arguments and their premises. It is still unknown, however, whether this feedback provides adequate guidance for improving students' argumentation skills. Answering this question requires integrating the proposed model into writing environments and investigating the effect of different feedback types on students' argumentation skills in future research.
7. Conclusion
In this article, we presented an end-to-end approach for parsing argumentation structures in persuasive essays. Previous approaches suffer from several limitations: they either focus only on particular subtasks of argumentation structure parsing or rely on manually created rules, and are therefore of limited use for parsing argumentation structures in real application scenarios. To the best of our knowledge, the presented work is the first approach that covers all subtasks required for identifying the global argumentation structure of documents. We showed that jointly modeling argumentation structures simultaneously improves the results of component classification and relation identification. Additionally, we introduced a novel annotation scheme and a new corpus of persuasive essays annotated with argumentation structures, which represents the largest resource of its kind. Both the corpus and the annotation guidelines are freely available.
Appendix A. Class Distributions
Table A.1 shows the class distributions of the training and test data of the persuasive essay corpus for each analysis step.
Appendix B. Indicators
Table B.1 shows all of the lexical indicators we extracted from 30 persuasive essays. The lists include 24 forward indicators, 33 backward indicators, 48 thesis indicators, and 10 rebuttal indicators.
Appendix C. Detailed Results of Model Selections
The following tables show the model selection results for all five tasks using 5-fold cross-validation on our training data. Table C.1 shows the results of using individual feature groups for the argument component identification task. Lexico-syntactic features perform best regarding the macro F1 score, and they perform particularly well for recognizing the beginning of argument components (“Arg-B”). The second best features are structural features. They yield the best F1 score for separating argumentative from non-argumentative text units (“O”).
Syntactic features are useful for identifying the beginning of argument components. The probability feature yields the lowest macro F1 score. Nevertheless, we observe a significant decrease compared with the macro F1 score of the model with all features when evaluating the system without the probability feature (p = 0.003). We obtain the best results by using all features. Because persuasive essays exhibit a particular paragraph structure, which may not be present in other text genres (e.g., user-generated Web discourse), we also evaluate the model without genre-dependent features (cf. Table 7). This yields a significant difference compared with the macro F1 score of the model with all features (p = 2.24×10−54).
Table C.2 shows the model selection results of the classification model. Structural features are the only features that significantly outperform the macro F1 score of the heuristic baseline when used individually (p = 4.04×10−6). They are the most effective features for identifying major claims and claims. The second-best features for identifying claims are discourse features. With this knowledge, we can confirm the assumption that general discourse relations are useful for component classification (cf. Section 5.3.1). Embedding features do not perform as well as lexical features. They yield lower F1 scores for major claims and claims. Contextual features are effective for identifying major claims, since they implicitly capture if an argument component is present in the introduction or conclusion (cf. Section 5.3.1). Indicator features are most effective for identifying major claims, but contribute only slightly to the identification of claims. Syntactic features are predictive of major claims and premises, but are not effective for recognizing claims. The probability features are not informative for identifying claims, probably because forward indicators may also signal inner premises in serial structures. Omitting probability and embedding features yields the best accuracy. However, we select the best system by means of the macro F1 score, which is more appropriate for imbalanced data sets. Accordingly, we select the model with all features (Table C.2).
The model selection results for relation identification are shown in Table C.3. We report the results of feature ablation tests, since none of the feature groups yields remarkable results when used individually. Structural features are the most effective features for identifying relations. The second- and third-most effective feature groups are indicator and PMI features. Removing the shared noun feature does not yield a significant difference in accuracy or macro F1 score compared with SVM all features. We achieve the best macro F1 score by removing lexical features from the feature set.
Argumentative relation identification († = significant improvement over baseline heuristic; ‡ = significant difference compared to SVM all features).
Table C.4 shows the model selection results of the ILP joint model. Base+heuristic shows the result of applying the baseline to all paragraphs in which the base classifiers identify neither claims nor argumentative relations. The heuristic baseline is triggered in 31 paragraphs, which results in 3.3% more trees identified compared with the base classifiers. However, the difference between Base+heuristic and the base classifiers is not statistically significant. For this reason, we can attribute any further improvements to the joint modeling approach. Moreover, Table C.4 shows selected results of the hyperparameter tuning of the ILP joint model. Using only predicted relations in the ILP-naïve model does not yield an improvement compared with the macro F1 score of the base classifiers. ILP-relation uses only information from the relation identification base classifier. It significantly outperforms the macro F1 score of both base classifiers (p = 6.43×10−12 for relations and p = 7.23×10−13 for components), but converts a large number of premises to claims. The ILP-claim model uses only the outcomes of the argument component base classifier and improves neither component classification nor relation identification. All three models identify a relatively high proportion of claims compared to the number of claims in our training data. The reason for this is that many weights in W are 0. Combining the results of both base classifiers yields a more balanced proportion of component type conversions. All three models (ILP-equal, ILP-same, and ILP-balanced) significantly outperform the macro F1 score of the base classifiers. We identify the best performing system by means of the average macro F1 score for both tasks. Accordingly, we select ILP-balanced as our best performing ILP joint model.
Joint modeling approach († = significant improvement over base heuristic; ‡ = significant improvement over base classifier; Cl→Pr = number of claims converted to premises; Pr→Cl = number of premises converted to claims; Trees = Percentage of correctly identified trees).
Table C.5 shows the model selection results for the stance recognition model. Using sentiment, structural, and embedding features individually does not yield an improvement over the majority baseline. Lexical features yield a significant improvement over the macro F1 score of the heuristic baseline when used individually (p = 8.02 ×10−10). Syntactic features significantly improve precision (p = 1.81 ×10−30), recall (p = 1.95 ×10−47), F1 Support (p = 1.01 ×10−27), and F1 Attack (p = 1.53 ×10−54) over the heuristic baseline, but do not yield a significant improvement over the macro F1 score of the heuristic baseline. Discourse features significantly outperform the heuristic baseline regarding precision (p = 3.68 ×10−28), recall (p = 3.43 ×10−49), and F1 Support (p = 1.06 ×10−32). Because omitting any of the feature groups yields a lower macro F1 score, we select the model with all features as the best performing model.
Stance recognition († = significant improvement over baseline heuristic; ‡ = significant difference compared to SVM all features).
Model | F1 | P | R | F1 Support | F1 Attack
---|---|---|---|---|---
Baseline majority | 0.475 | 0.452 | 0.500 | 0.950 | 0
Baseline heuristic | 0.521 | 0.511 | 0.530 | 0.767 | 0.173
SVM only lexical | †0.663 | †0.677 | †0.650 | †0.941 | †0.383
SVM only syntactic | 0.649 | †0.725 | †0.587 | †0.950 | †0.283
SVM only discourse | 0.630 | †0.746 | †0.546 | †0.951 | 0.169
SVM all w/o lexical | †0.696 | †‡0.719 | †0.657 | †‡0.948 | †‡0.439
SVM all w/o syntactic | †0.687 | †‡0.691 | †‡0.684 | †‡0.941 | †‡0.433
SVM all w/o sentiment | †0.699 | †‡0.710 | †0.688 | †‡0.945 | †‡0.451
SVM all w/o structural | †0.698 | †‡0.710 | †0.686 | †‡0.946 | †‡0.449
SVM all w/o discourse | †0.675 | †‡0.685 | †‡0.666 | †‡0.941 | †‡0.408
SVM all w/o embeddings | †0.692 | †‡0.703 | †‡0.682 | †‡0.944 | †‡0.439
SVM all features | †0.702 | †0.714 | †0.690 | †0.946 | †0.456
Acknowledgments
This work has been supported by the Volkswagen Foundation as part of the Lichtenberg-Professorship Program under grant no. I/82806 and by the German Federal Ministry of Education and Research (BMBF) as part of the Software Campus project AWS under grant no. 01|S12054. We would like to thank the anonymous reviewers for their valuable feedback; Can Diehl, Ilya Kuznetsov, Todd Shore, and Anshul Tak for their valuable contributions; and Andreas Peldszus for providing details about his corpus.
Notes
The kappa coefficient is an IAA measure for categorical items that accounts for agreement by chance. The formal definition and a comprehensive overview of chance-corrected IAA measures can be found in the survey of Artstein and Poesio (2008).
The example essay was written by the authors to illustrate all phenomena of argumentation structures in persuasive essays.
Although it would be preferable to have a group of annotators with similar annotation experience (e.g., all non-experts), because of a lack of resources it is common practice to have mixed annotator groups.
In our evaluation set of 80 essays, the annotators identified several argument components of different types in 4.3% of the sentences. Thus, evaluating the reliability of argument components at the sentence level is a good approximation of the inter-annotator agreement.
Full stops at the end of a sentence are all classified as non-argumentative.
We set LCApreceding = −1 if ti is the first token in its covering sentence and LCAfollowing = −1 if ti is the last token in its covering sentence.
We consider only claims and premises in our joint model, since argumentative relations between claims and major claims are modeled with a level approach (cf. Section 3.2).
We use the lpsolve framework (http://lpsolve.sourceforge.net) and set each variable in the objective function to binary mode for ensuring the upper bound of 1.
For finding the best learner, we compared naïve Bayes (John and Langley 1995), Random Forests (Breiman 2001), Multinomial Logistic Regression (le Cessie and van Houwelingen 1992), C4.5 Decision Trees (Quinlan 1993), and SVM (Cortes and Vapnik 1995); we found that an SVM outperforms all other classifiers.
The heuristic baseline for stance recognition on the microtext corpus classifies the fourth component as “attack” and all other components as “support.”
References
Author notes
Technische Universität Darmstadt, Ubiquitous Knowledge Processing (UKP) Lab, Hochschulstrasse 10, D-64289 Darmstadt, Germany and Ubiquitous Knowledge Processing Lab (UKP-DIPF), German Institute for Educational Research, Schloßstraße 29, D-60486 Frankfurt am Main, Germany.