Abstract
Despite extensive research efforts in recent years, computational argumentation (CA) remains one of the most challenging areas of natural language processing. The reason for this is the inherent complexity of the cognitive processes behind human argumentation, which integrate a plethora of different types of knowledge, ranging from topic-specific facts and common sense to rhetorical knowledge. Integrating knowledge from such a wide range of sources requires modeling capabilities far beyond those needed for many other natural language understanding tasks. Existing research on mining, assessing, reasoning over, and generating arguments largely acknowledges that much more knowledge is needed to accurately model argumentation computationally. However, a systematic overview of the types of knowledge introduced in existing CA models is missing, hindering targeted progress in the field. Adopting the operational definition of knowledge as any task-relevant normative information not provided as input, the survey paper at hand fills this gap by (1) proposing a taxonomy of types of knowledge required in CA tasks, (2) systematizing the large body of CA work according to the reliance on and exploitation of these knowledge types for the four main research areas in CA, and (3) outlining and discussing directions for future research efforts in CA.
1 Introduction
The phenomenon of argumentation, a direct reflection of human reasoning in natural language, has fascinated scholars across societies and cultures since ancient times (Aristotle, ca. 350 B.C.E./ translated 2007; Lloyd, 2007). The computational modeling of human argumentation, commonly referred to as computational argumentation (CA), has evolved into one of the most prominent and at the same time most challenging areas in natural language processing (NLP) (Lippi and Torroni, 2015).
CA encompasses several families of tasks and research directions, the main ones in NLP being argument mining, assessment, reasoning, and generation. Although it bears some resemblance to other NLP tasks, such as opinion mining and natural language inference (NLI), it is widely acknowledged to be of much higher difficulty than the other tasks (Habernal et al., 2014). While opinion mining (Liu, 2012) assesses stances towards entities or controversies by asking what the opinions are, CA provides answers to a more difficult question: Why is the stance of an opinion holder the way it is? In a similar vein, while NLI focuses on detecting simple entailments between statement pairs (Bowman et al., 2015; Dagan et al., 2013), CA addresses more complex reasoning scenarios that involve multiple entailment steps, often over implicit premises (Boltužić and Šnajder, 2016).
CA targets reasoning processes that are only partially explicated in text. Its mastery thus requires advanced natural language understanding capabilities and a substantial amount of background knowledge (Moens, 2018; Paul et al., 2020). For example, the assessment of an argument’s quality not only depends on the actual content of an argumentative text or speech but also on social and cultural context, such as speaker and audience characteristics, including their individual values, ideologies, and relationships (Wachsmuth et al., 2017a). Such contextual information remains most often implicit. For any concrete CA task, we here refer to all information that is not explicitly provided as input to models tackling the task but is (potentially) useful for it and (in most cases) normative in nature as knowledge (we detail this notion in §3.1).
Although there is ample awareness of the need for integrating various types of knowledge in CA models in the research community, there is no systematic overview of the types of knowledge that existing models and solutions for the different CA tasks rely on. This impedes targeted progress in pressing subareas of CA, such as argument generation. While general surveys on CA (e.g., Cabrio and Villata, 2018; Lawrence and Reed, 2020) and its subareas (e.g., Al Khatib et al., 2021; Schaefer and Stede, 2021) represent good starting points for targeted research along these lines, they lack a systematic analysis of the roles that different types of knowledge play in different CA tasks.
Contributions.
In this work, we aim to systematically inform the research community about the types of knowledge that have—or have not yet—been integrated into computational models in different CA tasks. For this purpose, we (1) propose a pyramid-like taxonomy systematizing the relevant types of knowledge. The pyramid is organized by knowledge specificity, from linguistic knowledge and world and topic knowledge to argumentation-specific and task-specific knowledge. Starting from 162 CA publications, we (2) survey the existing body of work with respect to the level of integration of the various types of knowledge and respective methodology by which the knowledge of each type is integrated into models. To this end, we carry out an expert annotation study in which we manually label individual papers with the types from the knowledge pyramid. Finally, we (3) identify trends and challenges in the four most prominent CA subareas (mining, assessment, reasoning, and generation), summarizing them into three key recommendations for future CA research:
All CA tasks are expected to benefit from more modeling of world and topic knowledge. Although several studies report empirical gains from incorporating these types of knowledge, their inclusion is still an exception rather than a rule across the landscape of all CA tasks.
Argument mining tasks are expected to benefit from more modeling of argumentation- and task-specific knowledge. Such specialized knowledge has been proven effective in assessment, reasoning, and generation tasks. Yet, it has so far been exploited only sporadically in argument mining approaches.
All CA tasks are expected to benefit from applying key techniques to other types of knowledge and data. As an example, methods that represent symbolic input in a semantic vector space (e.g., pretrained word embeddings or language models) are still rarely applied to sources other than text (e.g., to knowledge bases). The bottleneck to a wider application of general-purpose techniques such as representation learning in CA is the lack of structured knowledge resources. We thus argue that significant progress in CA critically hinges on the availability of such resources at larger scale. Accordingly, based on the results of this survey effort, we strongly encourage the CA community to foster the creation of knowledge-rich argumentative corpora.
Structure.
We start with an overview of the field of CA and its four most prominent subareas (§2). In §3, we describe our survey methodology, before we establish the knowledge pyramid and present the results of the survey with respect to the types of knowledge from the pyramid (§4). On this basis, we summarize emerging trends (§5) and offer recommendations for future progress in CA (§6).
2 Background
The study of argumentation in Western societies can be traced back to Ancient Greece. With the development of democracy and, thereby, the need to influence public decisions, the art of convincing others became an essential skill for successful participation in the democratic process (Aristotle, ca. 350 B.C.E./ translated, 2007). Around the same period, rhetorical theories also started to appear in Eastern societies and cultures, for example in the Nyaya Sutra (Lloyd, 2007). Since then, a plethora of phenomena in the realm of argumentation, such as fallacies (Hamblin, 1970) and argumentation schemes (Walton et al., 2008), have been studied extensively, usually focusing on specific domains, such as science (Gilbert, 1977) and law (Toulmin, 2003).
With the growing amount of argumentative data available publicly in Web debates, scientific articles, and other Internet sources, the computational modeling of argumentation, computational argumentation (CA), gradually gained prominence and popularity in the NLP community. As depicted in Figure 1, CA can be divided into four main subareas that represent the main high-level types of tasks being tackled with computational models: mining, assessment, reasoning, and generation.
Argument Mining.
Argument mining deals with the extraction of argumentative structures from natural language text (e.g., Stab and Gurevych, 2017a). Traditionally, it has been addressed with a pipeline of models each tackling one analysis task, most commonly component identification, component classification, and relation identification (Lippi and Torroni, 2015). The set of argument components and relations is defined by the selected underlying argument model which reflects the rhetorical, dialogical, or monological structure of argumentation (Bentahar et al., 2010b).
For instance, the model of Toulmin (2003), designed for the legal domain, encompasses six components: a claim with an optional qualifier, data (i.e., a fact supporting the claim) connected to the claim via a warrant (i.e., the reason why support is given) and its backing, and a rebuttal (i.e., a counterconsideration to the claim). Relations model the support or attack of components (or arguments) by others, sometimes with more fine-grained subtypes (Freeman, 2011). In contrast to argument reasoning (see below), the information needed for inferring argumentative relations is contained in the text.
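To make the pipeline view above concrete, the following minimal sketch chains the three analysis steps; all heuristics, thresholds, and the example sentences are illustrative placeholders rather than any system from the surveyed literature.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Component:
    text: str
    label: str = "UNKNOWN"  # e.g., "Claim" or "Premise" after classification

def identify_components(text: str) -> List[Component]:
    # Placeholder for a sequence tagger: naively treat every sentence as a component.
    return [Component(s.strip()) for s in text.split(".") if s.strip()]

def classify_components(components: List[Component]) -> List[Component]:
    # Placeholder heuristic standing in for a trained component classifier.
    for c in components:
        c.label = "Premise" if "because" in c.text.lower() else "Claim"
    return components

def identify_relations(components: List[Component]) -> List[Tuple[int, int, str]]:
    # Placeholder: link every premise to the first claim with a "support" edge.
    claims = [i for i, c in enumerate(components) if c.label == "Claim"]
    return [(i, claims[0], "support")
            for i, c in enumerate(components) if c.label == "Premise" and claims]

components = classify_components(identify_components(
    "We should ban plastic bags. They harm wildlife because they do not degrade."))
print(components, identify_relations(components))
```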
Argument Assessment.
Computational models that address tasks in this subarea typically focus on particular properties of arguments in their context and automatically assign discrete or numeric labels for these properties. This includes the classification of stance towards some target (Bar-Haim et al., 2017a) as well as the identification of frames (or aspects) covered by the argument (Ajjour et al., 2019). Arguably, the most popular family of tasks belongs to argument quality assessment, which has been studied under various conceptualizations, such as clarity (Persing and Ng, 2013) or convincingness (Habernal and Gurevych, 2016b). Wachsmuth et al. (2017a) propose a taxonomy that divides the overall quality of an argument into three complementary aspects: logic, rhetoric, and dialectic. Each of these three aspects further consists of several quality dimensions (e.g., the dimension of global acceptability for the dialectical aspect).
Argument Reasoning.
In this subarea, the task is to understand the reasoning process behind an argument. In NLP, reasoning is instantiated in tasks such as predicting the entailment relationship between a premise and a hypothesis by means of natural language inference (Williams et al., 2017), or the more complex task of warrant identification, that is, to find (or even reconstruct) the missing warrant (Tian et al., 2018). Others have tried to classify schemes of inferences happening in arguments (Feng and Hirst, 2011) or to recognize fallacies of certain reasoning types in arguments, such as the common ad-hominem fallacy (Habernal et al., 2018c; Delobelle et al., 2019).
In argument reasoning, the challenge lies in inducing additional knowledge—not explicated in the text—from existing components, as opposed to relation identification, which focuses on recognizing argumentative content present in the text. In other words, argument mining structures explicated arguments and their connections, whereas argument reasoning infers knowledge missing from the text (e.g., a warrant that connects the premise to the claim). In practice, however, there is no guarantee that annotators for argument mining tasks (e.g., relation identification) do not resort to out-of-text reasoning, leveraging their commonsense and world knowledge to perform the task. However, from a structural point of view, a premise may still be given by an author to support a claim (e.g., indicated by lexical cues like because), while from a reasoning perspective, the premise might be irrelevant to the claim (e.g., the claim does not logically follow from the given premise).
Argument Generation.
With conversational AI (i.e., dialogue systems) arguably becoming the most prominent application in modern NLP and AI, the research efforts on generating argumentative language have also been gaining traction. Main tasks in argument generation include the summarization of arguments given (Wang and Ling, 2016), the synthesis of new claims and other argument components (Bilu and Slonim, 2016), and the synthesis of entire arguments, possibly conforming to some rhetorical strategy (El Baff et al., 2019).
The impact of argument generation is, for example, demonstrated by Project Debater (Slonim, 2018), a well-known argumentation system which combines models for several generation tasks.
3 Methodology
In this section, we first provide the definition of knowledge upon which we base this work. Then, we detail the methodology that we devised and pursued in order to organize the types of knowledge that CA approaches and models utilize.
3.1 An Operational Definition of Knowledge
Various definitions of “knowledge” have been proposed in the literature. One of the oldest is the tripartite definition of Plato (ca. 400 B.C.E.), who accepted as knowledge any justified true belief. This definition was later often challenged as being too narrow and was, accordingly, extended (e.g., Goldman, 1967; Hawthorne, 2002). As part of this effort, Dretske (1981) dressed Plato’s view into an information-theoretic gown, defining knowledge as information-caused belief, specifying more narrowly the informational source of the belief as the only valid justification and de facto eliminating the veracity constraint.
Departing from attempts to define knowledge ontologically, Gottschalk-Mazouz (2013) adopted an impact-based viewpoint, arguing that it is more important to understand what knowledge can do and what it is like than to ontologically answer what knowledge is. In this view, knowledge is thus normative and has practical implications. In the work at hand, we adopt this impact-oriented view on knowledge. We further operationalize the view, in the context of NLP and CA, as follows:
Knowledge is any kind of normative information that is considered to be relevant for solving a task at hand and that is not given as task input itself.
In CA research, knowledge has been modeled in a variety of forms that conform to this definition, ranging from lexicons and engineered features to specially tailored pipelines, model components, or overall algorithm design (e.g., auxiliary tasks or special training objectives). While this is not the primary dimension of our analysis (see §4.1), it is worth noting the difference between knowledge that is presented explicitly, that is, knowledge that can be rather directly used to shape the input representations for the task (e.g., lexicons, feature engineering, predictions of existing auxiliary models), and knowledge that is introduced implicitly through the algorithm or model design (e.g., auxiliary tasks in multi-task learning, or the ordering of individual models in model pipelines). Both, we argue, conform to the above operational definition of knowledge to which we subscribe in this work. Finally, we emphasize that we consider annotated corpora, leveraged in supervised task learning, to be input rather than external knowledge brought in to facilitate learning.
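As a minimal sketch of the implicit case (not taken from any surveyed paper), an auxiliary task such as discourse relation prediction injects knowledge only through the training objective of a shared encoder:

```python
import torch
import torch.nn as nn

# An auxiliary task (e.g., discourse relation prediction) injects knowledge
# implicitly: it influences the shared encoder only via the combined loss.
class MultiTaskModel(nn.Module):
    def __init__(self, dim=64, n_main=2, n_aux=4):
        super().__init__()
        self.encoder = nn.Linear(dim, dim)       # shared representation
        self.main_head = nn.Linear(dim, n_main)  # e.g., component classification
        self.aux_head = nn.Linear(dim, n_aux)    # e.g., discourse relations

    def forward(self, x):
        h = torch.relu(self.encoder(x))
        return self.main_head(h), self.aux_head(h)

model = MultiTaskModel()
x = torch.randn(8, 64)  # dummy input features
main_logits, aux_logits = model(x)
loss = nn.functional.cross_entropy(main_logits, torch.randint(0, 2, (8,))) \
     + 0.5 * nn.functional.cross_entropy(aux_logits, torch.randint(0, 4, (8,)))
loss.backward()
```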
3.2 Analysis Scope
Generally, we focus on natural language argumentation and its computational treatment in NLP. Hence, we exclude work outside of this community, for example, studies on abstract argumentation (e.g., Vreeswijk, 1997), except if there is a strong link to natural language argumentation. For articles published in non-NLP venues, we made the decision based on the title. When unclear from the title whether the work primarily addresses natural language argumentation (e.g., as in the case of McBurney and Parsons, 2021), we analyzed the whole article before making the scope decision. Our survey covers the four subareas of CA in NLP from §2, with the following restrictions:
In argument mining, we do not include methods that have been designed strictly for a specific genre or domain and are not applicable elsewhere. Argumentative zoning (e.g., Teufel et al., 1999, 2009; Mo et al., 2020) and citation analysis (e.g., Athar, 2011; Lauscher et al., 2021), both specific to scientific publications, exemplify such methods. In contrast, we include methods that target the general mining of argumentative structures, even if evaluated only in specific domains (e.g., Lauscher et al., 2018).
In argument assessment, we exclude work targeting sentiment analysis (e.g., Socher et al., 2013; Wachsmuth et al., 2014), as it is inherently more generic than other argumentation tasks and, accordingly, well-explored in general natural language understanding. Also, we exclude work on general-purpose natural language inference and common-sense reasoning (Bowman et al., 2015; Rajani et al., 2019; Ponti et al., 2020) in argument reasoning, and we do not cover the body of work on leveraging external structured knowledge for improved reasoning (e.g., Forbes et al., 2020; Lauscher et al., 2020a); we view these methods as more generic reasoning approaches that can, among others, also support argumentative reasoning (e.g., Habernal et al., 2018b), which we do cover in this survey. Finally, our overview of argument generation is limited strictly to argumentative text generation, as in argument summarization (e.g., Syed et al., 2020) and claim synthesis (e.g., Bilu and Slonim, 2016). The enormous body of work on (non-argumentative) natural language generation (Gatt and Krahmer, 2018) is out of our scope.
Note that some applications of CA are typically addressed through larger systems, which are composed of models tackling several of the tasks above. For instance, in argument search, a system might be composed of an argument extraction component (mining), a retrieval component that determines relevant arguments, as well as a quality rating component (assessment) to rank the arguments retrieved for a given topic (Wachsmuth et al., 2017b). In this work, we focus on core CA tasks and do not specifically discuss such composite systems. Within the described scope, we aim for comprehensiveness. However, given the immense body of work on natural language argumentation, we do not claim that this survey is complete.
3.3 Analysis and Annotation Process
We survey the state of the art in CA through the prism of the knowledge types leveraged in existing approaches. For each of the four CA subareas, we conducted our literature research in two steps: (1) in a pre-study, we collected all papers that we saw as relevant. To this end, we combined our expert knowledge of the field with extensive search in scientific search engines and proceedings of relevant conferences and workshops. On this basis, we established the knowledge pyramid. (2) In an in-depth study, we then selected the 10 most representative papers (according to scientometric indicators and our expert judgment) for each subarea and annotated them with the types of knowledge from the pyramid. We instructed three expert annotators to read each paper carefully. Based on our knowledge definition above and common forms of knowledge we identified in the pre-study, they were asked to decide what types and what forms of knowledge were involved, thus assigning all applicable types from the pyramid to each of the 40 sampled papers.
Agreement.
We measured inter-annotator agreement (IAA) in a top-level and an all-levels variant across all sampled 40 papers (10 for each CA area) in terms of pair-wise averaged Cohen’s κ score. First, for each of the papers, we determined the most specific type of knowledge that it exploits (i.e., the one that is highest in the pyramid). Here, we observe a moderate IAA (Landis and Koch, 1977) with κ = 0.54. Second, across all categories, we observe a substantial IAA of κ = 0.74. All cases of disagreement were discussed thoroughly and resolved jointly.
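For reference, the pairwise-averaged κ can be computed as in the following minimal sketch; the annotator labels below are made-up toy values, not the actual annotations from our study.

```python
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

# Toy labels: the most specific knowledge type that each of three annotators
# assigned to the same five papers (illustrative values, not our actual data).
annotations = {
    "A1": ["linguistic", "arg-specific", "task-specific", "world/topic", "arg-specific"],
    "A2": ["linguistic", "arg-specific", "task-specific", "arg-specific", "arg-specific"],
    "A3": ["linguistic", "task-specific", "task-specific", "world/topic", "arg-specific"],
}

kappas = [cohen_kappa_score(annotations[a], annotations[b])
          for a, b in combinations(annotations, 2)]
print("pairwise-averaged Cohen's kappa:", sum(kappas) / len(kappas))
```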
The final distribution of knowledge types identified in papers for each CA subarea is shown in Figure 2b. As expected, almost all works (36 out of 40) leverage linguistic knowledge in some form. In contrast, world and topic knowledge (e.g., common-sense and factual knowledge, logic and rules) seem to be used least across the board. A reason for the latter may lie in the computational complexity of encoding such knowledge in a way that it can benefit concrete approaches to tasks—whereas this is often much more straightforward for argumentation-specific knowledge (e.g., using lexicons) and task-specific knowledge (e.g., adopting a multitask learning setup). Moreover, topic knowledge is likely to make approaches more topic-dependent, namely, less broadly applicable, which is, more generally, often seen as an undesirable property for NLP approaches. We discuss the distribution in detail in the next section.
Pre-Study.
Our aim was to collect as many relevant publications as we could for each of the four CA subareas. We first compiled a list of publications that we were personally aware of (i.e., leveraging “expert knowledge”). Then, we augmented the list by firing queries with relevant keywords (again, compiled based on our expert knowledge) against the ACL Anthology and Google Scholar.
For example, we used the following queries for argument mining: “argument[ation] mining”, “argument[ative] component”, “argument[ative] relation”, and “argument[ative] structure”. For argument generation, we queried “argument generation”, “argument synthesis”, “claim generation”, “claim synthesis”, and “argument summarization”. In addition, we examined all publications from the proceedings of all seven editions (2014–2020) of the Argument Mining workshop series.
In each subarea, we included only publications that propose a computational approach to solving (at least) one CA task; in contrast, we did not consider publications describing shared tasks (Habernal et al., 2018b) or external knowledge resources for CA (Al Khatib et al., 2020a). With these rules in place, we ultimately collected a total of 162 CA papers, entirely listed in Table 1. By analyzing the types of knowledge used by approaches from collected publications, we induced the pyramid of knowledge types in Figure 2 with four coarse-grained knowledge types (§4.1), which was then the basis for our in-depth study (§4.2–§4.5).
Task | Paper | Top Pyramid Level | Task | Paper | Top Pyramid Level
---|---|---|---|---|---
Argument Mining | |||||
Comp. identification | Boltužić and Šnajder (2014) | World and topic | Multiple tasks | Stab and Gurevych (2014) | Arg.-specific |
Ajjour et al. (2017) | Linguistic | Persing and Ng (2020) | Task-specific | ||
Spliethöver et al. (2019) | Linguistic | Lawrence and Reed (2015) | Arg.-specific | ||
Petasis (2019) | Linguistic | Sobhani et al. (2015) | Arg.-specific | ||
Trautmann et al. (2020) | Linguistic | Peldszus and Stede (2015) | Task-specific | ||
Comp. classification | Ong et al. (2014) | Linguistic | Persing and Ng (2016a) | Arg.-specific | |
Sobhani et al. (2015) | Arg.-specific | Eger et al. (2017) | Linguistic | ||
Rinott et al. (2015) | Task-specific | Lawrence and Reed (2017a) | Arg.-specific | ||
Al Khatib et al. (2016) | Linguistic | Lawrence and Reed (2017b) | Arg.-specific | ||
Liebeck et al. (2016) | Linguistic | Potash et al. (2017b) | Arg.-specific | ||
Daxenberger et al. (2017) | Linguistic | Aker et al. (2017) | Arg.-specific | ||
Levy et al. (2017) | Arg.-specific | Niculae et al. (2017) | Arg.-specific | ||
Shnarch et al. (2017) | Arg.-specific | Stab and Gurevych (2017a) | Arg.-specific | ||
Habernal and Gurevych (2017) | Arg.-specific | Saint-Dizier (2017) | Task-specific | ||
Dusmanu et al. (2017) | Arg.-specific | Schulz et al. (2018) | Linguistic | ||
Lauscher et al. (2018) | Arg.-specific | Shnarch et al. (2018) | Linguistic | ||
Lugini and Litman (2018) | Arg.-specific | Eger et al. (2018) | Linguistic | ||
Stab et al. (2018b) | Arg.-specific | Morio and Fujita (2018) | Arg.-specific | ||
Jo et al. (2019) | Linguistic | Gemechu and Reed (2019) | Linguistic | ||
Mensonides et al. (2019) | Arg.-specific | Lin et al. (2019) | Arg.-specific | ||
Reimers et al. (2019) | Arg.-specific | Hewett et al. (2019) | Arg.-specific | ||
Hua et al. (2019b) | Arg.-specific | Haddadan et al. (2019) | Arg.-specific | ||
Relation identification | Cabrio and Villata (2012) | World and topic | Eide (2019) | Arg.-specific | |
Carstens and Toni (2015) | Arg.-specific | Chakrabarty et al. (2019) | Arg.-specific | ||
Cocarascu and Toni (2017) | Linguistic | Huber et al. (2019) | Arg.-specific | ||
Hou and Jochim (2017) | Task-specific | Accuosto and Saggion (2019) | Task-specific | ||
Galassi et al. (2018) | Linguistic | Morio et al. (2020) | Linguistic | ||
Paul et al. (2020) | World and topic | Wang et al. (2020) | Arg.-specific | ||
Argument Assessment | |||||
Stance Detection | Ranade et al. (2013) | Arg.-specific | Quality assessment | Habernal and Gurevych (2016b) | Linguistic |
Hasan and Ng (2014) | Linguistic | Ghosh et al. (2016) | Arg.-specific | ||
Sobhani et al. (2015) | Arg.-specific | Wachsmuth et al. (2016) | Arg.-specific | ||
Persing and Ng (2016b) | Arg.-specific | Wei et al. (2016) | Task-specific | ||
Toledo-Ronen et al. (2016) | Task-specific | Tan et al. (2016) | Task-specific | ||
Sobhani et al. (2017) | Linguistic | Chalaguine and Schulz (2017) | Linguistic | ||
Bar-Haim et al. (2017a) | Arg.-specific | Stab and Gurevych (2017b) | Linguistic | ||
Boltužić and Šnajder (2017) | Task-specific | Potash et al. (2017a) | Linguistic | ||
Bar-Haim et al. (2017b) | Task-specific | Wachsmuth et al. (2017c) | Arg.-specific | ||
Rajendran et al. (2018a) | Linguistic | Persing and Ng (2017) | Task-specific | ||
Sun et al. (2018) | Arg.-specific | Lukin et al. (2017) | Task-specific | ||
Rajendran et al. (2018b) | Arg.-specific | Wachsmuth et al. (2017a) | Task-specific | ||
Kotonya and Toni (2019) | Linguistic | Simpson and Gurevych (2018) | Linguistic | ||
Durmus et al. (2019) | Linguistic | Gu et al. (2018) | Linguistic | ||
Durmus and Cardie (2019) | Task-specific | Passon et al. (2018) | Arg.-specific | ||
Toledo-Ronen et al. (2020) | Linguistic | Ji et al. (2018) | Task-specific | ||
Kobbe et al. (2020a) | Arg.-specific | Durmus and Cardie (2018) | Task-specific | ||
Sirrianni et al. (2020) | Arg.-specific | El Baff et al. (2018) | Task-specific | ||
Somasundaran and Wiebe (2010) | Arg.-specific | Dumani and Schenkel (2019) | Linguistic | ||
Porco and Goldwasser (2020) | Task-specific | Potthast et al. (2019) | Linguistic | ||
Scialom et al. (2020) | Task-specific | Gleize et al. (2019) | Linguistic | ||
Frame identification | Ajjour et al. (2019) | Task-specific | Toledo et al. (2019) | Linguistic | |
Trautmann (2020) | Linguistic | Potash et al. (2019) | Linguistic | ||
Quality assessment | Liu et al. (2008) | Task-specific | Gretz et al. (2020b) | Linguistic | |
Persing et al. (2010) | Linguistic | El Baff et al. (2020) | Linguistic | ||
Persing and Ng (2013) | Linguistic | Wachsmuth and Werner (2020) | Linguistic | ||
Ong et al. (2014) | Linguistic | Li et al. (2020) | Arg.-specific | ||
Persing and Ng (2014) | Linguistic | Al Khatib et al. (2020b) | Task-specific | ||
Song et al. (2014) | Arg.-specific | Lauscher et al. (2020b) | Task-specific | ||
Persing and Ng (2015) | Arg.-specific | Skitalinskaya et al. (2021) | Linguistic | ||
Stab and Gurevych (2016) | Linguistic | Other tasks | Kobbe et al. (2020b) | Task-specific | |
Habernal and Gurevych (2016a) | Linguistic | Yang et al. (2019) | Linguistic | ||
Argument Reasoning | |||||
Warrant identification | Boltužić and Šnajder (2016) | Linguistic | Scheme classification | Feng and Hirst (2011) | Task-specific |
Sui et al. (2018) | Linguistic | Song et al. (2014) | Task-specific | ||
Liebeck et al. (2018) | Linguistic | Lawrence and Reed (2015) | Linguistic | ||
Tian et al. (2018) | Linguistic | Liga (2019) | Linguistic | ||
Brassard et al. (2018) | Linguistic | ||||
Sui et al. (2018) | Linguistic | Fallacy Recognition | Habernal et al. (2018c) | Linguistic | |
Botschen et al. (2018) | World and topic | Habernal et al. (2018a) | Linguistic | ||
Choi and Lee (2018) | World and topic | Delobelle et al. (2019) | Linguistic | ||
Niven and Kao (2019) | World and topic | Other tasks | Becker et al. (2021) | World and topic | |
Argument Generation | |||||
Summarization | Egan et al. (2016) | Linguistic | Argument synthesis | Zukerman et al. (2000) | Task-specific |
Wang and Ling (2016) | Linguistic | Carenini and Moore (2006) | Task-specific | ||
Syed et al. (2020) | Arg.-specific | Sato et al. (2015) | Linguistic | ||
Alshomary et al. (2020a) | Arg.-specific | Reisert et al. (2015) | Arg.-specific | ||
Bar-Haim et al. (2020) | Arg.-specific | Hua and Wang (2018) | World and topic | ||
Claim Synthesis | Bilu and Slonim (2016) | Task-specific | Wachsmuth et al. (2018) | Arg.-specific | |
Chen et al. (2018) | World and topic | Le et al. (2018) | Arg.-specific | ||
Hidey and McKeown (2019) | Arg.-specific | Hua et al. (2019a) | World and topic | ||
Alshomary et al. (2020b) | Arg.-specific | Hua and Wang (2019) | World and topic | ||
Gretz et al. (2020a) | Arg.-specific | El Baff et al. (2019) | Arg.-specific | ||
Alshomary et al. (2021) | Task-specific | Bilu et al. (2019) | Task-specific | ||
Schiller et al. (2021) | Task-specific |
In-Depth Study.
In the second step, we used the knowledge pyramid as the basis for an in-depth analysis of a subset of 40 publications (10 per research area; bold in Table 1). Our selection of prominent papers for the in-depth study was guided by the following set of (sometimes mutually conflicting) criteria: (1) maximize the scientific impact of the publications in the sample, measured as a combination of the number of publication citations and our expert judgment of the publication’s overall impact on the CA field or subarea; (2) maximize the number of different methodological approaches in the sample; and (3) maximize the representation of different researchers and research groups.
Once we had selected the 40 publications, three authors of this paper independently labeled all of them with the knowledge types from the pyramid. This allowed us to measure the inter-annotator agreement and to test the extent of shared understanding of the knowledge types captured by the pyramid and their usage in individual methodological approaches in CA. While we are aware that we cannot draw statistically significant conclusions based on a sample of such a limited size, we believe that our findings and this in-depth perspective will still be informative for the CA community.
4 Knowledge in Argumentation
As a result of our survey, we now introduce the argumentation knowledge pyramid, our proposed taxonomy encompassing four coarse-grained types of knowledge leveraged in CA. We then profile the large body of papers from the four CA subareas through the lens of the pyramid.
4.1 Argumentation Knowledge Pyramid
Based on the findings of our pre-study, we identify four coarse-grained types of knowledge being leveraged in CA research, which we organize in a taxonomy, as depicted in Figure 2. We chose to visualize our organization as a pyramid because it allows us to express a hierarchical generality-specificity relationship between the different types of knowledge.
Linguistic Knowledge.
At the bottom of the pyramid is linguistic knowledge, leveraged by virtually all CA models and needed in practically all NLP tasks. In our pyramid, linguistic knowledge is a broad category that includes features derived from word n-grams, information about linguistic structure (e.g., part-of-speech tags, dependency parses), as well as features based on models of distributional semantics, such as (pre-trained) word embedding spaces (e.g., Mikolov et al., 2013; Pennington et al., 2014; Bojanowski et al., 2017) or representation spaces spanned by neural language models (LMs) (e.g., Clark et al., 2020; Devlin et al., 2019). We also consider leveraging distributional spaces (word embeddings or pretrained LMs) built for specific (argumentative) tasks and domains as a form of linguistic knowledge, since such representation spaces are induced purely from textual corpora without any external supervision signal.
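To illustrate both views of linguistic knowledge, the sketch below derives sparse n-gram features and dense sentence embeddings for two toy arguments; the sentence-transformers library and the model name are examples under the assumption that they are available, not choices made by any surveyed paper.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

texts = ["We should ban plastic bags because they pollute oceans.",
         "Plastic bags are convenient and cheap."]

# Sparse symbolic view of linguistic knowledge: word n-gram counts.
ngram_features = CountVectorizer(ngram_range=(1, 2)).fit_transform(texts)

# Dense distributional view: sentence vectors from a pretrained encoder
# (the sentence-transformers library and model name are examples only).
try:
    from sentence_transformers import SentenceTransformer
    embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(texts)
except ImportError:
    embeddings = np.zeros((len(texts), 384))  # fallback if the library is missing

print(ngram_features.shape, np.asarray(embeddings).shape)
```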
World and Topic Knowledge.
Above the linguistic knowledge, we place the category of world and topic knowledge, in which we bundle all types of knowledge that are generally considered useful for various natural language understanding tasks, but that are not (or even cannot be) directly derived from textual corpora. This includes all types of common-sense knowledge, task-independent world knowledge (also known as factual knowledge), logical general-purpose axioms and rules, and similar. In most cases, such knowledge is collected from external structured or semi-structured resources (Sap et al., 2020; Lauscher et al., 2020a; Ji et al., 2021). Knowledge about a specific debate topic (e.g., legalization of marijuana) falls under this category, since topics encompass a set of real-world concepts (e.g., marijuana) and related facts (e.g., medical aspects of marijuana usage). Some systems explicitly require the debate topic as input, in order to gather topic knowledge from external sources.
Argumentation-Specific Knowledge.
The third category in our knowledge pyramid encompasses knowledge about what constitutes argumentation, arguments, and argumentative language, including knowledge about subjective language (Stede and Schneider, 2018). This includes models of argumentation and argumentative structures (Toulmin, 2003; Bentahar et al., 2010a), models of cultural aspects and moral values (Haidt and Joseph, 2004; Graham et al., 2013), lexicons with terms indicating subjective, psychological, and moral categories (Hu and Liu, 2004; Tausczik and Pennebaker, 2010; Graham et al., 2009), predictions of subjectivity and sentiment classification models (Socher et al., 2013), and so forth. While sentiment, emotions, and affect are not argumentative per se, subjectivity is ingrained in argumentation and strongly influences argumentative manifestations (or lack thereof).
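A common pattern is to turn such resources into additional input features, e.g., lexicon hit rates or the predictions of a pretrained sentiment classifier. The toy lexicon and the classifier usage below are illustrative assumptions and do not reproduce the specific resources cited above.

```python
from transformers import pipeline

# Toy stand-in for a subjectivity/moral-value lexicon (illustrative only;
# real work would load the resources cited above).
SUBJECTIVITY_LEXICON = {"clearly", "awful", "great", "must", "fairness"}

def lexicon_feature(text: str) -> float:
    tokens = text.lower().split()
    return sum(t in SUBJECTIVITY_LEXICON for t in tokens) / max(len(tokens), 1)

# Predictions of an off-the-shelf sentiment classifier as an extra feature
# (the default checkpoint loaded here is an example, not a surveyed resource).
sentiment = pipeline("sentiment-analysis")

text = "Banning plastic bags is clearly a matter of fairness."
features = [lexicon_feature(text), sentiment(text)[0]["score"]]
print(features)
```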
Task-Specific Knowledge.
As the most specific type of knowledge, this category covers the types of knowledge that are relevant only for a specific CA task or a small set of tasks. For instance, leveraging discourse structure is considered beneficial for argumentative relation identification (Stab and Gurevych, 2014; Persing and Ng, 2016a; Opitz and Frank, 2019), a common argument mining task.
Table 2 illustrates the four types of knowledge from the pyramid by means of concrete examples.
Knowledge | Source | CA Subarea (Task) | Introduced | Explanation
---|---|---|---|---|
Linguistic | Habernal et al. (2018c) | Argument reasoning (fallacy recognition) | Explicitly | Semantic associations between lexical units in the word embedding space enable generalization across different lexicalizations of ad hominem arguments (e.g., “pretentions [explanation]” vs. “narcissistic [idiot]”) and wordings that point to fallacious reasoning (e.g., “[if only you wouldn’t rely on] fallacious arguments” vs. “[another] unsubstantiated statement”). |
World and topic | Hua and Wang (2019) | Argument generation (argument synthesis) | Implicitly | The structure of the argument – the sequence of Premise, Claim, and Functional utterances – is conditioned on the topic of debate. For example, Reddit arguments on political topics (e.g., “US cutting off foreign aid”) tend to start with a Claim (“It can be a useful political bargaining chip”), continue with supporting Premises (e.g., “US cut financial aid to Uganda due to its plans to make homosexuality a crime”), and finish with Functional utterances (e.g., “Please change your mind!”). |
Arg.-specific | Wachsmuth et al. (2017c) | Argument assessment (quality assessment) | Explicitly | Argument relevance is determined in an “objective” way: argument “reuse”, where one argument leverages the conclusion of another, is the basis for inducing a large-scale (directed) argument graph. Running the PageRank algorithm on that graph yields relevance scores for all arguments. Such an objective, content-agnostic relevance score can be useful for a wide variety of CA tasks; knowledge about argument reuse thus represents argumentation-specific knowledge. |
Task-specific | Peldszus and Stede (2015) | Argument mining (multiple tasks) | Implicitly | The argumentative structure of the text is assumed to be a tree: there is one central claim, which forms the root of the tree; other argumentative components are the nodes; and edges reflect support or attack relations between argumentative discourse components. |
4.2 Knowledge in Argument Mining
Pre-Study.
From the 162 papers we surveyed, 56 belong to the subarea of argument mining, which is the second-largest subarea after argument assessment. The publications that we analyzed were published in the period from 2012 to 2020. Of these 56 publications, 17 relied purely on linguistic knowledge, three exploited world and topic knowledge as the most specific knowledge type, 30 leveraged argumentation-specific knowledge, and six task-specific knowledge. We next describe the detailed findings of our in-depth analysis.
In-Depth Study.
Table 3 shows the results of our assignment of all applicable knowledge types to 10 sampled argument mining papers, published between 2012 and 2018. All but one rely on linguistic knowledge: Earlier approaches leveraged traditional linguistic features, such as n-grams and syntactic features (e.g., Peldszus and Stede, 2015; Lugini and Litman, 2018), whereas later work resorted to word embeddings as the dominant representation (e.g., Eger et al., 2017; Niculae et al., 2017; Daxenberger et al., 2017; Galassi et al., 2018).
Approach | Linguistic | World and Topic | Arg.-specific | Task-specific
---|---|---|---|---
Argument Mining | ||||
Cabrio and Villata (2012) | ✗ | ✓ | ✗ | ✗ |
Peldszus and Stede (2015) | ✓ | ✗ | ✓ | ✓ |
Daxenberger et al. (2017) | ✓ | ✗ | ✗ | ✗ |
Eger et al. (2017) | ✓ | ✗ | ✗ | ✓ |
Niculae et al. (2017) | ✓ | ✗ | ✗ | ✗ |
Lawrence and Reed (2017b) | ✓ | ✓ | ✓ | ✗ |
Levy et al. (2017) | ✓ | ✓ | ✓ | ✗ |
Ajjour et al. (2017) | ✓ | ✗ | ✓ | ✗ |
Galassi et al. (2018) | ✓ | ✗ | ✗ | ✗ |
Lugini and Litman (2018) | ✓ | ✗ | ✗ | ✓ |
Argument Assessment | ||||
Persing and Ng (2015) | ✓ | ✗ | ✓ | ✗ |
Habernal and Gurevych (2016b) | ✓ | ✗ | ✓ | ✗ |
Wachsmuth et al. (2017c) | ✗ | ✗ | ✓ | ✗ |
Bar-Haim et al. (2017a) | ✓ | ✓ | ✓ | ✗ |
Durmus and Cardie (2018) | ✓ | ✗ | ✓ | ✓ |
Trautmann (2020) | ✓ | ✗ | ✗ | ✗ |
Kobbe et al. (2020b) | ✓ | ✗ | ✓ | ✗ |
El Baff et al. (2020) | ✓ | ✗ | ✓ | ✓ |
Al Khatib et al. (2020b) | ✓ | ✗ | ✗ | ✓ |
Gretz et al. (2020b) | ✓ | ✗ | ✗ | ✗ |
Argument Reasoning | ||||
Feng and Hirst (2011) | ✗ | ✗ | ✗ | ✓ |
Lawrence and Reed (2015) | ✓ | ✗ | ✗ | ✓ |
Boltužić and Šnajder (2016) | ✓ | ✗ | ✗ | ✗ |
Habernal et al. (2018c) | ✓ | ✗ | ✗ | ✗ |
Choi and Lee (2018) | ✓ | ✓ | ✗ | ✗ |
Tian et al. (2018) | ✓ | ✗ | ✗ | ✗ |
Botschen et al. (2018) | ✓ | ✓ | ✗ | ✗ |
Delobelle et al. (2019) | ✓ | ✗ | ✗ | ✗ |
Niven and Kao (2019) | ✓ | ✗ | ✗ | ✗ |
Liga (2019) | ✓ | ✗ | ✗ | ✗ |
Argument Generation | ||||
Zukerman et al. (2000) | ✗ | ✗ | ✓ | ✓ |
Sato et al. (2015) | ✓ | ✗ | ✓ | ✗ |
Bilu and Slonim (2016) | ✓ | ✗ | ✓ | ✓ |
Wang and Ling (2016) | ✓ | ✗ | ✗ | ✗ |
El Baff et al. (2019) | ✓ | ✗ | ✗ | ✓ |
Hua et al. (2019b) | ✓ | ✓ | ✗ | ✗ |
Bar-Haim et al. (2020) | ✓ | ✗ | ✓ | ✗ |
Gretz et al. (2020a) | ✓ | ✗ | ✗ | ✗ |
Alshomary et al. (2021) | ✓ | ✗ | ✗ | ✓ |
Schiller et al. (2021) | ✓ | ✗ | ✓ | ✓ |
A few papers exploit other types of knowledge. Cabrio and Villata (2012), for example, leverage a pretrained NLI model to analyze online debate interactions. While they resort to the abstract argumentation framework of Dung (1995), they do so only for the purposes of evaluation, which is why we do not judge their approach as reliant on argumentation-specific knowledge. Lawrence and Reed (2017b) use, in addition to word embeddings, world and topic knowledge from WordNet and argumentation-specific knowledge in the form of structural assumptions for mining large-scale debates. Ajjour et al. (2017) combine linguistic knowledge in the form of GloVe embeddings (Pennington et al., 2014) and other linguistic features with an argumentation-specific lexicon of discourse markers. Task-specific mining knowledge is mostly leveraged in multi-task learning scenarios (Lugini and Litman, 2018) or when aiming to extract arguments with more complex structures, that is, with multiple components and/or chains of claims (Eger et al., 2017; Peldszus and Stede, 2015). For instance, Peldszus and Stede (2015) jointly predict different aspects of the argument structure and then apply minimum spanning tree decoding, exploiting the fact that mining argument structure bears similarities to discourse parsing. The only template-based approach we cover is that of Levy et al. (2017), who construct queries using templates and ground sentences in Wikipedia concepts (i.e., world and topic knowledge) for unsupervised claim detection. Their approach also leverages an argumentation-specific lexicon of claim-related words (i.e., arg.-specific knowledge), next to the linguistic and world/topic knowledge.
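The spanning-tree decoding step can be sketched as follows, using networkx's Chu-Liu/Edmonds routine as a stand-in for the decoder actually employed by Peldszus and Stede (2015); the edge scores are dummy values invented for the sketch.

```python
import networkx as nx

# Dummy scores: the model's confidence that component i attaches to component j
# (values are invented for illustration; component 0 is the central claim).
scores = {(1, 0): 0.9, (2, 0): 0.2, (2, 1): 0.7,
          (0, 1): 0.1, (0, 2): 0.1, (1, 2): 0.3}

G = nx.DiGraph()
for (child, parent), s in scores.items():
    G.add_edge(parent, child, weight=s)  # edges point from parent to child

# Decode the highest-scoring tree over all components (Chu-Liu/Edmonds);
# assumes a networkx version that exposes maximum_spanning_arborescence.
tree = nx.maximum_spanning_arborescence(G, attr="weight")
print(sorted(tree.edges()))  # e.g., [(0, 1), (1, 2)]
```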
4.3 Knowledge in Argument Assessment
Pre-Study.
The largest portion of the 162 publications, 64 in total, belong to the area of argument assessment, spanning the time period from 2008 to 2021. Of those publications, 29 leverage only linguistic knowledge, but almost 20 rely on task-specific knowledge as the most specific knowledge type. Interestingly, none of the surveyed papers use world and topic knowledge as the most specific knowledge type. That is, if they rely on world and topic knowledge, they also leverage argumentation-specific and/or task-specific knowledge.
In-Depth Study.
The 10 assessment papers analyzed in depth (period 2015–2020) reveal that, much like in argument mining, most of the work models linguistic knowledge (e.g., Trautmann, 2020; Kobbe et al., 2020b). For example, Gretz et al. (2020b) assess argument quality based on a representation that combines bag-of-words features (i.e., a sparse symbolic text representation) with latent embeddings derived both from static GloVe word embeddings (Pennington et al., 2014) and from a pretrained BERT model (Devlin et al., 2019). Most of the papers at the linguistic knowledge level of the pyramid, however, rely predominantly on sparse symbolic (i.e., word-based) linguistic features (e.g., Persing and Ng, 2015; Bar-Haim et al., 2017b; Durmus and Cardie, 2018; Al Khatib et al., 2020b; El Baff et al., 2020).
Only one of the 10 selected publications resorts to world and topic knowledge: Bar-Haim et al. (2017a) map the content of claims to Wikipedia concepts for stance classification. A common technique in argument assessment is to include argumentation-specific knowledge about sentiment or subjectivity: this is motivated by the intuition that these features directly affect argumentation quality and correlate with stances. For instance, Wachsmuth et al. (2017a) note that emotional appeal, which is clearly correlated with the sentiment of the text, may affect the rhetorical effectiveness of arguments. Technically, the information on subjectivity is introduced either by means of subjective lexica (e.g., Bar-Haim et al., 2017a; Durmus and Cardie, 2018; El Baff et al., 2020) or via predictions of pretrained sentiment classifiers (Habernal and Gurevych, 2016b). In a different example of the use of argumentation-specific knowledge, Wachsmuth et al. (2017c) exploit reuses between arguments (e.g., a premise of one argument uses the claim of another) to quantify argument relevance by means of graph-based propagation with PageRank.
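The graph-based relevance scoring can be illustrated in a few lines of networkx; the argument reuse graph below is a made-up toy example, not data from Wachsmuth et al. (2017c).

```python
import networkx as nx

# Toy "argument reuse" graph: an edge a -> b means argument a reuses the
# conclusion of argument b as one of its premises (invented example data).
reuse = nx.DiGraph([("arg1", "arg2"), ("arg3", "arg2"),
                    ("arg4", "arg3"), ("arg2", "arg5")])

# PageRank over the reuse graph yields a content-agnostic relevance score
# for every argument; frequently reused arguments score higher.
relevance = nx.pagerank(reuse, alpha=0.85)
print(sorted(relevance.items(), key=lambda kv: -kv[1]))
```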
A notable task-specific knowledge category is the use of user information for argument quality assessment. According to theory (Wachsmuth et al., 2017a), argument quality does not only depend on the text utterance itself but also on the speaker and the audience, for example, on their prior beliefs and their cultural context. To model this, Durmus and Cardie (2018) include information about users’ prior beliefs as predictors of arguments’ persuasiveness, Al Khatib et al. (2020b) predict persuasiveness using user-specific feature vectors, and El Baff et al. (2020) train audience-specific classifiers.
4.4 Knowledge in Argument Reasoning
Pre-Study.
According to our pre-study, argument reasoning is the smallest subarea of CA, with only 17 (out of 162) papers published in the period between 2011 and 2021. The tasks in this subarea include argumentation scheme classification (Feng and Hirst, 2011; Lawrence and Reed, 2015), warrant identification and exploitation (Habernal et al., 2018b; Boltužić and Šnajder, 2016), and fallacy recognition (Habernal et al., 2018c; Delobelle et al., 2019). Linguistic knowledge is also the most commonly used type of knowledge in reasoning (11 out of 17 papers rely on some type of linguistic knowledge), and four papers in this subarea exploit world and topic knowledge.
In-Depth Study.
In our subset from argument reasoning, general-domain embeddings are by far the most frequently employed form of knowledge (Boltužić and Šnajder, 2016; Habernal et al., 2018c; Choi and Lee, 2018; Tian et al., 2018; Botschen et al., 2018; Delobelle et al., 2019; Niven and Kao, 2019). In contrast, Lawrence and Reed (2015) use traditional linguistic features, and Liga (2019) models syntactic features with tree kernels to recognize specific reasoning structures in arguments. Task-specific knowledge is modeled by Feng and Hirst (2011), who design specific features for classifying argumentation schemes, and by Lawrence and Reed (2015), who utilize features specific to individual types of premises and conclusions. Choi and Lee (2018) use a pretrained natural language inference model to select the correct warrant in warrant identification. For the same task, Botschen et al. (2018) leverage event knowledge about common situations (from FrameNet) and factual knowledge about entities (from Wikidata).
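Selecting a warrant with an off-the-shelf NLI model might look like the sketch below; the checkpoint name, input formatting, and example claim are assumptions made for this illustration and do not reflect the actual setup of Choi and Lee (2018).

```python
from transformers import pipeline

# Example NLI checkpoint (an assumption; not the model used by Choi and Lee, 2018).
nli = pipeline("text-classification", model="roberta-large-mnli")

claim = "Felons should regain the right to vote."
premise = "People who have served their sentence have paid their debt to society."
warrants = ["paying one's debt to society restores civic rights",
            "voting has nothing to do with serving a sentence"]

# Score each warrant by how strongly premise + warrant entails the claim.
def entailment_score(warrant: str) -> float:
    outputs = nli({"text": f"{premise} And {warrant}.", "text_pair": claim}, top_k=None)
    return next(o["score"] for o in outputs if o["label"].upper() == "ENTAILMENT")

best = max(warrants, key=entailment_score)
print("selected warrant:", best)
```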
4.5 Knowledge in Argument Generation
Pre-Study.
Finally, we surveyed 23 generation papers, ranging from 2000 to 2021. Argumentation-specific knowledge is the most specific knowledge type in most (10) publications. Six publications have task-specific knowledge as the most specific knowledge type, and four do not employ anything more specific than world and topic knowledge. Unlike in other subareas, only a few publications (3) in argument generation rely purely on linguistic knowledge. Common argument generation tasks include argument summarization (Egan et al., 2016; Bar-Haim et al., 2020), claim synthesis (Bilu et al., 2019; Alshomary et al., 2021), and argument synthesis (Zukerman et al., 2000; Sato et al., 2015).
In-Depth Study.
As in the case of argument reasoning, many generation approaches employ linguistic knowledge in the form of general-purpose embeddings (Wang and Ling, 2016; Hua et al., 2019a; Bar-Haim et al., 2020; Gretz et al., 2020a; Schiller et al., 2021). Sato et al. (2015) report using traditional (i.e., sparse, symbolic) linguistic features, and Bilu and Slonim (2016) use such features for predicting the suitability of candidate claims.
World and topic knowledge is utilized by Hua et al. (2019a), who retrieve Wikipedia passages as claim candidates. As argumentation-specific knowledge, Bar-Haim et al. (2020) use an external quality classifier. In a similar vein, Schiller et al. (2021) incorporate the output from argument and stance classifiers from the ArgumenText API (Stab et al., 2018a) and condition the generation model on control codes encoding topic, stance, and aspect of the argument. Alshomary et al. (2021) condition their model on audience beliefs by deriving bag-of-words representations from the authors’ texts and then fine-tuning a pretrained language model. Sato et al. (2015) model (argumentation-specific) knowledge about values. Predicate and sentiment lexica are employed by Bilu and Slonim (2016), whereas El Baff et al. (2019) learn likely sequences of argumentative units from features computed from argumentation-specific knowledge. They additionally include task-specific knowledge by using a knowledge base with components of claims. A pioneering work that stands out is the approach of Zukerman et al. (2000), which uses argumentation-specific knowledge about micro-structure in combination with task-specific discourse templates.
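As a rough illustration of control-code conditioning, the following sketch prepends topic, stance, and aspect codes to the input of a pretrained language model; the control-token format, checkpoint, and decoding parameters are placeholders and do not correspond to the actual configuration of Schiller et al. (2021).

```python
# Sketch: conditioning a language model on topic/stance/aspect control codes.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Control codes prepended to the prompt; in practice the model would be
# fine-tuned on argument data annotated with these attributes so that the
# codes actually steer generation.
prompt = "<topic=nuclear energy> <stance=con> <aspect=waste disposal> Argument:"
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=40,
    do_sample=True,
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```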
5 Emerging Trends and Discussion
We now summarize the emerging trends and open challenges in the four CA areas, abstracted from our analyses of the use of knowledge types.
General Observations.
Most of the 162 publications that we reviewed aim to capture some type of “advanced” knowledge, that is, knowledge beyond what can be inferred from the text data alone: 60 publications rely purely on linguistic knowledge, whereas the remaining 102 model at least one of the other three higher knowledge types. This empirically confirms the intuition that success in CA crucially depends on complex knowledge that is external to the text. Also, unsurprisingly, argumentation-specific knowledge is overall the most common type of external knowledge used in CA approaches: Argumentation-specific knowledge can, in principle, facilitate any computational argumentation task. In comparison, world and common-sense knowledge are fairly underrepresented: Only seven of the 40 publications in our in-depth study rely on some variant of it. This is surprising, given that the approaches that leverage such knowledge consistently report substantial performance gains.
Comparison across Types of Knowledge.
We observe differences in the form in which the different knowledge types (e.g., linguistic vs. argumentation-specific knowledge) are commonly provided and incorporated into methodological approaches. We provide examples in Table 4.
Table 4: Examples of common techniques for modeling each type of knowledge.

Type | Common Modeling Techniques
---|---
Task-specific | Structure (e.g., multitask learning), user information (e.g., features), …
Argumentation-specific | Sentiment (e.g., lexicon, external classifier), argumentation (e.g., fine-tuning), …
World and topic | Inference knowledge (e.g., infusion), world knowledge (e.g., linking to Wikipedia), …
Linguistic | n-grams (e.g., traditional features), general semantics (e.g., GloVe embeddings), …
Comparison across Areas.
We also note substantial differences across the four high-level CA subareas. The predominant most specific knowledge types vary across the areas: in argument mining and assessment, linguistic and argumentation-specific knowledge are most commonly employed, whereas in argument reasoning approaches, world and topic knowledge (e.g., knowledge about reasoning mechanisms) represents the most common top-level category from the pyramid. In argument generation, argumentation-specific and task-specific knowledge were the most common top-level categories. We believe that this variance is due to the nature of the tasks in each area: Predicting argumentative structures in argument mining is strongly driven by lexical cues (linguistic knowledge) and structural aspects (argumentation-specific knowledge). Despite being studied most extensively, argument mining rarely exploits world and topic knowledge (e.g., from knowledge bases or lexico-semantic resources): There is possibly room for progress in argument mining from more extensive exploitation of structured knowledge sources.
As previously suggested by Wachsmuth et al. (2017a), we find that argument assessment relies on a combination of linguistic features and higher-level argumentation-related properties that are assessed independently, such as sentiment. Argument reasoning, in contrast, strongly relies on basic inference rules and general world knowledge. Finally, the knowledge used in argument generation seems to be highly task- and domain-dependent.
Not only the types of knowledge but also the techniques employed for injecting that knowledge into CA models substantially differ across the subareas. Considering linguistic knowledge, for example, argument assessment approaches predominantly use lexical cues and traditional symbolic text representations, whereas the body of work on argument reasoning primarily relies on latent semantic representations (i.e., embeddings). Most variation in terms of knowledge modeling techniques is found in the argument generation area. Here, the techniques range from template- and structure-based approaches to external lexica and classifiers to embeddings and infusion.
Diachronic Analysis.
Figure 3 depicts the temporal development of knowledge modeling techniques in CA, with year, CA subarea, and knowledge type as dimensions. We analyze four time periods, corresponding to pioneering work (2000–2010), the rise of CA in NLP (2011–2015), the shift to distributional methods (2016–2018), and the most recent trends (2019–2021).
This diachronic analysis reveals that CA is roughly aligned with trends observed in other NLP areas: in the pre-neural era before 2016, knowledge was typically modeled via features, sometimes drawing on external resources and the outputs of previously trained classifiers (i.e., pipelined approaches). Later, more advanced techniques such as grounding, infusion, and above all embeddings became more popular. However, we note that distinct techniques are used for the different knowledge types; embeddings, in particular, have been used exclusively to encode linguistic knowledge. Although representation learning can be applied to other argumentative resources, CA efforts in this direction have been few and far between (e.g., Toledo-Ronen et al., 2016; Al Khatib et al., 2020a). This warrants more CA work on embedding structured knowledge and on moving towards a unified argumentative representation space that would support the whole spectrum of CA tasks.
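As one possible direction for embedding structured argumentative knowledge, the following minimal TransE-style sketch learns entity and relation vectors from knowledge-base triples; the triples, relation names, and hyperparameters are invented for illustration and do not correspond to any surveyed approach.

```python
# Sketch: TransE-style embedding of structured (argumentative) knowledge.
import torch
import torch.nn as nn

# Toy knowledge base; entities and relations are invented for illustration.
entities = ["smoking_ban", "public_health", "individual_freedom"]
relations = ["supports", "attacks"]
triples = [
    ("smoking_ban", "supports", "public_health"),
    ("smoking_ban", "attacks", "individual_freedom"),
]
ent2id = {e: i for i, e in enumerate(entities)}
rel2id = {r: i for i, r in enumerate(relations)}

dim = 32
ent_emb = nn.Embedding(len(entities), dim)
rel_emb = nn.Embedding(len(relations), dim)
optimizer = torch.optim.Adam(
    list(ent_emb.parameters()) + list(rel_emb.parameters()), lr=0.01
)

def score(h, r, t):
    # TransE assumption: for a true triple, head + relation ≈ tail.
    return torch.norm(ent_emb(h) + rel_emb(r) - ent_emb(t), dim=-1)

heads = torch.tensor([ent2id[h] for h, _, _ in triples])
rels = torch.tensor([rel2id[r] for _, r, _ in triples])
tails = torch.tensor([ent2id[t] for _, _, t in triples])

for _ in range(100):
    neg_tails = torch.randint(0, len(entities), tails.shape)  # random negatives
    # Margin loss: true triples should score (i.e., lie) closer than negatives.
    loss = torch.relu(1.0 + score(heads, rels, tails) - score(heads, rels, neg_tails)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```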
6 Where Should We Go from Here?
Mastering argumentative discourse requires various types of advanced knowledge (Moens, 2018), making CA one of the most complex problems in AI (Atkinson et al., 2017). This raises the question of a suitable path to reaching argumentative proficiency for computational models. In this survey, we identified empirical evidence that integrating advanced knowledge can lead to performance improvements on a range of CA tasks. In the following, we pick out what we see as key ideas toward the goal of mastering argumentation computationally.
Argument mining is often seen as a structure-oriented task. Lawrence and Reed (2017a) brought up the notion that topic knowledge may actually predict relations between argument components. Eger et al. (2017), on the other hand, formulated mining of argument structure as an end-to-end task. Integrating these two views and combining respective methods could hold much promise.
Despite an abundance of work on encoding and leveraging common sense knowledge (e.g., Lauscher et al., 2020a; Lin et al., 2021), argument assessment methods rarely decompose arguments into concepts, with the work of Bar-Haim et al. (2017a) on stance classification as a positive exception. Despite some evidence that integrating common-sense knowledge into argument reasoning tasks is difficult (Botschen et al., 2018), there is no alternative to accurately representing and encoding common-sense knowledge if we are to build reliable CA systems. Beyond that, Kobbe et al. (2020b) looked at the impact of morals on argument quality. Such research on modeling fine-grained, socially and culturally dependent knowledge, such as values and social norms across languages, is still in its infancy in NLP in general. Systematic research on building respective knowledge sources and benchmarks could push CA to the next level.
As emphasized by existing work (e.g., Stede and Schneider, 2018), argumentation is inherently social and thus highly dependent on the relationship between the speaker and her audience. A more straightforward integration of knowledge about the speaker could prove beneficial: The work of Alshomary et al. (2021), which encodes the speaker’s beliefs in argument generation, is a step in this direction.
In sum, what we believe is missing in existing work and what could drive the future of CA is a unified knowledge representation space that would aggregate and consolidate all CA-relevant knowledge, and be universally beneficial across CA tasks. As shown in this survey, CA-relevant knowledge is fragmented across heterogeneous sources (e.g., corpora, knowledge bases, lexicons) and coupled only sporadically and in an ad-hoc (not principled) manner. Considering the modest sizes of existing CA resources, a methodological orientation to modular and sample-efficient learning and adaptation (Houlsby et al., 2019; Gururangan et al., 2020; Ponti et al., 2022) could provide means to this end.
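As a concrete instance of such modular and sample-efficient adaptation, the sketch below shows a bottleneck adapter in the style of Houlsby et al. (2019), which could be inserted into a frozen pretrained encoder and trained separately per CA task; the dimensions and usage are placeholders.

```python
# Sketch: bottleneck adapter module in the style of Houlsby et al. (2019).
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project, residual."""

    def __init__(self, hidden_size: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states + self.up(self.act(self.down(hidden_states)))

# Usage sketch: insert such modules after transformer sublayers of a frozen
# pretrained encoder and update only the adapter parameters for each CA task.
adapter = Adapter()
hidden = torch.randn(2, 16, 768)  # (batch, sequence length, hidden size)
out = adapter(hidden)
```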
7 Conclusion
Motivated by the theoretical importance of knowledge in argumentation and by previous work pointing to the need for more research on incorporating advanced types of knowledge in computational argumentation, we have studied the role of knowledge in the body of research works in the field. In total, we surveyed 162 publications spanning the subareas of argument mining, assessment, reasoning, and generation. To organize the approaches described in these works, we proposed a pyramid-like knowledge taxonomy systematizing the types of knowledge according to their specificity, from basic linguistic to task-specific knowledge.
Our survey yields important findings. Many approaches employing advanced knowledge types (e.g., world and argumentation-specific knowledge) report empirical gains. Still, reliance on such external knowledge types is far from uniform across CA areas: While exploitation of such knowledge is pervasive in argument reasoning and generation, it is far less present in argument mining. We hope that our findings lead to more systematic consideration of different knowledge sources for CA tasks.
Notes
Note that diversifying the sample with respect to methods is different than diversifying it according to knowledge types: two approaches may use the same type(s) of knowledge (e.g., linguistic) while adopting different methods (e.g., syntactic features vs. neural LMs). Our aim was to reduce the methodological redundancy of the sample.
Note that our judgments reflect only the types of knowledge that the approach presented in the paper directly exploits: this is why, for example, we judge the reliance of the approach of Cabrio and Villata (2012) on a pretrained NLI model as exploitation of world and topic knowledge only, even though the NLI model itself (Kouylekov and Negri, 2010) had been trained using a range of linguistic features.
As in the case of Cabrio and Villata (2012) in argument mining, we consider a pretrained NLI model to represent world and topic knowledge.
References