Abstract
Discourse analysis is a highly applicable area of natural language processing. In English and other languages, resources for discourse-based tasks are widely available. Thai, however, has hitherto lacked such resources. We present the Thai Discourse Treebank, the first large Thai corpus annotated in the style of the Penn Discourse Treebank. The resulting corpus has over 10,000 sentences and 18,000 instances of connectives in 33 different relations. We release the corpus alongside our list of 148 potentially polysemous discourse connectives, with a total of 340 form-sense pairs, and their classification criteria to facilitate future research. We also develop models for the connective identification and classification tasks. Our best models achieve an F1 of 0.96 in the identification task and 0.46 in the sense classification task. Our results serve as benchmarks for future models for Thai discourse tasks.
1 Introduction
Discourse analysis is a challenging area of Natural Language Processing (NLP) because models must take into account context larger than a sentence. NLP has reached high accuracy on tasks whose input is a sentence or a pair of sentences, but tasks involving larger linguistic units are attracting more attention from the research community, as discourse processing would improve many downstream tasks such as information extraction (Regneri and Wang, 2012; Tang et al., 2021), text complexity assessment (Davoodi and Kosseim, 2016), and automatic essay grading (Wang et al., 2018; Nadeem et al., 2019). While corpora with discourse annotation exist across languages, such corpora are lacking for languages with little morphology and punctuation. These languages are particularly in need of a discourse corpus because fewer formal cues are available for automated discourse-related tasks such as argument identification. For example, thanks to English morphology, sentential adverbs that signal discourse relations in English often carry the -ly suffix (e.g., additionally, actually). One such language, Thai, lacks both systematic descriptions of discourse elements and annotated corpora that would enable supervised machine learning methods and analysis of discourse phenomena.
The Penn Discourse Treebank (PDTB) (Miltsakaki et al., 2004) proposes a lexically grounded method for discourse annotation. Each discourse relation has two arguments, which are two semantically related contiguous textual spans, and a sense, which categorizes the relationship. For instance, although is a canonical example of a discourse connective that signals a concession relation between the two arguments of the discourse relation, while afterwards might signal a temporal relation. Under the PDTB paradigm, a discourse relation can be explicit (signalled by a connective) or implicit (inferred). Discourse analysis in this style results in a list of discourse relations. PDTB-style annotation has been successfully adopted in many languages such as English (Prasad et al., 2019), Hindi (Oza et al., 2009), Czech (Poláková et al., 2013), and Chinese (Li et al., 2014; Zhou et al., 2014).
Following previous endeavors in these languages, we present the Thai Discourse Treebank, a corpus of Thai discourse connectives annotated on 10,000 sentences from 384 documents in the style of PDTB. All documents are annotated for explicit discourse connectives, their senses, and connectives they are paired with (if any; see Section 3.2). Additionally, 180 out of 384 documents are annotated for connective arguments. We then construct a Thai discourse connective identification system trained on the corpus, which serves as a benchmark for the task of shallow discourse parsing and as a proof of the utility of the corpus. In addition, we provide an extensive list of the connectives and the analysis of their linguistic properties.
At first glance, the task of annotating explicit discourse connectives may seem trivial or even amenable to automation. However, dictionary-based rules are demonstrably insufficient (see Section 5.2) because many connectives are ambiguous among different senses (cf. Webber et al., 2019). Furthermore, homonymy adds a layer of complexity: words that serve as connectives in certain contexts might serve other roles, such as a coordinating conjunction or relativizer, in other contexts. Finally, Thai orthography lacks the punctuation that demarcates clause and sentence boundaries, which further complicates discourse argument identification, a step for which PDTB-style annotation requires explicit clausal units. This absence means that manual, linguistically informed annotation is essential to ensure accuracy.
Our corpus is an initiative to address these challenges. We believe that it represents the most complete study of explicit discourse connectives in a low-to-medium resource language such as Thai. This focus was intentional, as we aim to study (explicit) discourse connectives as structural elements that signal coherence and discourse relations. By homing in on these connectives, we aim to contribute a dataset that not only tackles the issues highlighted above but also offers insights into discourse analysis in Thai, which can transfer to other languages where clausal units are not clearly marked, such as Lao and Burmese.
Our main contributions can be summarized as follows:
We construct the first large (10,000-sentence) Thai corpus annotated with explicit discourse relations.1 The corpus is annotated following the English PDTB-3 scheme, with 18,374 instances of connectives in 33 different relations (Section 4).
We compile the first extensive list and classification of 148 potentially polysemous connectives with a total of 340 form-sense pairs. The list can be extended, and new connectives can be classified following our criteria (Section 3).
We develop and train machine learning models for identifying connectives and classifying the discourse relation senses they convey. Our best model achieves an F1 score of 0.96 in the connective identification task and 0.46 in the sense classification task (Section 5).
2 Related Work
2.1 Discourse Corpora
We are not aware of any other discourse corpora for Thai. Thus, the current work introduces Thai into the pool of languages with PDTB-style corpora. Such corpora already include the PDTB for English (Miltsakaki et al., 2004; Prasad et al., 2006, 2008, 2019), the Prague Discourse Treebank for Czech (Poláková et al., 2013), the CUHK Discourse Treebank for Chinese (Zhou et al., 2014), the Chinese Discourse Treebank (Lu et al., 2014), the Hindi Discourse Relation Bank (Oza et al., 2009), the Leeds Arabic Discourse Treebank (Al-Saif and Markert, 2010), and the Turkish Discourse Bank (Zeyrek et al., 2013).
Some of the aforementioned corpora adopt a sense classification scheme that differs from the PDTB's. For example, the CUHK treebank (Zhou et al., 2014) has its own set of Level-2 and Level-3 senses. The Hindi treebank likewise added the sense Goal because PDTB-2 had no purpose sense at the time (Oza et al., 2009). In this regard, the Thai Discourse Treebank is no different, adding Expansion.GenExpansion as needed (see Section 3.3). The modification is minimal, so strong compatibility with the PDTB is maintained.
These corpora also provide discourse annotation on top of an existing constituency or dependency treebank. The Thai Discourse Treebank, by contrast, annotates over tokenized data, as no sufficiently large hand-annotated treebank existed at the time of corpus construction. This is not a serious limitation, however, because the full corpus annotates only connective senses and pairs, which do not require structural information.
Apart from the PDTB, a popular framework for discourse annotation is Rhetorical Structure Theory (RST; Mann and Thompson, 1988). RST provides discourse tree structures and has been applied to English (Carlson et al., 2003) and other languages such as Spanish, Chinese (Li et al., 2014; Cao et al., 2018), and Russian (Toldova et al., 2017). Future work should aim to extend and align the current corpus with the LST20-based Blackboard Treebank,2 which has since become available, and annotate it using either the PDTB or RST scheme.
2.2 Connective Identification and Sense Classification
Previous work has been done on both connective identification and classification tasks for explicit connectives, but we are unaware of any work that has been done on Thai.
In English, models often rely on word-pair features (Rutherford and Xue, 2014) and syntactic features such as constituency or dependency features. Early models, including Naive Bayes (Pitler and Nenkova, 2009) and Maximum Entropy (Lin et al., 2014; Wang and Lan, 2015), already achieved good results (F1 > 0.90) on both the identification and classification tasks on different datasets derived from PDTB-2. The properties of discourse connectives themselves have also been exploited to help annotate implicit discourse relations (Rutherford and Xue, 2015).
More recent approaches adopt (deep) neural networks and embeddings. Scholman et al. (2021) directly compared two deep learning models (Nie et al., 2019; Knaebel and Stede, 2020) with two simpler models (Lin et al., 2014; Wang and Lan, 2015) across different domains on the identification task. They found that Discopy (Knaebel and Stede, 2020), which uses BERT pretrained embeddings, performs best on the identification task in three out of four test datasets, including PDTB-2 (F1 ≈ 0.95). Other high-performing models for the identification task include ensembles that incorporate RNNs as well as automated and gold-standard syntactic features (Yu et al., 2019). Gessler et al. (2021) achieved SOTA results on the DISRPT 2021 shared connective identification and sense classification tasks, which cover a wide variety of languages. Their model for the identification task is a CRF that uses a vector derived from Transformer-based word embeddings and Bi-LSTM character embeddings. Similarly, their sense classification models are BERT-based but also incorporate handcrafted syntactic features such as syntactic dependencies. The task is set up as a next-sentence-prediction task, and the models take discourse argument pairs (sentences) as input.
Other languages have received relatively less attention. The CoNLL 2016 shared task was the first attempt to spark interest in multilingual discourse parsing, covering English and Chinese (Xue et al., 2016). Existing models use well-known architectures and similar features: Random Forest connective identifiers for German (Bourgonje and Stede, 2018), BERT-based sense classifiers for German (Bourgonje and Stede, 2020), SVM sense classifiers with lexical features such as length and POS for Chinese (Huang and Chen, 2011), MaxEnt sense classifiers with syntactic features (Wang and Lan, 2016), and an end-to-end Chinese discourse parser that uses a BERT-based sense classifier (Hung et al., 2020).
3 Thai Discourse Connectives
Following the original PDTB (Miltsakaki et al., 2004), we define discourse connectives as predicates that take two or more abstract objects as arguments, where abstract objects refer to events, states, or propositions (Asher, 1993). These arguments are called Arg1 and Arg2. As a convention, Arg2 is identified as the argument the connective binds to. In Thai, the arguments are mostly clauses. In this section, we briefly discuss Thai discourse connectives and their properties, which inform our annotation guidelines for the Thai Discourse Treebank. Detailed information for each connective is provided alongside the corpus.
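To make this representation concrete, the following is a minimal sketch of how one explicit relation could be stored; the class and field names are our illustration, not the released corpus format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DiscourseRelation:
    """One explicit PDTB-style relation over a tokenized document."""
    connective: str                    # surface form, e.g., "และ" 'and'
    sense: str                         # e.g., "Expansion.Conjunction"
    arg1: tuple[int, int]              # token index span of Arg1
    arg2: tuple[int, int]              # span of Arg2, the argument the connective binds to
    paired_with: Optional[str] = None  # e.g., "ก็" when paired (see Section 3.2)

# A toy relation: a connective linking two clause spans.
rel = DiscourseRelation("และ", "Expansion.Conjunction", (0, 5), (6, 11))
```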
3.1 Syntax and Semantics of Discourse Connectives
As in English, Thai discourse connectives can be identified and categorized into three types according to their syntactic and discourse-semantic behaviors.
Co-ordinating Conjunctions (e.g., English and and Thai และ ‘and’) assert all their arguments. This means that all arguments of a co-ordinating conjunction participate in the sentence’s truth value. A cross-linguistically applicable way to test this is the negation test (Cristofaro, 2005). (1) is true only if ‘Kim went to school’ is false and/or ‘Kim ate lunch’ is false.
We treat all co-ordinating conjunctions as discourse connectives. Connectives of this type always appear at the beginning of an independent clause.
Subordinating Conjunctions (e.g., English if and Thai แม้ ‘even though’) have one argument that does not participate in the truth value, namely, the subordinate clause. (2) is true when ‘we still work’ is false, regardless of whether ‘we are sleepy’ is false or not.
This negation test is crucial for our analysis because Thai does not mark sentence boundaries. A discourse connective that is a subordinating conjunction binds a clause expressing a proposition that is insensitive to the negation test. Additionally, discourse connectives of this type can occur either at the beginning or at the end of the subordinate clause they bind to, depending on the idiosyncrasy of each connective. Subordinating conjunctions that occur clause-finally typically occur in pairs (see Section 3.2). แม้ ‘even though’ in (2) is an example of a connective that syntactically precedes the argument it binds to; … ปุ๊บ … ปั๊บ ‘as soon as’ in (3) exemplifies a paired connective that occurs clause-finally.
Adverbials also have an argument that does not participate in the truth value, but, unlike subordinating conjunctions, exhibit properties parallel to pronouns (Webber et al., 2003; Forbes-Riley, 2005). They can establish a relation from within an embedded clause to the matrix clause (4).
They may also refer to inferred events. (5) shows that the connective refers to the inferred event of ‘not being tired’.
Adverbials come in four classes: clause-initial, clause-final, clause-peripheral (initial/final), or post-subject, the last of which is also called second position (Enfield, 2007). Importantly, post-subject connectives appear exclusively and immediately after the subject, unlike English adverbs, which can also occur elsewhere. An example is เลย ‘consequently’, as in (6).
3.2 Pairedness
Some subordinating conjunctions occur in pairs. In such cases, one conjunction (the head) selects another conjunction that can occur with it (the child). (7) illustrates this with the conjunction pair … ทีไร … ทุกที ‘whenever’. In these cases, Arg2 is the argument that does not participate in the truth value under negation.
Co-ordinating conjunctions and adverbials do not have lexically specific pairs.
All three types can optionally be paired with ก็, a general post-subject adverbial that can signal a variety of discourse relations (Iwasaki and Ingkaphirom, 2005). ก็ strictly occurs in the argument that is the matrix clause and that comes later linearly. Examples involving co-ordinating conjunctions and adverbials are given in (8) and (9).
For subordinating conjunctions, it can only occur when the subordinate clause is fronted, as in (10a), but not (10b).
3.3 Senses
We largely follow the PDTB 3.0 (Prasad et al., 2019) in describing the senses of Thai connectives. Each PDTB sense is the result of a three-level classification, as the sense inventory is organized hierarchically. For example, Contingency.Cause.Reason consists of Contingency (the Level-1 sense), Cause (the Level-2 sense), and Reason (the Level-3 sense), respectively.
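As a concrete illustration of the hierarchy, the lower-level labels can be recovered from a full sense string by splitting on the dots. The helper below is our own sketch; note that some senses (e.g., Expansion.Conjunction) stop at Level 2.

```python
def sense_levels(full_sense: str) -> tuple[str, str, str]:
    """Split a PDTB-style sense label into its (up to) three levels,
    padding missing lower levels with the empty string."""
    parts = full_sense.split(".")
    parts += [""] * (3 - len(parts))
    return parts[0], parts[1], parts[2]

print(sense_levels("Contingency.Cause.Reason"))  # ('Contingency', 'Cause', 'Reason')
print(sense_levels("Expansion.Conjunction"))     # ('Expansion', 'Conjunction', '')
```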
The only deviation from the English PDTB is our new sense Expansion.GenExpansion. This describes the sense of ซึ่ง and other connectives with the same function. The sense applies when Arg2 provides information that can only be fully comprehended with Arg1 as prior knowledge; it is equivalent to Background in Rhetorical Structure Theory (Mann and Thompson, 1988). Approximate English translations are the adverbials ‘in this regard’ or ‘in regard to this’, as in (11).
Under the original PDTB guidelines, ซึ่ง would fall under Expansion.Conjunction. We propose a separate sense because ซึ่ง and the quintessential Conjunction connective และ ‘and’ are hardly interchangeable in Thai. (12b) is unacceptable to many Thai speakers and has a different meaning from (12a) even for those who accept it.
4 The Thai Discourse Treebank
4.1 Corpus Construction
We first annotated discourse connectives on a sample of 50 documents from the LST20 corpus (Boonkwan et al., 2020). The corpus consists of news articles in various genres, from politics to science and technology, and has been previously annotated with word boundaries, sentence boundaries, part-of-speech tags, and named entities. We retokenized the documents so that every connective counts as one token, because some connectives consist of multiple tokens under the LST20 annotation guidelines. To ease the annotators’ workload, we automatically pre-annotated potential connectives for their connective status and senses by simple dictionary lookup, using our list of Thai discourse connectives compiled by a native Thai linguist and classified according to the criteria in Section 3. In the initial phase, an annotator with expertise in Thai linguistics corrected senses and removed non-connectives wrongly labeled by the automated process. They also tagged clauses and verb phrases as arguments for each discourse connective/relation. Largely following Prasad et al. (2019), Arg2 is assigned to the argument attached to the connective, and Arg1 is the other argument. As the final pass, the annotator identified additional connectives not in our initial connective list, giving us a more complete list of discourse connectives for the next phase.
We then sampled 334 more documents so that the total number of sentences is close to 10,000. The new documents were pre-processed as above and pre-annotated using the discourse connective list. The documents were then assigned roughly equally to three linguistically trained annotators for each task (argument span and sense annotation). We conducted double annotation for sense and argument annotation to measure inter-annotator agreement, with each annotator annotating the first 7 sentences of 21 randomly sampled documents. We achieved high inter-annotator agreement for the connective annotation tasks (Cohen’s κ > 0.75; Table 2) and decent agreement for argument annotation (Jaccard index ≈ 0.6–0.7; Table 3). Because the disagreements were not substantial, we proceeded with the full annotation without double annotation.
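For reference, the two agreement measures can be computed as sketched below; the helper names and the toy annotations are hypothetical, and we assume per-token sense labels for κ and sets of 4-tuples for the Jaccard index, matching Tables 2 and 3.

```python
from sklearn.metrics import cohen_kappa_score

def kappa(labels_a, labels_b):
    # Cohen's kappa over the labels two annotators gave the same tokens.
    return cohen_kappa_score(labels_a, labels_b)

def jaccard(spans_a, spans_b):
    # Intersection over union of two sets of
    # (document, token_id, tag, connective) 4-tuples.
    return len(spans_a & spans_b) / len(spans_a | spans_b)

# Toy argument-span annotations from two annotators:
a = {("doc1", 3, "Arg1", "และ"), ("doc1", 4, "Arg1", "และ")}
b = {("doc1", 3, "Arg1", "และ"), ("doc1", 5, "Arg2", "และ")}
print(jaccard(a, b))  # 1 shared tuple out of 3 in the union ≈ 0.33
```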
The final corpus consists of 10,602 sentences from 384 documents, 180 of which have complete annotation of explicit discourse relations (the discourse connective and its two argument spans). The documents are news articles about politics (134 documents; 34.89%), crimes and accidents (60 documents; 15.63%), and economics (43 documents; 11.20%), among other topics. The corpus has a total of 18,374 instances of discourse connectives across 33 different senses. We consider this corpus a complete first release, with potential plans to update it in later versions, just as the PDTB did.
We compare the most common level-2 senses in our Thai corpus with those in two other PDTB-style corpora (Table 1). The exact rankings differ, but the most common senses tend to be the same, with Conjunction at the top.
| Thai Discourse Treebank (18,374) | % | PDTB-3 (25,406) | % | CDTB (11,084) | % |
|---|---|---|---|---|---|
| Expansion.Conjunction | 21.28 | Expansion.Conjunction | 34.67 | Conjunction | 62.78 |
| Contingency.Cause | 15.96 | Comparison.Concession | 19.11 | Expansion | 14.47 |
| Temporal.Asynchronous | 13.17 | Contingency.Cause | 9.28 | Temporal | 6.52 |
| Expansion.GenExpansion | 12.89 | Temporal.Asynchronous | 8.31 | Causation | 5.59 |
| Comparison.Concession | 10.33 | Temporal.Synchronous | 7.68 | Contrast | 4.43 |
| Contingency.Condition | 8.86 | Contingency.Condition | 5.87 | Purpose | 2.68 |
| Contingency.Purpose | 7.86 | Comparison.Contrast | 5.33 | Conditional | 2.02 |
| Comparison.Contrast | 6.72 | Expansion.Manner | 2.15 | Progression | 1.30 |
| Expansion.Disjunction | 1.71 | Contingency.Purpose | 1.50 | Alternative | 0.20 |
| Temporal.Synchronous | 1.22 | Expansion.Instantiation | 1.28 | | |
Raw agreement proportions (P) and Cohen’s κ for connective status and sense annotation. Ant = annotator. Ant0–Ant2 worked on documents with only connective annotation; Ant3–Ant5 annotated both arguments and connectives.
| Ant #1 | Ant #2 | P/κ (Status) | P/κ (Sense) | #Inst |
|---|---|---|---|---|
| Ant0 | Ant1 | 0.92 / 0.83 | 0.84 / 0.76 | 475 |
| Ant1 | Ant2 | 0.88 / 0.91 | 0.89 / 0.82 | 488 |
| Ant2 | Ant0 | 0.96 / 0.77 | 0.82 / 0.77 | 501 |
| Ant3 | Ant4 | 0.92 / 0.84 | 0.89 / 0.85 | 449 |
| Ant3 | Ant5 | 0.91 / 0.83 | 0.86 / 0.81 | 549 |
| Ant4 | Ant5 | 0.92 / 0.83 | 0.91 / 0.87 | 503 |
Jaccard similarity (intersection over union) for argument span annotation. Each annotation set is a set of 4-tuples ⟨document, token ID, tag (Arg1/Arg2), connective⟩. Only arguments of words identified as connectives by both annotators are considered.
| Ant #1 | Ant #2 | Agreement | Union size |
|---|---|---|---|
| Ant3 | Ant4 | 0.70 | 8275 |
| Ant3 | Ant5 | 0.63 | 10932 |
| Ant4 | Ant5 | 0.72 | 8446 |
4.2 Analysis of Annotation Disagreements
We conducted an analysis to determine the sources of disagreement and to highlight the points where annotators should take special care. Because the annotator pairs exhibited similar patterns of disagreement, we merged the pairs into a single contingency matrix to streamline the analysis (Figure 1). We filter the matrix to include only the top 15 senses, as lower-ranked senses yield too few instances for meaningful analysis. Note that the contingency matrix is not symmetric, just as a confusion matrix in a classification task is not, because it compares two annotators with different annotations.
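A contingency matrix of this kind can be built directly from the two annotators’ label sequences, for example with pandas; the labels below are toy data.

```python
import pandas as pd

# Hypothetical per-token sense labels from the doubly annotated subset;
# "N" marks non-connectives.
ann_a = ["Expansion.Conjunction", "N", "Comparison.Contrast", "N"]
ann_b = ["Expansion.Conjunction", "Temporal.Asynchronous.Precedence", "N", "N"]

# Cross-tabulate the two annotations. Neither axis is a gold standard,
# so rows and columns are two annotators rather than truth vs. prediction.
matrix = pd.crosstab(pd.Series(ann_a, name="Annotator A"),
                     pd.Series(ann_b, name="Annotator B"))
print(matrix)
```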
The contingency matrix from the double annotation performed by the annotators shows that non-connectives are confused more often than senses are.
Most frequent disagreements come from confusing connectives and non-connectives in multi-functional words. By multi-functional, we mean connectives that have non-connective homonyms serving a grammatical function. The most common confusion is between Expansion.Conjunction and N in และ, ทั้ง, and พร้อม. และ behaves similarly to English and in that it can conjoin S, VP, or NP. Thus, the annotators sometimes struggled to distinguish NP coordination, which is not a discourse connective use, from VP- and S-coordination, which is. Similarly, พร้อม is a relatively rare connective and is often used as an adjective meaning ‘ready.’
Other frequent disagreements involve Expansion.GenExpansion, Temporal.Asynchronous.Succession, Comparison.Contrast, and Temporal.Asynchronous.Precedence. For each of these, the connective has a non-connective homonym: a relativizer (e.g., GenExpansion ซึ่ง), an exclusive particle (e.g., Contrast แต่), an auxiliary (e.g., Precedence แล้ว), or a preposition (e.g., Precedence ก่อน). (13) gives an example of a disagreement on ก่อน, where one annotator mistakes it for a non-connective.
Some of these disagreements are due to genuine ambiguity, with no mistakes made. In the common case (14), แล้ว is ambiguous between a post-clausal aspect auxiliary (a non-connective) and a subordinator attached to the second clause (a Precedence connective).
However, whether multifunctionality leads to disagreement depends on the word’s frequency as a connective and on its syntactic behavior. Comparison and Contingency connectives induce fewer disagreements, even though they have non-connective counterparts, because they are more often connectives than not. For instance, แต่ is most frequently a Comparison.Concession.Arg2-As-Denier connective, not an exclusive particle. In addition, the Purpose connective เพื่อ can easily be distinguished from its non-connective prepositional counterpart because the connective usually requires a complementizer (e.g., ให้) when taking a clause, whereas the preposition does not.
For these potentially tricky cases, annotators analyzed the syntax during annotation (e.g., checking whether the argument is an NP, VP, or S) and consulted a supervising linguist in hard cases.
Sense disagreements can result from closely related, confusable senses. Beyond the confusability of non-connectives and connectives, even among clear instances of connectives, some senses are harder to tease apart than others. Notably, โดย is solely responsible for all disagreements involving Expansion.GenExpansion and Expansion.Level-of-detail.Arg2-As-Detail. Although the annotators distinguish its prepositional (non-connective) use perfectly, they sometimes struggle with the subtle nuances separating these two senses: Expansion.GenExpansion relates to a broader expansion of information or ideas, while Expansion.Level-of-detail.Arg2-As-Detail involves the inclusion of specific details or finer granularity within the discourse. For these easily confusable senses, annotators were instructed to work with clearer paraphrases (e.g., denier แต่, but not contrastive แต่, can be paraphrased with another denier such as ถึง) and to consult a supervising linguist when in doubt.
Argument span disagreements can result from attachment ambiguity. Unlike sense labelling, disagreements in argument spans are more sporadic and do not recur. One discernible pattern involves attachment ambiguity, where it is unclear whether a subordinate clause attaches to S, VP, or an embedded VP. In (15), for example, it is not clear whether the clause with หลัง attaches to the whole clause including the subject, only the VP of the clause, or the VP-complement of เล็ง. Such ambiguities barely affect the semantics of the sentence. As in (14), neither annotator is incorrect; their annotations differ because they opt for different underlying structures.
5 Discourse Connective Identification and Classification Models
With our corpus, we develop classifiers for two basic shallow discourse parsing tasks:
Connective identification: determining whether a word is an explicit discourse connective. This task is formulated as a binary classification problem or a binary sequence tagging problem.
Sense classification: classifying a given word as either a non-connective or one of the discourse senses it conveys. In other words, this task handles connective identification and sense classification in one go.
We train models with all three PDTB sense levels under two conditions, i.e., given raw tokens vs. gold-standard connectives. For each task, we experiment with three families of models.
Dictionary To establish a baseline, we create a simple model that, in the connective identification task, classifies a word as a connective if it is more often a connective than not in the training set, and, in the sense classification task, selects the most frequent training-set sense for a given token.
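A minimal sketch of this majority-label lookup follows; the helper names are ours, and `training_tokens` is assumed to be an iterable of (word, label) pairs with "N" for non-connectives.

```python
from collections import Counter, defaultdict

def build_dictionary_baseline(training_tokens):
    # Map each word form to its most frequent training label
    # ("N" for non-connective, otherwise a sense label).
    counts = defaultdict(Counter)
    for word, label in training_tokens:
        counts[word][label] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def predict(baseline, words):
    # Words unseen in training default to non-connective.
    return [baseline.get(w, "N") for w in words]
```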
Conditional Random Fields (CRF) We formulated the task as a sequence tagging problem and trained CRF models on both tasks, using different combinations of the following features (a sketch of one possible feature encoding is given after the list):

- (a) Word form
- (b) Gold-standard part-of-speech tags annotated in the LST20 corpus, or automatic part-of-speech tags from PyThaiNLP’s perceptron engine (Phatthiyaphaibun et al., 2016)
- (c) All word and/or POS-tag uni-, bi-, and tri-grams within a context window of 3 words around a given word
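The sketch below shows one plausible encoding of features (a)–(c) as per-token feature dictionaries, the input format expected by common CRF toolkits such as sklearn-crfsuite; the exact feature templates in our experiments may differ.

```python
def token_features(words, pos_tags, i, window=3):
    """Features for token i: its form and POS tag, plus all word and
    POS n-grams (n = 1, 2, 3) within a 3-word window around it."""
    feats = {"word": words[i], "pos": pos_tags[i]}
    lo, hi = max(0, i - window), min(len(words), i + window + 1)
    for n in (1, 2, 3):
        for j in range(lo, hi - n + 1):
            feats[f"w[{j - i}]{n}gram"] = " ".join(words[j:j + n])
            feats[f"p[{j - i}]{n}gram"] = " ".join(pos_tags[j:j + n])
    return feats

def sentence_features(words, pos_tags):
    # One feature dict per token; labels are aligned per token.
    return [token_features(words, pos_tags, i) for i in range(len(words))]
```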
Pretrained Language Model (PLM) We formulated the task as a token classification task and finetuned WangchanBERTa (Lowphansirikul et al., 2021), a RoBERTa-style pretrained Thai language model, on both tasks. We experiment with feeding one sentence at a time and with feeding sentence pairs as input.
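A minimal fine-tuning sketch is shown below, assuming the publicly released WangchanBERTa checkpoint on the Hugging Face hub; in practice, word-level labels must also be aligned to the tokenizer’s subword tokens before computing the loss.

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Assumed checkpoint name for WangchanBERTa; num_labels is 2 for
# identification, or the sense inventory size plus "N" for the
# joint sense classification task.
NAME = "airesearch/wangchanberta-base-att-spm-uncased"
tokenizer = AutoTokenizer.from_pretrained(NAME)
model = AutoModelForTokenClassification.from_pretrained(NAME, num_labels=2)

# One sentence per input (the better-performing setting in Table 4).
enc = tokenizer("ผมง่วงแต่ก็ยังทำงาน", return_tensors="pt")
logits = model(**enc).logits  # shape: (1, num_subwords, num_labels)
```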
For sense classification with gold-standard connectives, we use the dictionary and PLM models but switch the CRF out for Maximum Entropy (MaxEnt) models with the same features, since this task operates on specific tokens (i.e., the known connectives) rather than a sequence of tokens.
5.1 Experiment Setup
We randomly split the data into the train-dev-test sets (80:10:10). Since the sense inventory is organized in a hierarchy, we train separate models for level-1 (5-way), level-2 (20-way), and leaf-level (35-way) senses for the sense classification task. We add an extra label for non-connective.
It is worth noting that the F1 measures for the PLMs are calculated on predictions postprocessed for consistency. PLMs such as WangchanBERTa come with their own tokenizers, and if the tokenizer merges a connective into a neighboring word, the model has no chance of labeling it correctly. We therefore postprocess the predicted labels as follows. If a predicted connective ends with one of the 10 words that the WangchanBERTa tokenizer frequently mismerges into the previous word (จะ, ที่, มี, ให้, เป็น, ต้อง, ไม่, ว่า, ขอ, การ), we separate that word out into a distinct token. We then map each predicted label to the gold-standard token that 1) shares the same character index span or, failing that, 2) starts at an index contained within the auto-tokenized token. This ensures an equal number of tokens across models.
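This span-based mapping can be sketched as follows, assuming each token carries character offsets; the function and variable names are ours.

```python
def align_labels(pred_tokens, gold_tokens):
    """Map predicted labels onto gold-standard tokens. Each token is a
    (start_char, end_char, label) triple. A gold token takes a predicted
    label if 1) a predicted token shares its exact character span or,
    failing that, 2) the gold token starts inside a predicted token."""
    aligned = []
    for g_start, g_end, _ in gold_tokens:
        label = "N"  # default: non-connective
        for p_start, p_end, p_label in pred_tokens:
            if (p_start, p_end) == (g_start, g_end):
                label = p_label
                break
            if p_start <= g_start < p_end:
                label = p_label
        aligned.append(label)
    return aligned
```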
5.2 Key Results
We evaluate each model using macro-averaged F1 scores. To obtain conservative measures, we exclude the label non-connective from the calculation. Table 4 gives the macro-averaged F1 scores across tasks. In brief, across all tasks but one, the best-performing models are the PLMs (F1 = 0.96 for the identification task and 0.46 for the leaf-level sense classification task), followed by the log-linear models and the baseline dictionary-based systems, respectively (Table 4). Our key findings are as follows.
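Concretely, the conservative measure can be computed by restricting the label set passed to the scorer, as in this sketch with toy labels:

```python
from sklearn.metrics import f1_score

def macro_f1_without_n(y_true, y_pred):
    # Macro-average F1 over every label except "N" (non-connective).
    labels = sorted({lab for lab in y_true + y_pred if lab != "N"})
    return f1_score(y_true, y_pred, labels=labels, average="macro")

y_true = ["N", "Expansion.Conjunction", "N", "Comparison.Contrast"]
y_pred = ["N", "Expansion.Conjunction", "Comparison.Contrast", "N"]
print(macro_f1_without_n(y_true, y_pred))  # 0.5: one class perfect, one missed
```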
Macro-F1 scores for each model. The scores are calculated after excluding the class non-connective. Models under Raw tokens are trained and tested on all of the tokens in the data. Models under Gold-standard connectives are trained and tested on gold-standard connectives only. Boldfaced figures are the highest F1 scores among models on the same dataset (raw tokens/gold-standard connectives) and task (connective identification/level 1–3 sense classification). All models are tested on the test set.
| Model | Identification | Sense LV1 | Sense LV2 | Sense LV3 |
|---|---|---|---|---|
| **Raw tokens** | | | | |
| WangchanBERTa - 1 sentence | **0.956** | **0.886** | **0.517** | **0.460** |
| WangchanBERTa - 2 sentences | 0.906 | 0.882 | 0.508 | 0.413 |
| CRF (Word) | 0.725 | 0.660 | 0.323 | 0.297 |
| CRF (Auto-POS + Word) | 0.732 | 0.679 | 0.328 | 0.303 |
| CRF (POS + Word) | 0.784 | 0.760 | 0.360 | 0.336 |
| CRF (Word 1,2,3-grams) | 0.853 | 0.815 | 0.433 | 0.380 |
| CRF (Auto-POS + Word 1,2,3-grams) | 0.858 | 0.811 | 0.438 | 0.385 |
| CRF (POS + Word 1,2,3-grams) | 0.868 | 0.831 | 0.501 | 0.418 |
| Dictionary | 0.718 | 0.643 | 0.326 | 0.302 |
| **Gold-standard connectives** | | | | |
| WangchanBERTa - 1 sentence | – | **0.964** | 0.636 | **0.610** |
| WangchanBERTa - 2 sentences | – | 0.865 | 0.624 | 0.538 |
| MaxEnt (Word) | – | 0.921 | 0.634 | 0.518 |
| MaxEnt (Auto-POS + Word) | – | 0.921 | 0.677 | 0.573 |
| MaxEnt (POS + Word) | – | 0.922 | **0.723** | 0.576 |
| MaxEnt (Word 1,2,3-grams) | – | 0.941 | 0.596 | 0.538 |
| MaxEnt (Auto-POS + Word 1,2,3-grams) | – | 0.930 | 0.592 | 0.531 |
| MaxEnt (POS + Word 1,2,3-grams) | – | 0.937 | 0.607 | 0.538 |
| Dictionary | – | 0.689 | 0.352 | 0.326 |
Dictionary-based models are insufficient for sense classification tasks. The dictionary-lookup system using the majority sense performs somewhat decently in the identification task (F1 = 0.72). However, it performs progressively worse as the sense classification task becomes finer-grained. The system achieves an F1 score of only 0.69 when classifying the level-1 sense of a given gold-standard discourse connective, because the proportion of the majority sense for most common discourse connectives generally ranges from 0.7 to 1.0, and is as low as 0.3 in the case of the multiply ambiguous ก็ (Table 5). This confirms the premise that these tasks in Thai need a machine learning–based solution. In line with previous results in English and other languages, the best-performing models are PLMs. These models outperform the log-linear models in all settings except level-2 sense classification given gold-standard connectives (Table 7). In the level-1 sense setting, the per-class F1 scores of the PLMs are in the range of 0.82–0.94, higher than the 0.72–0.90 of the best CRF models (Table 6).
Macro-F1 scores for level-1 sense classification of raw tokens. The scores in the CRF column are from the best CRF model, CRF (POS + Word 1,2,3-grams).
| Sense LV1 | F1 (PLM) | F1 (CRF) | F1 (Dict) | Support |
|---|---|---|---|---|
| Comparison | 0.91 | 0.88 | 0.81 | 383 |
| Contingency | 0.94 | 0.90 | 0.89 | 717 |
| Expansion | 0.88 | 0.82 | 0.56 | 731 |
| Temporal | 0.82 | 0.72 | 0.31 | 308 |
| micro avg | 0.89 | 0.85 | 0.71 | 2139 |
| macro avg | 0.89 | 0.83 | 0.64 | 2139 |
| weighted avg | 0.89 | 0.85 | 0.68 | 2139 |
Macro-F1 scores for level-2 sense classification of raw tokens. The scores in the CRF column are from the best CRF model, CRF (POS + Word 1,2,3-grams).
| Sense LV2 | F1 (PLM) | F1 (CRF) | F1 (Dict) | #Inst. |
|---|---|---|---|---|
| Comparison.Concession | 0.90 | 0.87 | 0.83 | 256 |
| Comparison.Contrast | 0.74 | 0.74 | 0.55 | 125 |
| Comparison.Similarity | 0.00 | 0.00 | 0.00 | 2 |
| Contingency.Cause | 0.93 | 0.94 | 0.92 | 366 |
| Contingency.Cause+Belief | 0.20 | 0.00 | 0.46 | 9 |
| Contingency.Condition | 0.89 | 0.79 | 0.79 | 192 |
| Contingency.Condition+SA | 0.00 | 0.00 | 0.00 | 1 |
| Contingency.Neg-Condition | 0.00 | 0.00 | 0.00 | 5 |
| Contingency.Purpose | 0.96 | 0.97 | 0.89 | 144 |
| Expansion.Conjunction | 0.87 | 0.84 | 0.27 | 398 |
| Expansion.Disjunction | 0.86 | 0.91 | 0.00 | 39 |
| Expansion.Exception | 0.00 | 0.00 | 0.00 | 4 |
| Expansion.GenExpansion | 0.86 | 0.81 | 0.76 | 265 |
| Expansion.Instantiation | 0.57 | 0.40 | 0.00 | 3 |
| Expansion.Level-of-detail | 0.00 | 0.21 | 0.00 | 16 |
| Expansion.Substitution | 0.00 | 0.29 | 0.00 | 6 |
| Temporal.Asynchronous | 0.81 | 0.74 | 0.34 | 277 |
| Temporal.Synchronous | 0.72 | 0.51 | 0.06 | 31 |
| micro avg | 0.86 | 0.83 | 0.68 | 2139 |
| macro avg | 0.52 | 0.50 | 0.33 | 2139 |
| weighted avg | 0.86 | 0.82 | 0.61 | 2139 |
Functionally ambiguous connectives are difficult for models and humans alike. As Table 9 (right) shows, most model misclassifications result from confusing N (non-connective) with another sense. Interestingly, the errors that the models make are the same ones that cause disagreement among annotators (Table 9, left; see also Section 6), suggesting that the difficulty is inherent to the task rather than a model-specific flaw.
The fact that these errors are due to multifunctionality can explain why POS information is important for the CRF models (F1 increases by 0.05 from CRF (Word) to CRF (POS + Word)) but not for the corresponding MaxEnt models, which operate on tokens already known to be connectives (F1 increases by 0.001 from MaxEnt (Word) to MaxEnt (POS + Word)). It also explains why PLMs perform best, as they have been shown to implicitly encode morphosyntactic information (Hewitt and Manning, 2019; Manning et al., 2020). Both POS and syntax could help with cases of multifunctionality such as adjectival พร้อม ‘ready’ vs. conjunctive พร้อม mentioned in Section 4.2 (although not always; cf. Section 6). Future work on discourse-based models should thus focus on extracting grammatical information, and may test whether using two separate models for the joint task yields better performance, since the present models fare well on the isolated identification task (best PLM F1 = 0.96).
CRF models and PLMs do not differ in the types of errors they make. The PLMs and CRF models appear to make the same types of errors almost across the board, and their relative performances exhibit the same trends across senses. In level-1 sense classification, for example, Contingency is the easiest class for both, followed by Comparison, Expansion, and Temporal (Table 6). The exceptions to this trend appear to be due to sparse data rather than true performance differences. In level-2 classification, CRF models beat PLMs only by a margin of about 1% or in senses with fewer than 40 tokens (39 tokens for Expansion.Disjunction, 16 for Expansion.Level-of-detail, and 6 for Expansion.Substitution). Similarly, in level-3 sense classification, CRF models beat PLMs in only three senses, two of which (Expansion.Disjunction, Expansion.Level-of-detail.Arg2-As-Detail) have fewer than 40 tokens (Table 8).
Macro-F1 scores for level-3 sense classification of raw tokens. The scores in the CRF column are from the best CRF model, CRF (POS + Word 1,2,3-grams).
| Connective sense | F1 (PLM) | F1 (CRF) | F1 (Dict) | N |
|---|---|---|---|---|
| Comparison.Concession.Arg1-As-Denier | 0.80 | 0.62 | 0.62 | 60 |
| Comparison.Concession.Arg2-As-Denier | 0.85 | 0.84 | 0.80 | 196 |
| Comparison.Contrast | 0.75 | 0.74 | 0.55 | 125 |
| Comparison.Similarity | 0.00 | 0.00 | 0.00 | 2 |
| Contingency.Cause+Belief.Reason+Belief | 0.00 | 0.00 | 0.67 | 5 |
| Contingency.Cause+Belief.Result+Belief | 0.33 | 0.00 | 0.00 | 4 |
| Contingency.Cause.Reason | 0.97 | 0.97 | 0.97 | 222 |
| Contingency.Cause.Result | 0.84 | 0.86 | 0.83 | 144 |
| Contingency.Condition+SA | 0.00 | 0.00 | 0.00 | 1 |
| Contingency.Condition.Arg1-As-Cond | 0.00 | 0.00 | 0.00 | 13 |
| Contingency.Condition.Arg2-As-Cond | 0.86 | 0.81 | 0.83 | 179 |
| Contingency.Negative-Condition.A1-NC | 0.00 | 0.00 | 0.00 | 1 |
| Contingency.Negative-Condition.A2-NC | 0.00 | 0.00 | 0.00 | 4 |
| Contingency.Purpose.Arg1-As-Goal | 0.00 | 0.00 | 0.00 | 3 |
| Contingency.Purpose.Arg2-As-Goal | 0.98 | 0.98 | 0.90 | 141 |
| Expansion.Conjunction | 0.87 | 0.84 | 0.27 | 398 |
| Expansion.Disjunction | 0.83 | 0.91 | 0.00 | 39 |
| Expansion.Exception.Arg2-As-Excpt | 0.00 | 0.00 | 0.00 | 4 |
| Expansion.GenExpansion | 0.85 | 0.80 | 0.76 | 265 |
| Expansion.Instantiation.Arg2-As-Instance | 0.86 | 0.40 | 0.00 | 3 |
| Expansion.Level-of-detail.Arg2-As-Detail | 0.00 | 0.21 | 0.00 | 16 |
| Expansion.Substitution.Arg1-As-Subst | 0.00 | 0.00 | 0.00 | 3 |
| Expansion.Substitution.Arg2-As-Subst | 0.00 | 0.00 | 0.00 | 3 |
| Temporal.Asynchronous.Precedence | 0.64 | 0.58 | 0.24 | 106 |
| Temporal.Asynchronous.Succession | 0.85 | 0.78 | 0.38 | 171 |
| Temporal.Synchronous | 0.69 | 0.52 | 0.06 | 31 |
| micro avg | 0.84 | 0.82 | 0.67 | 2139 |
| macro avg | 0.46 | 0.42 | 0.30 | 2139 |
| weighted avg | 0.83 | 0.80 | 0.60 | 2139 |
The left half of the table shows the top 15 disagreements between annotators. The right half shows the top 15 disagreements (misclassifications) between the best PLM model and the gold-standard data.
| Sense A (annotators) | Sense B (annotators) | Count | Sense A (PLM vs. gold) | Sense B (PLM vs. gold) | Count |
|---|---|---|---|---|---|
| Expansion.Conjunction | N | 59 | Expansion.Conjunction | N | 81 |
| Expansion.GenExpansion | N | 32 | Expansion.GenExpansion | N | 66 |
| Expansion.GenExpansion | Expansion.Level-of-detail.Arg2-As-Detail | 26 | N | Temporal.Asynchronous.Precedence | 38 |
| N | Temporal.Asynchronous.Succession | 25 | Comparison.Contrast | N | 30 |
| N | Temporal.Asynchronous.Precedence | 25 | N | Temporal.Asynchronous.Succession | 27 |
| Comparison.Contrast | N | 24 | Contingency.Condition.Arg2-As-Cond | N | 23 |
| Comparison.Contrast | Temporal.Synchronous | 21 | Comparison.Concession.Arg2-As-Denier | Comparison.Contrast | 22 |
| Comparison.Concession.Arg2-As-Denier | Comparison.Contrast | 17 | Contingency.Cause.Result | N | 21 |
| Comparison.Contrast | Expansion.Conjunction | 13 | Comparison.Concession.Arg2-As-Denier | N | 20 |
| Expansion.Conjunction | Temporal.Asynchronous.Precedence | 13 | Comparison.Concession.Arg1-As-Denier | Comparison.Concession.Arg2-As-Denier | 17 |
| Contingency.Cause.Result | N | 13 | Expansion.Disjunction | N | 13 |
| Comparison.Concession.Arg2-As-Denier | N | 11 | N | Temporal.Synchronous | 12 |
| N | Temporal.Synchronous | 10 | Temporal.Asynchronous.Precedence | Temporal.Asynchronous.Succession | 9 |
| Contingency.Cause.Result | Temporal.Asynchronous.Precedence | 8 | Contingency.Condition.Arg2-As-Cond | Temporal.Asynchronous.Succession | 9 |
| Expansion.Instantiation.Arg2-As-Instance | N | 7 | Expansion.Conjunction | Temporal.Asynchronous.Precedence | 9 |
Connective-related tasks in Thai require wide context. The tested models work with different context window sizes, from the dictionary-based system with essentially no context, to the CRF models with local context (n-gram features), to the PLMs with a large context window (416 tokens during training). The trend is that more contextual information yields higher performance. One apparent anomaly is that the PLMs with two-sentence inputs, which have the largest context size, do not perform best. However, RoBERTa-style PLMs like WangchanBERTa are not pretrained on a next-sentence-prediction task and so, unlike BERT-style PLMs, do not learn discourse information across sentence pairs. This might explain why it is better to feed the model one sentence at a time.
6 Error Analysis of Models
We conducted a more detailed error analysis of the predictions of our best PLM model by inspecting the error rates for different sentence lengths. We also extracted the 10 most common per-token-and-misclassification errors (115 instances in total) and manually annotated each instance for its error type (Table 10). In line with our key results, the error analysis reveals a striking parallel between the kinds of mistakes that the models make and those on which annotators disagree (cf. Section 4.2).
Common sources of errors in the results.
| Error sources | Count |
|---|---|
| Similar senses | 32 |
| Co-ordination | 26 |
| Other grammatical functions (multifunctionality) | 22 |
| Arg1-Arg2 misidentification | 13 |
| NP-S misidentification | 2 |
| No discernible pattern | 20 |
PLMs perform better than CRF models regardless of sentence length. We explored the effect of sentence length on the error rates. One might expect CRF models to perform better than or on par with PLMs on short sentences, where local context is often sufficient. However, we find no such difference, suggesting that wide context is important even in short sentences (Figure 2).
The best PLM model performs better than the best CRF model regardless of sentence length.
Similar senses induce misclassification. We note that 32 of the 115 extracted errors are cases of similar senses in แต่ and โดย; 19 แต่ instances involve a misclassification between Comparison.Concession.Arg2-As-Denier and Comparison.Contrast. Both senses highlight a contrast between the two arguments; the difference is that only the former also involves denying an implicature. (16) gives an example.
The 13 errors involving โดย feature a misclassification between Expansion.GenExpansion and Expansion.Level-of-detail.Arg2-As-Detail. As noted in Section 4.2, these two senses are also closely related.
Errors are primarily syntactic. Most of the remaining errors are syntax-related. The most interesting cases are those due to multifunctionality, recapitulating the inter-annotator disagreements. For example, in (17), ก่อน is a Precedence connective misclassified as a non-connective, plausibly because it can also be a preposition (like English before).
Another particularly common error type involves the model misclassifying a co-ordinator as a non-connective of the same form (or vice versa). In (18), หรือ is a connective co-ordinating two VPs but is misclassified as a non-connective, plausibly because it can also co-ordinate NPs in other cases (cf. English or).
Finally, some errors are specific to แต่. While itself a connective, แต่ can also be selected by ถึง ‘although’. When paired this way, the argument following แต่ is Arg1 instead of Arg2, because Arg2 is taken to be the clause attached to ถึง, an annotation choice made to treat all subordinate clauses uniformly as Arg2. This and the previous two error types highlight the need for more syntactic information in NLP tasks for languages with few orthographic cues to syntax (e.g., spaces and punctuation).
7 Conclusion
In this paper, we present the Thai Discourse Treebank, the first Thai corpus annotated with PDTB-style discourse relations. We develop annotation guidelines for Thai and annotate 18,374 instances of discourse connectives over 10,602 sentences on top of the LST20 corpus. We benchmark the tasks of connective identification and sense classification using log-linear models and PLM-based systems. We also provide extensive qualitative analyses of the data and model results. In later releases of the data, we hope to incorporate syntactic information and implicit connectives.
Acknowledgments
We would like to thank the reviewers and action editors for their helpful reviews and feedback.