Overview of SMP-CAIL2020-Argmine: The Interactive Argument-Pair Extraction in Judgement Document Challenge

Abstract In this paper we present the results of the Interactive Argument-Pair Extraction in Judgement Document Challenge, held jointly by the Chinese AI and Law Challenge (CAIL) and the Chinese National Social Media Processing Conference (SMP), and introduce the related data set, SMP-CAIL2020-Argmine. The task challenged participants to choose, among five candidates proposed by the defense, the correct argument that refutes or acknowledges a given argument made by the plaintiff, given the full context of both parties recorded in the judgement documents. We received entries from 63 competing teams, 38 of which scored higher than the provided baseline model (BERT) in the first phase and entered the second phase. The best performing systems in the two phases achieved accuracies of 0.856 and 0.905, respectively. We present the results of the competition and a summary of the systems, highlighting commonalities and innovations among participating systems. The SMP-CAIL2020-Argmine data set and baseline models have already been released.


INTRODUCTION
In a trial process, the opinions, testimonies and results of both sides of the case are all recorded in detail in the judgement document [1], an example of which is shown in Figure 1. Traditionally, such text information has been organized, summarized and analyzed manually by the judge, which is time consuming and inefficient. In recent years, with the increasing interest in automatic analysis in the judicial field [2,3,4], more and more attention has been paid to automatic systems for the judicial process, from Ulmer's proposal of quantitative methods and probability theory [5] and Nagel's [6] optimization and statistical methods to the natural language processing (NLP) models of Liu & Chen [7], Sulea et al. [8] and Katz et al. [9], which leverage more lexical features in judicial documents. This line of work indicates that such a task is greatly needed and of practical value.
Another research area of interest is argumentation mining, since argument is playing an increasingly important role in decision making on social issues. As an automatic technique to process and analyze arguments, computational argumentation, aimed at mining the semantic and logical structure of the given text, has become a rapidly growing field in natural language processing. Existing research on argumentation mining covers argument structure prediction [10,11,12], claims generation [13][14][15][16][17], and interactive argument pairs identification [18][19][20][21][22][23][24]. Recently, Cheng et al. [25] extracted argument pairs from peer review and rebuttal data in order to study the content, structure and the connections between them.

Figure 1.
An instance of judgement document, which contains the statement of the defense and the plaintiff, the judgement date, the result of the trial, the judges' names, and the recorder's name.
In the works mentioned above, an interactive argument pair refers to two arguments that have logical or semantic interactions with each other, e.g., "The global warming does not affect our daily life as the scientists say." and "I cannot imagine what my life would be if my homeland is beneath the sea level." The two arguments talk about the same topic (global warming in this example), and the second one responds to the first by hypothesizing a scene of global warming.

During the trial process, the two parties both have to make their own points clear and respond to the opposite party, which resembles a debate to a large extent, so it is intuitive yet promising to apply computational argumentation methods to such a field. A typical task of this kind is to automatically extract the focus of dispute of the two parties in a trial process. Specifically, in a trial process, the focus of dispute between the plaintiff and the defense refers to the arguments that the two sides propose on fact statement or claim settlement, either consistent with each other or attacking each other, an example of which is shown in Figure 2. This is essentially the same setting as interactive argument pair extraction. Such a task is of high practical value: with an automatic system to extract these focuses of dispute, the judge can be freed from manually reading, comprehending and analyzing lengthy judgement documents, which in turn improves the efficiency and objectivity of the whole trial process.

Figure 2.
An example of three pairs of focus of dispute in one judgement document. Note: Each pair contains a sentence (i.e., argument) from the plaintiff and the defense, respectively. Among the three pairs, two of them are of Denying relationship and the other is of Partially Acknowledging relationship.
In order to address the aforementioned task, we hosted the Interactive Argument-Pair Extraction in Judgement Document (SMP-CAIL2020-Argmine) Challenge. We constructed a purpose-built data set that contains 4,080 entries of argument pairs from 976 judgement documents collected from http://wenshu.court.gov.cn/ published by the Supreme People's Court of China.
All the argument pairs were manually annotated by undergraduates and graduates majoring in law. Each argument pair consists of one argument from the plaintiff and one from the defense that interact with each other logically or semantically. During the process of annotation, annotators were given the full context of both sides and then required to extract all the interactive arguments between the plaintiff and the defense. Note that there can be multiple arguments from the defense that interact with the same argument from the plaintiff, and vice versa.
The task setting follows the one designed in Ji et al.'s work [23]. The systems participating in the SMP-CAIL2020-Argmine Challenge were required to identify, among five candidate arguments, the correct argument from the defense that interacts with the given argument from the plaintiff. That is to say, every entry of the collected argument pairs is converted into a multiple-choice problem with four false options. Therefore, performing well in the task requires the system to deeply understand the semantic relationship between the given argument from the plaintiff and the candidate arguments. We conducted the competition in two phases by setting a threshold accuracy in the first phase, so that only those teams whose systems outperformed the baseline models we provided could enter the second phase. The number of argument pairs reaches 4,080, including both the training data set and the test data sets of the two phases. In total, 315 teams from over 100 colleges and enterprises registered for the competition, 63 of which successfully submitted their models. We hope that research and practice in these fields will be stimulated by the challenges presented in this competition.
In this paper, we present a detailed description of the task and the data set, along with a summary of the submissions, and discuss the possible future research directions of the task.

Automatic Analysis of Judicial Documents
Automatic analysis of judicial documents has been studied for decades. At the very first stage, research tended to focus on mathematical and statistical analyses of existing court cases, rather than on conclusions or methodologies for the prediction or summarisation of judicial documents. Ulmer suggested some uses of quantitative methods and probability theory in analyzing judicial materials [5]. Similar work, including Nagel's [6] and Kort's [26], typically used optimization and statistics to conduct automatic judgement prediction. More recently, Lauderdale applied a kernel-weighted optimal classification estimator to recover estimates of judicial preferences [27].
Recent years have witnessed a boom in natural language processing (NLP), both theoretically and practically. As a natural application scenario of NLP, automation in judicial fields is also getting increasingly popular among NLP researchers. As a result, the automatic analysis of judicial documents has entered a brand new era. Liu and Chen [7] and Sulea et al. [8] extracted word features such as N-grams to train classifiers to predict the result of judgement, while Katz et al. [9] utilized case profile information (e.g., dates, terms, locations and case types). Going further, Luo et al. introduced an attention-based neural model to predict charges of criminal cases, and verified the effectiveness of taking law articles into consideration [28].
Besides the automatic systems, a great number of interesting and meaningful tasks have also been proposed. For example, Xiao et al. [29] proposed a large-scale legal data set for judgement prediction, collected from China Judgments Online, and then organized a competition for this task [30]. After that, more judicial tasks and challenges were proposed, such as those of Xiao et al. [31] and Liu et al. [32].
However, existing research mostly focuses on the case-level information understanding, such as applicable law articles, charges, and prison terms [29,30], and insufficient research has noticed the importance of automatically extracting the focus of dispute, i.e., the interactive arguments from both sides of the case.

Argumentation Mining
Argumentation mining is also a research area that has attracted much attention, especially in recent years. As a research field concerned with mining the logical and semantic structure of texts, it has seen various meaningful works. For instance, Baff et al. [33] compared content- and style-oriented classifiers on editorials from the liberal New York Times with ideology-specific effect annotations to explore the effect of the writing style of editorials on audiences of different parties; Ji et al. [23] proposed the task of identifying interactive argument pairs in online debate forums such as ChangeMyView (CMV), along with a novel representation learning method called Discrete Variational Encoder (DVAE) to encode different dimensions of information brought by the arguments in the corpus; Cheng et al. [25] collected text data from the peer review and rebuttal process to mine the argumentative relationship entailed in such discussion, and proposed a challenging data set of argument pair extraction with a multi-task learning framework to address such a task.
Also, the introduction of pretrained language models such as BERT [34] has opened a brand new era of NLP, with impressively improved performance in nearly all tasks.
The trial process resembles a debate in many ways, since in both cases two parties express their own opinions on the same topic and attack each other's arguments. Therefore, it is practical to leverage models and methods from argumentation mining in the aforementioned judicial tasks.

DATA SET CONSTRUCTION
As discussed before, our goal is to construct an automatic system such that it can identify all the interactive argument pairs contained in the given judgement document which records the statement of both the plaintiff and the defense. Therefore, we collect the related data set from the judgement document corpus.

Data Source and Preprocessing
The raw data of judgement are provided by China Justice Big Data Institute, including over 10,000 entries in JSON format.
We first conducted random sampling on the raw data set, finding that some documents were of low quality. More specifically, the statement from the defense in some documents was trivial, containing only an acknowledgement of all the statements made by the plaintiff; the interaction of the two sides in some documents focused only on the amount of charge, without any semantically or logically interactive arguments; and some documents contained too few or too many sentences to be analyzed.

In order to solve these problems, we refined the data set with the following rules:
• Delete all the entries that contain "供认不讳" (confessing without reservation) or "无异议" (having no objection) in the first sentence of the defense's statement, since very few of these entries refute the statement of the plaintiff.
• Delete all the entries that contain fewer than two non-charging sentences in either the statement of the plaintiff or that of the defense (a "non-charging sentence" is a sentence that does not contain figures), as we do not want the focus of dispute to concern only the amount of charge.
• Delete all the entries that contain fewer than four sentences in either the statement of the plaintiff or that of the defense, and all the entries that contain more than 1,500 words in the statements of both sides, so as to control the length of the documents and thus improve the quality of the data set.
After such filtering, we finally obtained 2,238 instances of judgement documents that are of high quality. Then we randomly sampled 40 of the obtained judgement documents and asked four graduate students to conduct human annotation of interactive argument pairs extraction. As a result, 120.25 argument pairs were extracted per person, and the average agreement was 0.628, which indicates that the task is both plausible and challenging.
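The three filtering rules above can be sketched as a single predicate. This is only an illustration under stated assumptions: the field names `plaintiff` and `defense` (lists of sentences) are hypothetical, "figures" is approximated by digit characters, and the 1,500-word limit is approximated by the combined character count of both statements.

```python
# Hypothetical document layout: {"plaintiff": [sentence, ...], "defense": [sentence, ...]}
ADMISSION_PHRASES = ("供认不讳", "无异议")


def is_non_charging(sentence):
    """Assumption: a sentence is 'non-charging' if it contains no digit characters."""
    return not any(ch.isdigit() for ch in sentence)


def keep_document(doc, min_sentences=4, min_non_charging=2, max_length=1500):
    """Return True if the document survives all three filtering rules."""
    plaintiff, defense = doc["plaintiff"], doc["defense"]

    # Rule 1: drop documents whose defense opens with a blanket admission.
    if defense and any(phrase in defense[0] for phrase in ADMISSION_PHRASES):
        return False

    for side in (plaintiff, defense):
        # Rule 3 (lower bound): each side needs at least four sentences.
        if len(side) < min_sentences:
            return False
        # Rule 2: each side needs at least two non-charging sentences.
        if sum(is_non_charging(s) for s in side) < min_non_charging:
            return False

    # Rule 3 (upper bound): combined length of both statements (assumed unit: characters).
    total_length = sum(len(s) for s in plaintiff + defense)
    return total_length <= max_length
```

Applying `keep_document` to every raw entry and keeping those that return `True` would reproduce the shape of the filtering step, though not necessarily the exact counts reported above.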

Annotation
After preprocessing the raw data, we started the annotation of the data set. The platform used for annotation is shown in Figure 3; it displays the sentences in the judgement documents and saves the annotation results to a database on the server.
We then employed six annotators who were undergraduates or graduates majoring in law, for more professional annotation. Each judgement document was annotated by two different annotators, in order to reduce the accidental error.
As shown in Figure 3, during annotation, the annotators were given the whole statement of both the plaintiff and the defense, with each sentence ordered and numbered. Their task was then two-fold:
• Annotating features of the case. For the given case, annotators were required to specify some basic features of the whole case, including the case type, the type of the crime involved, as well as the entities of the plaintiff and the defense.
• Identifying all the interactive argument pairs in both sides' statements. The annotators were then required to identify all the interactive argument pairs entailed in the given case. Note that the number of such pairs was not constant, so the annotators had to record all the interactive argument pairs by adding them one by one. Furthermore, we classified the argument pairs into four emotional categories: acknowledging, partially acknowledging, simple denying and active denying.

Note that in the second task, besides the identification of interactive argument pairs, the annotators were also required to classify each argument pair collected. The four categories mentioned above represent different emotional polarities of the defense. Specifically, argument pairs of acknowledging generally refer to the ones in which the defense simply offers arguments like "I confess."; partially acknowledging means the defense's argument acknowledges some parts of the plaintiff's argument but denies the others; simple denying contains a simple and direct denial such as "I did not hit the plaintiff."; while active denying is more complicated, and sometimes includes a completely opposite statement on the same topic, e.g., "I did not hit the plaintiff, and instead, the plaintiff hit me with umbrella.". We conducted the classification to make it more convenient for the judge to know which argument pairs need further judgement and evidence. With these annotation standards, an instance of annotation is shown in Figure 4.

Statistics on the Data Set
After six months of annotation, some basic statistics on the data set are shown in Table 1 below. From the table we can see that law major students indeed achieved higher agreement, indicating that professional knowledge helps improve the performance in this task. Another notable point is that interactive argument pairs, compared with all the sentence pairs in the corpus, are of very low density, which brings challenges for automation.

Task Formulation
As mentioned above, the density of interactive argument pairs is very low (compared with all the sentence pairs between two sides), and thus we have to convert the identification task into an easier one. Our approach is to construct a multiple-choice problem for every argument from the plaintiff that occurs in at least one interactive pair, by adding four arguments from the defense that do not match the plaintiff's argument. That is to say, given an argument $sc$ from the plaintiff, a candidate set of the defense's arguments consists of one positive reply $bc^{+}$ and four negative arguments $bc^{-}_{1} \sim bc^{-}_{4}$, along with their corresponding contexts, and our goal is to automatically identify which argument from the defense has an interactive relationship with the one from the plaintiff. We formulated such a task as a 5-way multiple-choice problem. In practice, the participants' models calculated the matching score $S(sc, bc)$ between the plaintiff's argument $sc$ and each argument $bc$ in the candidate set, and treated the one with the highest matching score as the winner. Note that here we did not use the emotional tags we collected before, since we would like to focus mainly on the identification of the correct argument pair in this competition.
Note that this setting naturally requires the statement of the defense to contain no fewer than 5 sentences (or more if more than one argument from the defense interacts with the plaintiff's argument), so some of the entries are discarded, and finally our whole data set comprises 4,080 interactive argument pairs (i.e., multiple-choice problems) from 976 judgement documents. An example is displayed in Table 2 below.
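A minimal sketch of how one annotated pair could be turned into a 5-way multiple-choice problem, and how a system would answer it, is given below. The function names, the dictionary keys and the uniform sampling of negatives are assumptions made for illustration; the text does not specify how the four negative arguments were actually selected.

```python
import random


def build_problem(sc, defense_sentences, positive_idx, interacting_idxs, rng, num_negatives=4):
    """Build one multiple-choice problem, or return None if it must be discarded.

    interacting_idxs holds the indices of ALL defense sentences that interact with sc
    (including positive_idx), so that no true positive is sampled as a negative.
    """
    pool = [i for i in range(len(defense_sentences)) if i not in interacting_idxs]
    if len(pool) < num_negatives:
        return None  # not enough non-interacting sentences: the entry is discarded
    chosen = rng.sample(pool, num_negatives) + [positive_idx]
    rng.shuffle(chosen)  # randomize the position of the correct answer
    return {
        "sc": sc,
        "candidates": [defense_sentences[i] for i in chosen],
        "answer": chosen.index(positive_idx),
    }


def predict(problem, score):
    """Pick the candidate bc with the highest matching score S(sc, bc)."""
    scores = [score(problem["sc"], bc) for bc in problem["candidates"]]
    return max(range(len(scores)), key=scores.__getitem__)
```

Here `score` stands for any matching function supplied by a participant's model; the problem-building code is independent of it.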

Scoring Metric and Data Set Division
For the released multiple-choice task, we take accuracy as the evaluation metric. Specifically, if the ground truth of the $i$-th problem is $y_i$, and the system predicts the answer to be $\hat{y}_i$, then the average accuracy on the test data set of size $n$ is calculated as below:
$$\mathrm{Acc} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}(\hat{y}_i = y_i)$$
For the purpose of testing the systems' generalization more fairly, we organized two phases in the competition and thus divided the data set into three parts, namely SMP-CAIL2020-Argmine_train, SMP-CAIL2020-Argmine_test1, and SMP-CAIL2020-Argmine_test2. The sizes of these data sets are in a ratio of roughly 3:1:1.
In the first phase of the competition, participants were provided with the SMP-CAIL2020-Argmine_train data set to train their systems, and were tested with the SMP-CAIL2020-Argmine_test1 data set. Those who exceeded the performance of the given BERT baseline models were admitted to the second phase. In the second phase, participants were provided with the SMP-CAIL2020-Argmine_test1 data set and tested with the SMP-CAIL2020-Argmine_test2 data set. The participants' final score is $\text{Final Score} = 0.3 \times \text{Score}_1 + 0.7 \times \text{Score}_2$, in which $\text{Score}_1$ and $\text{Score}_2$ denote their scores in the two phases, respectively.
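As a small worked illustration of the scoring, accuracy is the fraction of problems answered correctly, and the final score is the 0.3/0.7 weighted sum of the two phase scores. The function names and the example numbers below are hypothetical.

```python
def accuracy(gold, predicted):
    """Fraction of problems whose predicted answer equals the ground truth."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def final_score(score1, score2, w1=0.3, w2=0.7):
    """Weighted sum of the phase-one and phase-two scores."""
    return w1 * score1 + w2 * score2


# Hypothetical team: 0.80 in phase one and 0.90 in phase two.
example = final_score(0.80, 0.90)  # 0.3 * 0.80 + 0.7 * 0.90
```

The heavier weight on the second phase means that a team's ranking depends mostly on the unseen SMP-CAIL2020-Argmine_test2 data.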

Baseline Models
Before we released the competition, we ran the following baseline models on the data set to obtain the border line for the admission to the second phase. Notice that for every baseline model, we only took the SMP-CAIL2020-Argmine_train data set as the training set.
• All 1
This model directly output answer "1", which was used to examine whether the distribution of the answers was shuffled randomly enough.

• Common Words
This model returned the candidate argument that had most common words with the given argument from the plaintiff, which was a simple and straightforward model leveraging lexical features.
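A minimal sketch of such a lexical baseline follows. It assumes the arguments have already been segmented into word lists and that overlap is counted over distinct words, with ties resolved in favor of the earliest candidate; none of these details is specified in the description above.

```python
def common_words_answer(plaintiff_tokens, candidate_token_lists):
    """Return the index of the candidate sharing the most distinct words with the plaintiff's argument."""
    target = set(plaintiff_tokens)
    overlaps = [len(target & set(tokens)) for tokens in candidate_token_lists]
    # list.index returns the first maximum, so ties go to the earliest candidate.
    return overlaps.index(max(overlaps))
```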

• BiLSTM
This model first conducted word segmentation using Jieba [35], and then concatenated the plaintiff's argument with each candidate argument separately. In this way, we converted the 5-way multiple-choice problem into 5 sentence-pair classification problems. We then randomly discarded three negative sentence pairs so as to make the two classes balanced. For each sentence pair, the embeddings were fed sequentially into a BiLSTM [36,37], and its final hidden state was passed to a linear classifier to output the final prediction. Figure 5(a) shows the model's overall framework.
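The conversion from one multiple-choice problem into balanced sentence-pair examples can be sketched as follows. The function name and the tuple layout `(plaintiff argument, candidate, label)` are assumptions for illustration; segmentation and the BiLSTM itself are left out.

```python
import random


def to_balanced_pairs(sc, candidates, answer, rng, keep_negatives=1):
    """Turn one 5-way problem into labeled sentence pairs, keeping only some negatives."""
    positive = [(sc, candidates[answer], 1)]
    negatives = [(sc, bc, 0) for i, bc in enumerate(candidates) if i != answer]
    # Randomly keep one negative, i.e. discard three of the four, to balance the classes.
    return positive + rng.sample(negatives, keep_negatives)
```

With five candidates and `keep_negatives=1`, each problem contributes exactly one positive and one negative training pair.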

• BERT
BERT [34] is a pretrained language model based on transformers, and has proved to be highly effective on many NLP tasks. In our experiment, we also converted the problem into sentence-pair classification, since it is much easier to apply the BERT model to such a problem. Figure 5(b) shows the model's overall framework.
All baseline models' performance is shown in Table 3 below. Since the best baseline model achieves an accuracy of 0.7476, we set the border line of the first phase at 0.75.

Submissions
The SMP-CAIL2020-Argmine Challenge was hosted on CAIL, which allowed submissions to be scored against the blind test set without the need to publish the correct labels. The two phases of the scoring system were open from June 1 to July 9, and July 10 to August 3, 2020. Participants were limited to 3 submissions per week.

Participants and Results
Over 300 teams from various universities and enterprises registered for SMP-CAIL2020-Argmine, 63 teams submitted their models in the first phase, and 21 teams submitted their final models. The final accuracy shows that neural models can achieve considerable results on the task, especially when given a larger training set. In Table 4, we list the scores of the Top 7 participants of the task. We have collected the technical reports of these contestants. In the following parts, we summarize their methods and tricks according to these reports. The performance of all participants on SMP-CAIL2020-Argmine can be found in Appendix A.

General Architecture
Pretrained Language Model. Ever since BERT [34] was publicly proposed, the whole NLP area has been pushed into a new era, with almost all tasks improved in performance. Moreover, among the baseline models above, BERT gives the best performance on the task, which made pretrained language models such as Sentence-BERT [38], RoBERTa [39], and ERNIE [40] popular in submissions.
Fine-tuning Mechanisms. After leveraging the pretrained models mentioned above to obtain embeddings for tokens and sentences, fine-tuning is needed to further improve the model's performance, including:
• Attention. A natural idea to further fine-tune the representation of the arguments is to leverage the attention mechanism between the plaintiff's argument and each of the five candidate arguments separately.
• RNN Layers. Note that after using the pretrained models, we have token-level, sentence-level as well as sentence-pair-level representations (the representation of [CLS]). Therefore, we can retain the sentence-pair-level representation, feed the tokens' embeddings into another BiLSTM layer, and concatenate the two before the linear classifier.
• Memory Networks. All the methods mentioned above only use the information of the arguments. However, we have provided the whole context of both sides in the judgement documents. Hence, it is plausible to use memory networks [41] to retrieve the context information.

Promising Tricks
Other than the standard "pretrained model + fine-tuning" mode, there are some useful tricks which can address the issues met in the task and improve the sentence pair classification models significantly. We summarize them as follows:
Fine-tuning with external corpus. Teams such as "zero_point", "quanshuizhihuiguan" as well as "tiaodalanmao" all tried to fine-tune their pretrained models by adding an external judicial corpus. Such a method helps improve the model, since an external judicial corpus enables the pretrained language models to learn more topic-specific language knowledge and therefore perform better in judicial settings. As reported by these teams, this method increases the model's accuracy by about 1%.
Data Augmentation and Data Balancing. The "a-U" team followed our way of constructing the multiple choices and generated more multiple-choice questions for training by retrieving more negative samples from the provided contexts of the defense, which helps the model to further leverage the context information and incorporate more textual knowledge. Moreover, to address the problem of data imbalance (too many negative samples), they used over-sampling on positive instances to avoid the model's getting lost in the overwhelming size of negative samples.
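The over-sampling step can be illustrated with the sketch below. It is not the "a-U" team's implementation: the target ratio they used is not reported, so duplicating positives until the two classes are equal in size is an assumption, as are the function name and the `(sc, bc, label)` tuple layout.

```python
import random


def oversample_positives(pairs, rng):
    """Duplicate randomly chosen positive pairs until positives match negatives in number."""
    positives = [p for p in pairs if p[2] == 1]
    negatives = [p for p in pairs if p[2] == 0]
    balanced = list(pairs)
    if positives:
        deficit = max(0, len(negatives) - len(positives))
        balanced.extend(rng.choice(positives) for _ in range(deficit))
    rng.shuffle(balanced)  # avoid long runs of identical duplicated examples
    return balanced
```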
Loss Function. Most models use cross entropy as their loss functions. However, some models adopt more promising loss functions, such as focal loss [42] to enhance the performance on low frequency categories, and triplet loss to improve the model's ability of generalization. Besides, the loss weights of various categories and the activation functions of the output layer also have great influence on the final performance. As is reported by the competitors, such a method transforms the task into an argument pair ranking problem, instead of the classification problem, which helps the model to gain an improvement of over 4%.
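To make the difference between these losses concrete, the sketch below computes each of them for a single example in plain Python. The default focusing parameter `gamma=2.0`, the margin of 1.0, and the score-based (rather than distance-based) form of the triplet loss are assumptions for illustration and are not taken from the participants' reports.

```python
import math


def cross_entropy(p_true):
    """Cross entropy for one example, given the probability assigned to the true class."""
    return -math.log(p_true)


def focal_loss(p_true, gamma=2.0):
    """Focal loss: cross entropy down-weighted by (1 - p_true)^gamma for easy examples."""
    return -((1.0 - p_true) ** gamma) * math.log(p_true)


def triplet_loss(pos_score, neg_score, margin=1.0):
    """Score-based triplet (ranking) loss: zero once the positive outscores the negative by the margin."""
    return max(0.0, margin - pos_score + neg_score)
```

With `gamma=0` the focal loss reduces to cross entropy, while the triplet loss only compares the positive candidate against a negative one, which is what makes it a ranking objective rather than a per-pair classification objective.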
Model Ensembling. Some participants trained several different classification models over different samples from the whole data set, and combined their predictions with majority voting or weighted averaging. Among the participants using such a method, the "a-U" team trained five sequence classification models based on BERT and adopted majority voting to reduce the variance of a single model and thereby improve robustness, which helped them win the second prize of the competition.
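Majority voting over several models' answers can be sketched as follows. The function name and the input layout (one list of predicted answer indices per model) are assumptions; ties are resolved in favor of the answer that appears first among the models, which is a choice made here and not one reported by any team.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model answer lists into one answer list by majority voting."""
    voted = []
    for votes in zip(*predictions):  # votes = answers of all models for one problem
        # most_common keeps first-seen order among equal counts, so ties go to earlier models.
        voted.append(Counter(votes).most_common(1)[0][0])
    return voted


# Three hypothetical models answering three problems.
combined = majority_vote([[0, 1, 2], [0, 2, 2], [1, 2, 3]])
```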

Error Analysis
Here, we inspect the erroneous outputs of our model to identify major causes of mismatches. There are mainly two issues.

Sentence Length Limitation in Pretrained Models.
Pretrained models like BERT have a maximal input length, i.e., they truncate sentence pairs that are too long, which makes the model unable to process all the information entailed in the sentence pair.
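The information loss can be illustrated with a simple longest-first truncation of a token pair, which trims the longer side one token at a time until the pair fits. This mirrors a common truncation strategy but is only a sketch: the function name and the token budget are hypothetical, and special tokens such as [CLS] and [SEP] are ignored.

```python
def truncate_pair(tokens_a, tokens_b, max_tokens):
    """Trim the longer of two token lists from the end until both fit in max_tokens."""
    if max_tokens < 0:
        raise ValueError("max_tokens must be non-negative")
    a, b = list(tokens_a), list(tokens_b)
    while len(a) + len(b) > max_tokens:
        if len(a) > len(b):
            a.pop()  # drop the last token of the longer sequence
        else:
            b.pop()
    return a, b
```

Whatever is dropped from the end of a long argument is never seen by the model, so a decisive clause late in a lengthy statement cannot influence the matching score.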

Entity Mismatch. Among the false cases, errors caused by entity mismatch are quite common. In cases with multiple defendants, the plaintiff may raise different claims against different defendants. However, some of these claims may share the same action mentioned by the plaintiff, which confuses the model when a negative candidate argument contains the detailed action while the positive one only includes a simple denial.

CONCLUSION AND FUTURE WORK
In SMP-CAIL2020-Argmine, we employ the interactive argument-pair extraction in judgement document as the competition topic. In this competition, we construct and release a brand new data set for extracting the focus of dispute in the judgement documents. The performance on the task was significantly raised with the efforts of over 300 participants. In this paper, we summarize the general architecture and promising tricks they employed, which are expected to benefit further research on legal intelligence. However, there is still a long way to go to fully achieve the goal of automatically extracting the focus of dispute since the task is already a simplified one. Also, leveraging some more case-based features such as the type of case and type of crime and the semantic label of the interactive argument pairs may possibly further improve the model's performance.

DATA AVAILABILITY STATEMENT
The data sets generated and analyzed in the study are not currently available to the public due to the fact that the data sets are produced by judicial expert consultants of China Judicial Big Data Institute based on their professional knowledge and experience. The publicly released version of the data sets needs the consent of all expert consultants, and hence currently it can only be accessed from the corresponding author on reasonable request.


APPENDIX A: FULL RANK OF ALL PARTICIPANTS
Full rank of all participants in SMP-CAIL2020-Argmine. $\text{Score}_1$ and $\text{Score}_2$ refer to the scores achieved by the participants in phases I and II, respectively, while Final Score refers to the weighted sum of $\text{Score}_1$ and $\text{Score}_2$.