Overview of CCKS 2020 Task 3: Named Entity Recognition and Event Extraction in Chinese Electronic Medical Records

The China Conference on Knowledge Graph and Semantic Computing (CCKS) 2020 Evaluation Task 3 addressed clinical named entity recognition and event extraction in Chinese electronic medical records. Two annotated data sets and several additional resources for these two subtasks were provided to participants. The competition attracted 354 teams, 46 of which submitted valid results. Pre-trained language models were widely applied in this evaluation task. Data augmentation and external resources also proved helpful.


INTRODUCTION
The China Conference on Knowledge Graph and Semantic Computing (CCKS), founded in 2016, is organized by the Chinese Information Processing Society of China. To promote the development of technologies in knowledge graphs and semantic computing, CCKS provided 8 evaluation tasks in 2020. Of these tasks, Task 3 focuses on named entity recognition (NER) and event extraction (EE) in Chinese electronic medical records (EMRs).

NER and EE are commonly used techniques for acquiring useful information from free text. NER in EMRs is also known as clinical named entity recognition (CNER). With the help of an NER model, we can recognize diseases, drugs and other medical entity names in EMRs. The most popular NER method is sequence labeling, which can be based on long short-term memory (LSTM) [1,2,3] or Bidirectional Encoder Representations from Transformers (BERT) [4]. Clinical event extraction helps us identify medical events in EMRs, such as the tumor site, the tumor size and the sites to which the tumor has metastasized. LSTM, BERT and other methods are also applied in EE.
Traditional NER and EE are based on supervised models. However, annotating clinical information is much harder than annotating general-domain information. Although there are some public medical data sets for the NER task, such as i2b2 [5], ShARe CLEF eHealth [6] and SemEval [7], there are barely any public Chinese medical data sets. To promote the development of semantic analysis of Chinese EMRs, the Knowledge Engineering Group of Tsinghua University and Yiducloud Beijing Technology Co., Ltd. organized this evaluation challenge at CCKS 2020. The data sets of this task, provided by Yiducloud, are restricted to the CCKS evaluation only.

RELATED WORK
CCKS 2020 Task 3 focuses on NER and EE in Chinese EMRs. NER and EE have long been core problems in natural language processing.

Chinese NER
NER is the task of locating and classifying occurrences of words or expressions in unstructured text. In English NER, LSTM-CRF (Conditional Random Field) models [1,2,3] are a classic method that leverages both character-level and word-level representations and can achieve state-of-the-art results. Compared with NER in English, Chinese NER is more difficult, since sentences in Chinese are not naturally segmented. A common practice for Chinese NER is to first perform word segmentation using an existing Chinese word segmentation (CWS) system and then apply a word-based NER model to infer the NER tags. However, this pipeline suffers from error propagation, since CWS errors can affect the performance of NER. Therefore, some approaches directly use a character-based NER model [8,9]. A drawback of the purely character-based NER model is that word information is not fully exploited. To incorporate word information in Chinese NER, some recent methods, such as [10,11,12,13,14], resort to an automatically constructed lexicon.
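To make the character-based formulation concrete, the following minimal sketch converts entity spans into one BIO tag per character, so that no word segmentation step is needed. The span format, the example sentence and the category name "DRUG" are illustrative assumptions and are not taken from the task data.

```python
from typing import List, Tuple

# Hypothetical span format: (start, end_exclusive, category).
Span = Tuple[int, int, str]


def spans_to_bio(text: str, spans: List[Span]) -> List[str]:
    """Assign one BIO tag to every character of the text."""
    tags = ["O"] * len(text)
    for start, end, category in spans:
        tags[start] = f"B-{category}"          # first character of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{category}"          # remaining characters
    return tags


# Illustrative sentence: "The patient takes aspirin".
text = "患者服用阿司匹林"
spans = [(4, 8, "DRUG")]  # "阿司匹林" (aspirin)
for char, tag in zip(text, spans_to_bio(text, spans)):
    print(char, tag)
```

A sequence labeling model (LSTM-CRF or BERT based) would then be trained to predict these per-character tags.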

Event Extraction
Events are a common and important type of knowledge. Therefore, identifying events in text and extracting their arguments is important for many applications. DMCNN [15] is a classic EE model that uses a convolutional neural network (CNN) to learn semantic features from raw text, including lexical-level and sentence-level features. JRNN [16] is a method for EE based on recurrent neural networks (RNNs), which aims to integrate discrete features with automatically learned features. JMEE [17] is a method based on graph convolutional networks (GCNs), which jointly extracts multiple event triggers and arguments by introducing syntactic shortcut arcs to enhance information flow and by using attention-based GCNs to model graph information. Recently, event extraction has been explicitly cast as a machine reading comprehension (MRC) problem [18], and MRC models have been used to solve it.

NER and EE in Clinical Text
Information extraction from clinical text has become increasingly important in recent years. TREC is the first shared task in clinical natural language processing (NLP), which focuses on identifying relevant and irrelevant documents. Other evaluation tasks include ImageCLEFmed [19] and i2b2 [5]. For clinical NER, LSTM units and a conditional random field classifier [20] have been used in the NER component. An unsupervised method [21] has been used to build clinical NER systems that do not require any manual annotations; the models are trained on an automatically annotated corpus, followed by self-training iterations. For EE in clinical text, a bidirectional long short-term memory network assisted by an attention mechanism [22] has been utilized to uncover the important aspects of a patient's medical conditions.

Clinical Named Entity Recognition
Given free text from EMRs, this task aims to identify clinical entity mentions and classify them into pre-defined categories. A novel method is presented for training clinical NER systems that do not require any manual annotations. It requires only a raw text corpus and a resource such as the Unified Medical Language System (UMLS) that provides a list of named entities along with their semantic types. Using these resources, annotations are automatically obtained to train machine learning methods. The methods were evaluated on the NER shared-task data sets of i2b2 2010 and SemEval 2014.

Formalized Definition
We define this task formally. Let $m_i = (d_i, b_i, e_i)$ denote an entity mention in document $d_i$, where $b_i$ and $e_i$ are the start and end positions of $m_i$, respectively, and let $c_{m_i} \in C$ denote the category of $m_i$. Mentions are not allowed to overlap, that is, $e_i < b_{i+1}$.
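The definition above can be sketched as a small data structure with a validity check. This is only an illustration: the class and field names are hypothetical, the end position is assumed to be inclusive, and all mentions passed to the check are assumed to come from the same document.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Mention:
    doc_id: str    # d_i: the document containing the mention
    start: int     # b_i: start position
    end: int       # e_i: end position (assumed inclusive)
    category: str  # c_{m_i}: one of the pre-defined categories


def has_no_overlap(mentions: List[Mention]) -> bool:
    """Check e_i < b_{i+1} for consecutive mentions, ordered by start position."""
    ordered = sorted(mentions, key=lambda m: (m.start, m.end))
    return all(a.end < b.start for a, b in zip(ordered, ordered[1:]))


ok = [Mention("d1", 0, 2, "X"), Mention("d1", 3, 5, "Y")]
bad = [Mention("d1", 0, 3, "X"), Mention("d1", 3, 5, "Y")]
print(has_no_overlap(ok), has_no_overlap(bad))
```

Sorting first means the check does not depend on the order in which mentions were produced.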

Pre-defined Categories
There are 6 categories that are defined as follows.

Formalized Definition
This task is formally defined as follows.

Pre-defined Attributes
The 3 pre-defined attributes are:

DATA SETS
The data sets were provided by Yiducloud Beijing Technology Co., Ltd. Yiducloud organized a professional medical team to annotate the data. The data sets are for the CCKS evaluation only.
Compared with the CNER task in CCKS 2019, the annotated data set is about 4 times larger. In addition, Yiducloud provided an entity vocabulary and a large amount of unannotated data as additional resources that participants could use during the evaluation. The statistics of the CNER data set are shown in Table 1.
The clinical event extraction data set includes a labeled training set, an unlabeled set and a vocabulary, which brings this challenge closer to real-world scenarios. The statistics of the clinical event extraction data set are shown in Table 2.

Strict Metric
There are two evaluation metrics: the strict metric and the relaxed metric. The set of extracted entities is denoted as $S$ and the set of gold entities is denoted as $G$.
Under the strict metric, $s_i \in S$ matches $g_j \in G$ only if they are exactly the same: 1) the start position of $s_i$ equals that of $g_j$; 2) the end position of $s_i$ equals that of $g_j$; 3) the category of $s_i$ equals that of $g_j$.
To access the data sets, please contact the corresponding author after signing the Data Usage Agreement.

The strict Precision, Recall and F1 can be calculated as follows:
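Assuming the standard set-based definitions, strict precision is $P = |S \cap G| / |S|$, strict recall is $R = |S \cap G| / |G|$, and $F1 = 2PR / (P + R)$. The following minimal sketch computes them; the tuple layout, the function name and the example entities are illustrative assumptions rather than the official scorer.

```python
from typing import Set, Tuple

# Hypothetical entity representation: (start, end, category).
Entity = Tuple[int, int, str]


def strict_prf(predicted: Set[Entity], gold: Set[Entity]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1 over entity sets S and G."""
    matched = len(predicted & gold)  # |S ∩ G|: same start, end and category
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


S = {(0, 2, "disease"), (5, 8, "drug"), (10, 12, "drug")}
G = {(0, 2, "disease"), (5, 9, "drug")}
strict_p, strict_r, strict_f1 = strict_prf(S, G)
print(strict_p, strict_r, strict_f1)
```

Here only one predicted entity is an exact match, so the boundary error on the second entity counts against both precision and recall.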

Relaxed Metrics
The relaxed metric does not require $s_i \in S$ and $g_j \in G$ to be exactly the same. They only need to meet the following requirements: 1) the maximum of the start positions of $s_i$ and $g_j$ is less than or equal to the minimum of their end positions, i.e., the two spans overlap; 2) the category of $s_i$ equals that of $g_j$.
The relaxed Precision, Recall and F1 can be calculated as follows:
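A minimal sketch of relaxed scoring follows. It assumes that precision counts the extracted entities that relaxed-match at least one gold entity and that recall counts the gold entities matched by at least one extracted entity; this counting convention, the tuple layout and the function names are assumptions, not the official scorer.

```python
from typing import Set, Tuple

# Hypothetical entity representation: (start, end, category).
Entity = Tuple[int, int, str]


def relaxed_match(s: Entity, g: Entity) -> bool:
    """Spans overlap (max of starts <= min of ends) and categories agree."""
    return max(s[0], g[0]) <= min(s[1], g[1]) and s[2] == g[2]


def relaxed_prf(predicted: Set[Entity], gold: Set[Entity]) -> Tuple[float, float, float]:
    hit_pred = sum(any(relaxed_match(s, g) for g in gold) for s in predicted)
    hit_gold = sum(any(relaxed_match(s, g) for s in predicted) for g in gold)
    precision = hit_pred / len(predicted) if predicted else 0.0
    recall = hit_gold / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


S = {(0, 2, "disease"), (5, 8, "drug"), (10, 12, "drug")}
G = {(0, 2, "disease"), (5, 9, "drug")}
relaxed_p, relaxed_r, relaxed_f1 = relaxed_prf(S, G)
print(relaxed_p, relaxed_r, relaxed_f1)
```

In this example the entity (5, 8, "drug") overlaps the gold entity (5, 9, "drug") and is therefore credited, although it would be rejected under the strict metric.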

Clinical Event Extraction
An event attribute may have more than one attribute entity. Precision, Recall and F1 are calculated over attribute entities rather than over attributes.
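As an illustration of scoring at the attribute-entity level, the sketch below flattens each event into (attribute, entity) pairs and scores the pairs. Exact string matching, the attribute names and the example values are assumptions made for this sketch only.

```python
from typing import Dict, Set, Tuple

# Hypothetical event representation: attribute name -> set of attribute entities.
Event = Dict[str, Set[str]]


def flatten(event: Event) -> Set[Tuple[str, str]]:
    """Turn an event into a set of (attribute, entity) pairs."""
    return {(attr, ent) for attr, ents in event.items() for ent in ents}


def attribute_entity_prf(pred: Event, gold: Event) -> Tuple[float, float, float]:
    pred_pairs, gold_pairs = flatten(pred), flatten(gold)
    matched = len(pred_pairs & gold_pairs)
    precision = matched / len(pred_pairs) if pred_pairs else 0.0
    recall = matched / len(gold_pairs) if gold_pairs else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative values: right lung; liver and bone.
gold_event = {"primary_site": {"右肺"}, "metastasis_site": {"肝", "骨"}}
pred_event = {"primary_site": {"右肺"}, "metastasis_site": {"肝"}}
ee_p, ee_r, ee_f1 = attribute_entity_prf(pred_event, gold_event)
print(ee_p, ee_r, ee_f1)
```

Because each entity is counted separately, missing one of two entities for the same attribute lowers recall without zeroing out the whole attribute.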

RESULTS AND DISCUSSION
This evaluation attracted 354 teams, and 46 of them successfully submitted results. Thirty-two teams submitted results on the clinical named entity recognition task, along with 5 evaluation papers. Fourteen teams submitted results on the clinical event extraction task, along with 3 papers. We list the top teams in Table 3 and Table 4.

Clinical Named Entity Recognition
For the clinical named entity recognition task, the Top 1 and Top 2 teams achieved very close scores. Both focused on the label inconsistency problem in CNER.
The Top 1 team comes from the Institute of Automation, Chinese Academy of Sciences (CASIA) and Unisound AI Technology Co., Ltd. They proposed a hybrid system composed of a semi-supervised noisy label learning model based on adversarial training and a rule-based post-processing module. They adopted a five-fold cross-voting mechanism to handle the annotation inconsistency problem in the data set. They used model ensembling and semi-supervised training to alleviate the problem of insufficient training data. They also applied adversarial training to decrease aleatoric uncertainty and epistemic uncertainty simultaneously.
Based on the submitted papers, we have come to the following conclusions.
1). Pre-trained language models (PLMs) such as BERT or ELMo [23] are widely applied. Using PLMs has become common practice among participants. Most teams did not simply apply the general BERT model, but used models pre-trained on Chinese documents, such as RoBERTa-wwm [24]. Furthermore, some of them collected Chinese medical documents and pre-trained PLMs on these in-domain documents. The usage of PLMs in this year's evaluation challenge is more diverse than in previous competitions held at CCKS.

2). Model ensemble. Most teams applied this technique in their submissions. Ensembled models usually achieve better results than a single model.

3). Feature engineering and rules are still valuable. In the clinical domain, there are many regular patterns and little annotated data. Therefore, participants can benefit from feature engineering. Some of the teams added Chinese word features to their models and gained stable improvements. They also introduced rules to alleviate the data noise.

4). Semi-supervised methods. This evaluation provides 1,000 unlabeled instances as additional resources. Some participants generated pseudo labels for the unlabeled data with a supervised model and trained the final model on both the labeled and the pseudo-labeled data.

5). Adversarial training. There is some unavoidable label noise in the training data. To train a robust model that is not sensitive to the noise, some teams added perturbations to the word embeddings during training.

6). Domain vocabulary. Vocabulary is usually an important resource for the CNER task. In past CNER evaluations, participants usually collected and extended clinical vocabularies in various ways. The most popular vocabularies include ICD-10, the DrugBank database and health websites such as "haodf.com" and "xywy.com". However, the top 3 teams this year did not apply any vocabularies in their models. The main reason is their extensive use of PLMs. Replacing vocabularies with PLMs may become a trend.
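As one simple illustration of model ensembling, the sketch below applies token-level majority voting to the tag sequences predicted by several models. This is a generic technique shown under assumptions; it is not a description of any team's actual ensemble, and the function name, tie-breaking rule and example tags are hypothetical.

```python
from collections import Counter
from typing import List


def majority_vote(predictions: List[List[str]]) -> List[str]:
    """Token-level majority vote; ties go to the earliest model in the list."""
    voted = []
    for tags in zip(*predictions):           # tags of all models at one position
        counts = Counter(tags)
        best = max(counts.values())
        voted.append(next(t for t in tags if counts[t] == best))
    return voted


# Hypothetical per-character outputs of three models for the same input.
model_outputs = [
    ["B-DRUG", "I-DRUG", "O"],
    ["B-DRUG", "O", "O"],
    ["B-DRUG", "I-DRUG", "B-DRUG"],
]
ensemble_tags = majority_vote(model_outputs)
print(ensemble_tags)
```

Voting per token can produce tag sequences that are not valid BIO sequences, so a practical system might instead vote on whole entity spans or repair the sequence afterwards.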

Clinical Event Extraction
For the clinical event extraction task, the Top 1 team achieved an F1 score of 0.76234 and the Top 2 team achieved an F1 score of 0.74597. The competition was fierce. The Top 1 team comes from the Knowledge Graph Group, Baidu, Inc. They proposed a system based mainly on a pre-trained language model. They applied domain adaptation and task adaptation during pre-training in order to improve the modeling ability of the pre-trained language model. To handle the challenge of insufficient training data, they applied back translation to expand the training data. They also used the entity vocabulary as model input.
Based on the submitted papers, we have come to the following conclusions.

1). Pre-trained language models are widely used. As in the CNER task, the top 2 teams applied PLMs in their models. Both chose RoBERTa [25] as the backbone. The usefulness of PLMs has been demonstrated.

2). Data augmentation. The annotation of clinical documents is very difficult. Therefore, there is not enough labeled data for training. The top teams tried various data augmentation methods to enhance the robustness of their models. Some of them doubled the training set by randomly rearranging the sentence order in each training instance. Another team applied the back translation strategy to double the training set: they translated each training instance into English and then translated it back into Chinese. Other teams tried random replacement of key information in the whole field.
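The sentence-order augmentation described above can be sketched as follows. The sketch assumes that labels are attribute entity strings that remain valid when sentences are reordered (character-offset labels would need remapping), and that sentences end with the Chinese full stop or semicolon; the function names and the example text are hypothetical.

```python
import random
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Split on Chinese full stops and semicolons, keeping the delimiters."""
    return re.findall(r"[^。；]*[。；]|[^。；]+", text)


def shuffle_sentences(text: str, seed: int) -> str:
    """Return a copy of the text with its sentences in random order."""
    sentences = split_sentences(text)
    rng = random.Random(seed)   # seeded for reproducible augmentation
    rng.shuffle(sentences)
    return "".join(sentences)


# Illustrative text: tumor in the right lung; about 3 cm; no metastasis seen.
original = "肿瘤位于右肺。大小约3cm。未见转移。"
augmented = shuffle_sentences(original, seed=0)
print(augmented)
```

Adding one shuffled copy per training instance doubles the training set while leaving every sentence intact.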

Table 1. The statistics of the clinical named entity recognition data set.

Table 2. The statistics of the clinical event extraction data set.

Table 3. The results of clinical named entity recognition.

Table 4. The results of clinical event extraction.