Semi-Supervised Noisy Label Learning for Chinese Medical Named Entity Recognition

This paper describes our approach to the Chinese medical named entity recognition (MER) task organized as part of the 2020 China Conference on Knowledge Graph and Semantic Computing (CCKS) competition. In this task, we need to identify the entity boundaries and category labels of six entity types in Chinese electronic medical records (EMRs). We construct a hybrid system composed of a semi-supervised noisy-label learning model based on adversarial training and a rule-based post-processing module. The core idea of the hybrid system is to reduce the impact of data noise by optimizing the model results. In addition, we use post-processing rules to correct three cases in the model's predictions: redundant labeling, missing labeling, and wrong labeling. Our method achieved 0.9156 under the strict criteria and 0.9660 under the relaxed criteria on the final test set, ranking first.


Evaluation task
This task is a continuation of the series of evaluations carried out by CCKS around the semantics of Chinese electronic medical records, extending and expanding the related evaluation tasks of CCKS 2017, 2018, and 2019. Given a set of plain-text EMR documents, the 2020 Chinese medical MER task is to extract entity mentions and classify them into six predefined entity types: disease & diagnosis, imaging examination, laboratory examination, operation, drug, and anatomy.

Dataset
The CCKS 2020 Medical Named Entity Recognition Competition provides 1,050 labeled documents as the training set. The data includes labels for six entity types: disease & diagnosis, imaging examination, laboratory examination, operation, drug, and anatomy. In addition, the evaluation task provides 1,000 unlabeled documents. Statistics on the number of entities in the training set are shown in Table 1:

Overview of Main Challenges and Solutions
Compared with named entity recognition (NER) in the general domain [1], MER faces many new challenges. This paper introduces a modeling strategy for the two main challenges in this competition.
The first challenge is inconsistent entity labeling. Annotators from different medical departments may understand the labeling standard differently, so labels following different standards are likely to appear. In the dataset of this task, we do notice apparent inconsistencies in entity labeling. For example, the string 白细胞数 (white blood cell count) is in some samples labeled wholly as 白细胞数 (white blood cell count), while in other samples only the substring 白细胞 (white blood cell) is labeled. We do not know which standard is used in the test set. According to our estimation, about 13.69% of entities may be involved in inconsistent labeling, which seriously affects the model's final test performance. This phenomenon is difficult to circumvent with rules, nor can we directly correct the inconsistent entities in the training set.
The second challenge is that the lack of training data leads to inconsistent model results. Due to the social sensitivity of data in the medical field, it is often difficult for researchers to obtain sufficient labeled data. The lack of annotated data is generally considered to lead to long-tail phenomena and poor model generalization. When training data is insufficient, the model's predictions may change drastically with different model parameters. How should we maintain the consistency of model results in the absence of sufficient training data?
This paper proposes a hybrid system composed of a semi-supervised noisy-label learning model based on adversarial training and a rule-based post-processing module. The overall process of the system is shown in Figure 1. We introduce a five-fold cross-voting mechanism to deal with annotation inconsistency in the dataset. A model ensemble mechanism and a semi-supervised training mechanism help cope with the unstable model results caused by the lack of training data. In addition, an adversarial training mechanism is effective against both challenges. The results on the official test set show that our method achieved the highest score in the CCKS 2020 MER task: 0.9156 under the strict criteria and 0.9660 under the relaxed criteria.

Adversarial Training
Figure 1: The overall process of our system.

An adversarial sample [2] is an input sample to which small perturbations, difficult for humans to detect, have been added; such attacks can seriously interfere with the predictions of a neural network. Adversarial training aims to train a more robust and better-generalizing model by continuously defending against adversarial samples [3]. Madry et al. [3] defined adversarial training from an optimization perspective: find a small perturbation that maximizes the training loss, then optimize the model parameters θ to reduce the loss, and iterate against the current attack until convergence.
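Madry et al.'s definition is usually written as a saddle-point problem; a standard rendering (with δ the perturbation, ε the perturbation budget, and L the training loss) is:

```latex
\min_{\theta} \; \mathbb{E}_{(x,y)\sim\mathcal{D}} \Big[ \max_{\|\delta\|\le\epsilon} L(\theta,\, x+\delta,\, y) \Big]
```

The inner maximization finds the worst-case perturbation within the budget, and the outer minimization fits θ against that worst case.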

Semi-supervised Learning
Semi-supervised learning employs a small amount of labeled data as a supervised signal and combines numerous unlabeled data to achieve data augmentation. It has high application value and research value in fields where labeled data acquisition is expensive, such as medicine.
We use a semi-supervised training mechanism to incorporate the unlabeled data provided by the CCKS organizer into the training process, which alleviates the lack of annotated data to a certain extent.

Basic Model Structure
Our basic model structure is shown in Figure 2. The input sequences obtain their embedding representations from the pre-trained model [4]. A BiLSTM [5,6,7] then encodes the embeddings in context, and a CRF [8,9] decodes the contextual representation. Finally, the annotation result is obtained.
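The CRF decoding step amounts to a Viterbi search over tag sequences. A minimal pure-Python sketch is shown below; the per-step emission scores stand in for the BiLSTM outputs, and the transition scores for the learned CRF matrix (both are illustrative, not the paper's trained parameters).

```python
# Minimal Viterbi decoder for a linear-chain CRF.
def viterbi_decode(emissions, transitions):
    """emissions: list of per-step score dicts {tag: score};
    transitions: dict {(prev_tag, tag): score}. Returns the best tag path."""
    tags = list(emissions[0])
    # Initialize with the first step's emission scores.
    score = {t: emissions[0][t] for t in tags}
    backptr = []
    for emit in emissions[1:]:
        new_score, ptr = {}, {}
        for t in tags:
            # Best previous tag for reaching tag t at this step.
            best_prev = max(tags, key=lambda p: score[p] + transitions[(p, t)])
            new_score[t] = score[best_prev] + transitions[(best_prev, t)] + emit[t]
            ptr[t] = best_prev
        score, backptr = new_score, backptr + [ptr]
    # Trace back the highest-scoring path from the best final tag.
    last = max(tags, key=lambda t: score[t])
    path = [last]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

In the full model, the transition matrix is learned jointly with the BiLSTM, which lets the decoder reject invalid tag sequences such as an I- tag following O.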
We tried five different pre-trained models. A pre-trained model brings richer semantic representations; the large amount of world knowledge, common-sense knowledge, and grammatical knowledge contained therein can play a role similar to data expansion.

Five-fold Cross-voting
We use five-fold cross-validation to divide the training set into five different datasets, in which the entity-labeling inconsistencies vary. We fix the same model structure, train five models on the five training sets, and integrate their predictions on the same test set by hard voting.
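The hard-voting step can be sketched as a position-wise majority vote over the five models' tag sequences; the tag sequences below are illustrative:

```python
from collections import Counter

def hard_vote(predictions):
    """predictions: list of tag sequences (one per model), all of equal length.
    Returns the majority tag at each position."""
    voted = []
    for position_tags in zip(*predictions):
        # Counter.most_common breaks ties by first-seen order.
        voted.append(Counter(position_tags).most_common(1)[0][0])
    return voted
```

A tie-breaking policy would matter with an even number of voters; with five models per vote, a strict majority usually exists.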

Model Ensemble
To further reduce the impact of the randomness of model parameters on the prediction results, we ensemble a variety of models through voting, weakening the performance fluctuations caused by any single model's parameter changes. Figure 3 shows the process of the model ensemble combined with five-fold cross-voting. There are two possible voting orders. In the order indicated by the red box, the five models trained on the same fold are fused first, and the five fold-level fusion results are then fused, for a total of 25 models. In the order indicated by the green box, for each model structure the five models trained on the five folds are fused first, and the five structure-level fusion results are then fused, also for a total of 25 models. Because the two orders give similar results, we follow the order represented by the green box by default.

Semi-supervised Training
The semi-supervised training process is divided into two stages. In the first stage, the model is trained on all 1,050 labeled samples and then used to predict pseudo-labels for the 1,000 unlabeled samples. In the second stage, the obtained pseudo-labeled data are added to the training set to train the final model.
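The two stages can be sketched as follows. The model interface (`fit`/`predict` returning a tag sequence and a confidence score) and the 0.9 confidence threshold are assumptions for illustration; the paper does not specify how pseudo-labels were filtered.

```python
def two_stage_train(model, labeled, unlabeled, threshold=0.9):
    """labeled: list of (text, tags); unlabeled: list of texts.
    Returns the pseudo-labeled pairs added in stage two."""
    model.fit(labeled)                       # stage 1: supervised training
    pseudo = []
    for text in unlabeled:
        tags, confidence = model.predict(text)
        if confidence >= threshold:          # keep only confident predictions
            pseudo.append((text, tags))
    model.fit(labeled + pseudo)              # stage 2: retrain on the union
    return pseudo
```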

Adversarial Training
Following the FGM [10] adversarial training mechanism, we directly impose a small perturbation on the model's embedding representation. Denote the embedding representation of the input text sequence [v_1, v_2, ..., v_T] as x. The small perturbation r_adv applied each time is

r_adv = ε · g / ||g||_2,  where g = ∇_x L(x, y; θ).

This moves the input one step in the direction of rising loss, i.e., the direction in which the model loss increases fastest, thus forming an attack. In turn, the model must find more robust parameters during optimization to defend against these adversarial samples.
Applying a small perturbation to the embedding representation simulates, to a certain extent, the natural labeling errors in the dataset, and encourages the model to find more robust parameters during training. The embedding representation is then optimized together with the rest of the model. Adversarial training makes the model more tolerant of changes caused by parameter fluctuations, thereby reducing the impact of data noise.
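FGM's perturbation is r_adv = ε · g / ||g||_2, with g the gradient of the loss with respect to the input. A numeric sketch on a toy squared-error loss L(x) = (w·x − y)² is shown below; the toy loss, hand-derived gradient, and ε value are illustrative stand-ins, not the paper's actual BERT embedding setup.

```python
def fgm_perturbation(x, w, y, eps=0.1):
    """FGM step for the toy loss L(x) = (w.x - y)^2.
    The gradient w.r.t. x is g = 2 * (w.x - y) * w; the perturbation is
    eps times the unit vector in the gradient direction."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    g = [2.0 * (dot - y) * wi for wi in w]
    norm = sum(gi * gi for gi in g) ** 0.5 + 1e-12  # small constant avoids /0
    return [eps * gi / norm for gi in g]
```

By construction, ||r_adv|| = ε and the loss at x + r_adv is (locally) larger than at x, which is exactly the attack the model must then defend against.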

Post-processing Rules
If an entity has multiple labeling standards, the ratio between those labeling standards in the test set should be consistent with that in the training set. Based on this assumption, entities in the prediction results whose distribution is inconsistent with the training set can be directly screened out. We then subdivide the selected entities into the three cases of redundant labeling, missing labeling, and wrong labeling, and build a redundant-labeling dictionary, a missing-labeling dictionary, and a wrong-labeling dictionary for correction.
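The correction step can be sketched as dictionary lookups applied to each predicted entity. The three dictionaries and their example entries below are hypothetical placeholders for the ones built from the training-set distribution:

```python
# Hypothetical correction dictionaries (span fixes and a type fix).
REDUNDANT = {"白细胞数值": "白细胞数"}   # over-long span -> trimmed span
MISSING = {"白细胞": "白细胞数"}         # truncated span -> extended span
WRONG = {"白细胞数@drug": "白细胞数@laboratory examination"}  # type correction

def correct_entity(mention, etype):
    """Apply the three correction dictionaries to one predicted entity."""
    if mention in REDUNDANT:
        mention = REDUNDANT[mention]
    elif mention in MISSING:
        mention = MISSING[mention]
    key = f"{mention}@{etype}"
    if key in WRONG:
        mention, etype = WRONG[key].split("@", 1)
    return mention, etype
```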

Evaluation Metrics
There are two F1 criteria for this task. Under the strict F1 criteria, a prediction is correct only when both the entity boundary and the entity type are consistent with the gold answer. Under the relaxed F1 criteria, a prediction is correct when the entity type is consistent with the gold answer and the entity boundary overlaps with the gold boundary. To reflect model performance more accurately, we only use the strict F1 criteria in local evaluation.

Pre-processing
We perform the following pre-processing for each piece of data:

Sentence Segmentation
Since the maximum input sequence length of the BERT model is only 512, the input medical record text is segmented, while preserving relatively complete semantic information, so that each input's length is less than 512.
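A simple way to do this is to split at Chinese sentence-final punctuation and greedily pack sentences into chunks under the limit. The sketch below assumes a 510-character budget (512 minus the [CLS]/[SEP] tokens); the punctuation set and packing strategy are assumptions, as the paper does not detail them:

```python
import re

def segment(text, max_len=510):
    """Split text at sentence-final punctuation, then greedily pack
    consecutive sentences into pieces no longer than max_len."""
    # The lookbehind keeps each delimiter attached to the sentence it ends.
    sentences = re.split(r"(?<=[。！？；])", text)
    pieces, current = [], ""
    for sent in sentences:
        if len(current) + len(sent) <= max_len:
            current += sent
        else:
            if current:
                pieces.append(current)
            current = sent
    if current:
        pieces.append(current)
    return pieces
```

A single sentence longer than max_len would still need a further hard split (e.g., at commas), which this sketch omits.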

Text Normalization
This step mainly unifies full-width and half-width text and symbols in the input medical record, converts English letter case, and removes invisible characters.
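A minimal normalization sketch is shown below; the exact rules used in the paper are not specified, so the full-width-to-half-width mapping, lowercasing, and invisible-character filter are representative assumptions:

```python
def normalize(text):
    """Map full-width ASCII to half-width, lowercase English letters,
    and drop invisible control/format characters."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:                 # full-width space -> ASCII space
            ch = " "
        elif 0xFF01 <= code <= 0xFF5E:     # full-width ASCII -> half-width
            ch = chr(code - 0xFEE0)
        if ch.isascii() and ch.isalpha():
            ch = ch.lower()                # unify English case
        if ch.isprintable() or ch in "\n\t":
            out.append(ch)                 # drop invisible characters
    return "".join(out)
```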

Results
We divided the 1,050 training samples into five folds according to the five-fold cross-validation method; each fold contains 840 training samples and 210 development samples. Table 3 shows the results on the local development sets; the reported numbers are the average F1 over the five local development sets. In all tables of this paper, we abbreviate Semi-supervised Training as ST, Adversarial Training as AT, and Post-processing Rules as PR.
It can be noticed from Table 3 that the model ensemble mechanism, the semi-supervised training mechanism, and the adversarial training mechanism each bring significant improvements over the basic model. Furthermore, combining the three mechanisms achieves the best result.
The results on the official test set are shown in Table 4. We call BERT-base+BiLSTM+CRF the Single Model. The Single Model's official score is 0.0384 higher than its local score, indicating that entity-annotation inconsistency on the official test set may be much less severe than in the training set. In the final model, we used the five-fold cross-voting mechanism for each model in the fusion to reduce the impact of the lack of training data. It is worth noting that although the overall improvement brought by the post-processing rules is not apparent on the local development set, they bring significant improvements of 0.0242 and 0.0414 on the two classes with fewer entities.

Conclusion and Future Work
To address the two core challenges in this task's dataset, inconsistent entity annotations and the lack of annotated data, we introduced semi-supervised data augmentation and adversarial training methods, which achieved the best performance.
For the MER task, precisely quantifying the annotation inconsistency in the data and enabling the model to better overcome this noise are our future research goals.