Abstract
When an entity name contains other names within it, the identification of all combinations of names can become difficult and expensive. We propose a new method to recognize not only outermost named entities but also inner nested ones. We design an objective function for training a neural model that treats the tag sequence for nested entities as the second-best path within the span of their parent entity. In addition, we provide a decoding method for inference that extracts entities iteratively, from outermost ones to inner ones, in an outside-to-inside way. Our method has no additional hyperparameters beyond those of the conditional random field-based model widely used for flat named entity recognition tasks. Experiments demonstrate that our method performs better than or at least as well as existing methods capable of handling nested entities, achieving F1-scores of 85.82%, 84.34%, and 77.36% on the ACE-2004, ACE-2005, and GENIA datasets, respectively.
1 Introduction
Named entity recognition (NER) is the task of identifying text spans associated with proper names and classifying them according to their semantic class such as person or organization. NER, or in general the task of recognizing entity mentions, is one of the first stages in deep language understanding, and its importance has been well recognized in the NLP community (Nadeau and Sekine, 2007).
One popular approach to the NER task is to regard it as a sequence labeling problem. In this case, it is implicitly assumed that mentions are not nested in texts. However, names often contain entities nested within themselves, as illustrated in Figure 1, which contains 3 mentions of the same type (PROTEIN) in the span “… in Ca2+ -dependent PKC isoforms in …”, taken from the GENIA dataset (Kim et al., 2003). Name nesting is common, especially in technical domains (Alex et al., 2007; Byrne, 2007; Wang, 2009). The assumption of no nesting leads to loss of potentially important information and may negatively impact subsequent downstream tasks. For instance, a downstream entity linking system that relies on NER may fail to link the correct entity if the entity mention is nested.
Various approaches to recognizing nested entities have been proposed. Many of them rely on producing and rating all possible (sub)spans, which can be computationally expensive. Wang and Lu (2018) provided a hypergraph-based approach that considers all possible spans. Sohrab and Miwa (2018) proposed a neural exhaustive model that enumerates and classifies all possible spans. These methods, however, achieve high performance at the cost of high time complexity. To reduce the running time, they set a threshold to discard longer entity mentions. If this hyperparameter is set low, running time is reduced but longer mentions are missed. In contrast, Muis and Lu (2017) proposed a sequence labeling approach that assigns tags to gaps between words, which handles sequences efficiently using Viterbi decoding. However, this approach suffers from structural ambiguity during inference, as explained by Wang and Lu (2018). Katiyar and Cardie (2018) proposed another hypergraph-based approach that learns the structure in a greedy manner. However, their method uses an additional hyperparameter as the threshold for selecting multiple mention candidates, which affects the trade-off between recall and precision.
In this paper, we propose new learning and decoding methods to extract nested entities without any additional hyperparameters. We summarize our contributions as follows:
- We describe a decoding method that iteratively recognizes entities from outermost ones to inner ones without structural ambiguity. It recursively searches the span of each extracted entity for inner nested entities using the Viterbi algorithm. This algorithm does not require hyperparameters for the maximal length or number of mentions considered.
- We also provide a novel learning method that ensures the aforementioned decoding. Models are optimized based on an objective function designed according to the decoding procedure.
- Empirically, we demonstrate that our method performs better than or at least as well as the current state-of-the-art methods, achieving F1-scores of 85.82%, 84.34%, and 77.36% on three standard datasets: ACE-2004,[1] ACE-2005,[2] and GENIA.
2 Method
In this study, we propose applying conditional random fields (CRFs; Lafferty et al., 2001), which are commonly used for flat NER (Lample et al., 2016; Ma and Hovy, 2016; Chiu and Nichols, 2016; Reimers and Gurevych, 2017; Strubell et al., 2017; Akbik et al., 2018), to nested NER. We first explain our usage of CRFs, which is the basis of our decoding and training methods. Then, we introduce our decoding and training methods. Both focus on the output layer of neural architectures and can therefore be combined with any neural model.
2.1 Usage of CRF
Our decoding and training methods are based on two key points about our usage of CRFs. The first is that we prepare a separate CRF for each entity type. This enables our method to handle the situation where the same mention span is assigned multiple entity types; the GENIA dataset indeed has such mention spans. In the literature, Muis and Lu (2017) demonstrated that this approach of multiple CRFs performs better than the standard approach of a single CRF for all entity types, on nested NER datasets and even on a flat NER dataset. The second is that each element of the transition matrix of each CRF is fixed: 0 if it corresponds to a legal transition (e.g., B-X to I-X in the IOBES tagging scheme, where X is the name of an entity type) and −∞ if it corresponds to an illegal one (e.g., O to I-X). This is helpful for keeping the scores of tag sequences containing outer entities higher than those of tag sequences containing inner entities.
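The following is a minimal sketch (ours, not the authors' code) of such a fixed transition matrix for a single entity type under the IOBES scheme; `NEG_INF` is a finite stand-in for −∞, and all names are our own:

```python
import numpy as np

TAGS = ["O", "B", "I", "E", "S"]  # IOBES tags for one entity type
NEG_INF = -1e4                    # finite stand-in for -infinity

def legal(prev: str, curr: str) -> bool:
    """IOBES constraints: an open mention (B/I) must continue (I) or
    close (E); otherwise a new mention may start (B/S) or we stay out (O)."""
    if prev in ("B", "I"):
        return curr in ("I", "E")
    return curr in ("O", "B", "S")

# Fixed transition matrix A^(k): 0 for legal transitions, -inf for illegal.
A = np.array([[0.0 if legal(p, c) else NEG_INF for c in TAGS] for p in TAGS])

# Boundary scores: [S] behaves like O (nothing open yet), and [E] is
# reachable only from tags that close or avoid a mention (O/E/S).
start_ok = np.array([0.0 if legal("O", c) else NEG_INF for c in TAGS])
end_ok = np.array([0.0 if legal(p, "O") else NEG_INF for p in TAGS])
```

Because every legal transition scores 0, the transitions never prefer one legal path over another; only the emission scores differentiate legal paths, which is what the decoding argument in the next subsection relies on.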
2.2 Decoding
We use three strategies for decoding. First, we consider each entity type separately using multiple CRFs, which makes it possible to handle the situation where the same mention span is assigned multiple entity types. Second, our decoder searches for nested entities in an outside-to-inside way,[3] which enables efficient processing by eliminating non-entity spans at an early stage. More specifically, our method recursively narrows down the spans to Viterbi-decode. The spans to Viterbi-decode are dynamically decided according to the preceding Viterbi-decoding result: only the spans that have just been recognized as entity mentions are Viterbi-decoded again. Third, we use the same scores of Equation (1) to extract both outermost entities and inner entities without re-encoding, which makes inference more efficient. These three strategies are deployed and completed entirely in the output layer of neural architectures (see the sketch at the end of this subsection).
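Equation (1) itself is not reproduced in this excerpt; presumably it is the standard linear-chain CRF path score for entity type k, combining emission scores h with the fixed transition matrix A^(k) of Section 2.1 (the notation below is ours, written only to make the later references concrete):

```latex
s^{(k)}(\mathbf{y}) \;=\; \sum_{t=1}^{T} h^{(k)}_{t,\,y_t} \;+\; \sum_{t=0}^{T} A^{(k)}_{y_t,\,y_{t+1}},
\qquad y_0 = [\mathrm{S}],\; y_{T+1} = [\mathrm{E}].
```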
We describe the pseudo-code of our decoding method in Algorithm 1. Also, we depict the overview of our decoding method with an example in Figure 2. We use the term level in the sense of the depth of entity nesting. [S] and [E] in Figure 2 stand for the START and END tags, respectively. We always attach these tags to both ends of every sequence of IOBES tags in Viterbi-decoding.
Figure 2: Overview of our second-best path decoding algorithm to iteratively find nested entities.
We explain the decoding procedure and mechanism in detail below. We consider each entity type separately and iterate the same decoding process over the distinct entity types, as described in Algorithm 1. In the decoding process for each entity type k, we first calculate the CRF scores over the entire sentence. Next, we decode a sequence with the standard 1-best Viterbi decoding, as with the conventional linear-chain CRF. In the example of Figure 2, “Ca2+ -dependent PKC isoforms” is extracted at the 1st level.
Then, we start our recursive decoding to extract nested entities within previously extracted entity spans by finding the 2nd best path. In Figure 2, the span “Ca2+ -dependent PKC isoforms” is processed at the 2nd level. Here, if we searched for the best path within each span, the same tag sequence would be obtained, even though the processed span is different. This is because we continue using the same scores and because all the values of A(k) corresponding to legal transitions are equal to 0. In the example of Figure 2, the score of the transition from [S] to B-P at the 2nd level is equal to the score of the transition from O to B-P at the 1st level. The same holds for the transition from E-P to [E] at the 2nd level and the one from E-P to O at the 1st level. The best path between the [S] and [E] tags is therefore identical to the best path between the two O tags under our restriction on the transition matrix of the CRF. Hence, we search for the 2nd best path within the span by utilizing the N-best Viterbi A* algorithm (Seshadri and Sundberg, 1994; Huang et al., 2012).[4] Note that our situation is different from normal situations where N-best decoding is needed: we already know the best path within the span and want to find only the 2nd best path. Thus, we can extract nested entities by finding the 2nd best path within each extracted entity. In the example of Figure 2, “PKC isoforms” is extracted from the span “Ca2+ -dependent PKC isoforms” at the 2nd level.
We continue this recursive decoding until no multi-token entities are detected within a span. In Figure 2, the span “PKC isoforms” is processed at the 3rd level. At the 3rd or deeper levels, the tag sequence of the grandparent level is no longer the best path or the 2nd best path, because the start or end position of the current span is in the middle of the entity mention span at the grandparent level. In the example shown in Figure 2, the word “PKC” is tagged I-P at the 1st level, and the transition from [S] to I-P is illegal. The scores of paths that include illegal transitions cannot be larger than those of paths that consist of only legal transitions, because the elements of the transition matrix A(k) corresponding to illegal transitions are set to −∞. That is why, at all levels below the 1st level, we only need to find the 2nd best path.
This recursive processing stops when no entities are predicted or when only single-token entities are detected within a span.[5] In Figure 2, the span “PKC” is not processed any further because it is a single-token entity.
Only one nested entity is extracted within each decoded span in Figure 2, but there can be cases where multiple multi-token entities are detected within a decoded span. In such cases, our algorithm Viterbi-decodes each of their spans in a depth-first manner. The aforementioned processing is executed for all entity types, and all detected entities are returned as the output.
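The following is a minimal runnable sketch of this outside-to-inside decoding for a single entity type, reusing `TAGS`, `A`, `start_ok`, and `end_ok` from the sketch in Section 2.1. It is our illustration, not the authors' implementation: brute-force enumeration stands in for the Viterbi and N-best Viterbi A* searches, which is feasible only for short spans.

```python
from itertools import product
import numpy as np

def path_score(tags, emit):
    """Score one IOBES tag sequence over a span: boundary transitions
    ([S]->first, last->[E]), emissions, and the fixed transitions A."""
    ids = [TAGS.index(t) for t in tags]
    s = start_ok[ids[0]] + end_ok[ids[-1]]
    s += sum(emit[t][i] for t, i in enumerate(ids))
    s += sum(A[ids[t], ids[t + 1]] for t in range(len(ids) - 1))
    return s

def n_best(emit, n):
    """Brute-force stand-in for (N-best) Viterbi decoding of a span."""
    ranked = sorted(product(TAGS, repeat=len(emit)),
                    key=lambda p: path_score(p, emit), reverse=True)
    return [list(p) for p in ranked[:n]]

def spans_of(tags):
    """IOBES tag list -> (start, end) mention spans, end exclusive."""
    spans, start = [], None
    for i, t in enumerate(tags):
        if t == "S":
            spans.append((i, i + 1))
        elif t == "B":
            start = i
        elif t == "E" and start is not None:
            spans.append((start, i + 1))
            start = None
    return spans

def decode(emit, offset=0, level=1):
    """Recursively extract nested mentions, outermost first. Emission
    scores are computed once and merely sliced for inner spans; the 1st
    level uses the best path, deeper levels the 2nd best path."""
    path = n_best(emit, 1)[0] if level == 1 else n_best(emit, 2)[-1]
    found = []
    for i, j in spans_of(path):
        found.append((offset + i, offset + j, level))
        if j - i > 1:  # single-token entities cannot nest further
            found += decode(emit[i:j], offset + i, level + 1)
    return found

# Illustration with random emission scores for an 8-token sentence:
mentions = decode(np.random.default_rng(1).normal(size=(8, len(TAGS))))
```

As argued above, 1-best decoding inside an extracted span would merely reproduce the parent tags, so taking the 2nd best path there is exactly what reveals the nested structure.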
2.3 Training
To extract entities from outside to inside successfully, a model has to be trained such that the scores of paths containing outer entities are higher than the scores of paths containing inner entities. We propose a new objective function that meets this requirement.
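The paper's Equations (2)–(5) are not reproduced in this excerpt. From the surrounding description, a plausible shape of the objective (our hedged reconstruction, with s(·) the path score, y(1) the gold path at the outermost level, y(2) the gold path one level deeper, and Z the summation of the exponential scores of all possible paths) is that each deeper gold path is normalized over the remaining paths, so that Z minus the exponential score of the best path appears as a denominator:

```latex
% Hedged reconstruction, not the paper's exact Equation (5):
\mathcal{L} \;=\; -\log \frac{\exp s\!\left(\mathbf{y}^{(1)}\right)}{Z}
\;-\; \log \frac{\exp s\!\left(\mathbf{y}^{(2)}\right)}{Z - \exp s\!\left(\mathbf{y}^{(1)}\right)}
\;-\; \cdots
```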
However, to the best of our knowledge, no way of efficiently computing the second term of Equation (5) has been proposed in the literature. Naively subtracting the exponential score of the best path from the summation of the exponential scores of all possible paths causes underflow, overflow, or loss of significant digits.
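A small illustration of these failure modes (the numbers are ours and purely illustrative):

```python
import numpy as np

# Overflow: log-scores for long sentences easily exceed log(DBL_MAX) ~ 709,
# so exponentiating before subtracting yields inf - inf.
log_Z, log_best = 750.0, 749.0                 # hypothetical log-scores
print(np.exp(log_Z) - np.exp(log_best))        # -> nan (inf - inf)

# Loss of significant digits: even in log space, when the best path carries
# almost all of the mass, log(exp(a) - exp(b)) with b -> a cancels badly.
a, b = 0.0, -1e-18
print(a + np.log1p(-np.exp(b - a)))            # -> -inf; true value ~ -41.4
```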
We introduce a way of computing it accurately, with the same time complexity as Algorithm 2 for Equation (4). For explanation, we use the simplified lattice depicted in Figure 3, in which the span length is 4 and the number of states is 3. Special nodes for the start and end states are attached to both ends of the span. There are 81 (= 3⁴) paths in this lattice. We assume that the path consisting of the top nodes of all time steps is the best path, as shown in Figure 3; no generality is lost by this assumption. To calculate the second term of Equation (5), we have to consider the exponential scores of all possible paths except the best path, i.e., 80 (= 81 − 1) paths.
We first describe a way of thinking that is not our algorithm itself but is helpful for understanding it. In the example, we can group these 80 paths according to the steps at which the best path is not taken. In this way, we have 4 spaces in total, as illustrated in Figure 4. In Space 1, the top node of time step 4 is excluded from consideration; 54 (= 3³ × 2) paths are taken into account here. Since this space covers all paths that do not go through the top node of time step 4, we only have to consider the paths that go through this node in the other spaces. In Space 2, this node is always passed through, and instead the top node of time step 3 is excluded; 18 (= 3² × 2) paths are considered in this space. Similarly, 6 (= 3¹ × 2) paths and 2 (= 3⁰ × 2) paths are taken into consideration in Space 3 and Space 4, respectively. Thus, we can consider all possible paths except the best path, 80 (= 54 + 18 + 6 + 2) paths in total. However, as mentioned, this is not yet our algorithm.
We introduce two tricks to make this calculation efficient. We explain them with Figure 5, which focuses on Spaces 2 and 3. The first trick is that the two separate spaces can be merged at time step 4 because the paths after time step 3 are identical. When we reach time step 4 in the forward iteration in each of the two spaces, we can merge them using the calculation results at time step 3, as shown with the red edges in Figure 5. The second trick is that the blue nodes in Figure 5 can be copied from Space 2 to Space 3 at time step 2, since the considered paths up to that time step are also the same. These two tricks can be applied to every other pair of adjacent spaces, which removes the need to calculate the summation of the exponential scores separately for each space. The second term of Equation (5) can therefore be calculated as shown in Algorithm 3.
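Below is a self-contained numpy sketch of this computation on the 81-path lattice of Figure 3 (our code, not the paper's Algorithm 3, but it implements the same decomposition: one ordinary forward pass, with one space per time step, grouped by the last step at which a path leaves the best path). The result is checked against brute-force enumeration.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
T, K = 4, 3                        # span length 4, 3 states: 3**4 = 81 paths
emit = rng.normal(size=(T, K))     # node scores
trans = rng.normal(size=(K, K))    # edge scores
start, end = rng.normal(size=K), rng.normal(size=K)

def lse(v):                        # log-sum-exp without under/overflow
    m = np.max(v)
    return m + np.log(np.sum(np.exp(np.asarray(v) - m)))

def score(p):
    s = start[p[0]] + end[p[-1]] + emit[np.arange(T), list(p)].sum()
    return s + sum(trans[p[t], p[t + 1]] for t in range(T - 1))

paths = list(product(range(K), repeat=T))
best = max(paths, key=score)
ref = lse([score(p) for p in paths if p != best])   # brute force, 80 paths

# tail[t]: score of the best path from step t through the end node.
tail = np.zeros(T)
tail[T - 1] = emit[T - 1, best[T - 1]] + end[best[T - 1]]
for t in range(T - 2, -1, -1):
    tail[t] = emit[t, best[t]] + trans[best[t], best[t + 1]] + tail[t + 1]

# One forward pass; the space for step t holds the paths that deviate from
# the best path at step t and follow it afterwards (t = T-1 is Space 1).
alpha = start + emit[0]
spaces = []
for t in range(T):
    nb = [j for j in range(K) if j != best[t]]      # deviating states
    if t == T - 1:
        spaces.append(lse([alpha[j] + end[j] for j in nb]))
    else:
        spaces.append(lse([alpha[j] + trans[j, best[t + 1]] for j in nb])
                      + tail[t + 1])
        alpha = np.array([lse(alpha + trans[:, j]) for j in range(K)]) \
                + emit[t + 1]

assert np.isclose(lse(spaces), ref)   # log of (sum over all paths - best)
```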
2.4 Characteristics
Time complexity. Regarding the time complexity of the decoder, the worst case for our method is when the decoder narrows down the spans one token at a time, from n tokens (a whole sentence) down to 2 tokens. The time complexity of the worst case is therefore O(n²) for each entity type, O(mn²) in total, where m denotes the number of entity types. However, this rarely happens. The ideal average processing time, in the case where our decoding method narrows down spans successfully according to the gold labels, is O(dmn), where d is the average number of gold IOBES tags of each entity type assigned to a word. The average numbers calculated from the gold labels of ACE-2004, ACE-2005, and GENIA are 1.06, 1.06, and 1.05, respectively.
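As a sanity check on the worst-case bound (our derivation): Viterbi decoding of a span of length k costs O(k) for a fixed tag set, and in the worst case one entity type re-decodes spans of every length from n down to 2:

```latex
\sum_{k=2}^{n} O(k) \;=\; O\!\left(\frac{n(n+1)}{2} - 1\right) \;=\; O(n^2),
\qquad\text{hence } O(mn^2) \text{ in total over } m \text{ entity types.}
```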
Usability. Some existing methods have hyperparameters, such as the maximal length of considered entities or the threshold that affects the number of detected entities, beyond those of the conventional CRF-based model used for flat NER tasks. These hyperparameters must be tuned depending on datasets. On the other hand, our method does not have such hyperparameters and is easy to use from this viewpoint. In addition, our method focuses on the output layer of neural architectures; therefore our method can be combined with any neural model.
We verify the empirical performance of our methods in the following sections.
3 Experimental Settings
3.1 Datasets
We perform nested entity extraction experiments intensively on ACE-2005 (Doddington et al., 2004) and GENIA (Kim et al., 2003). For ACE-2005, we use the same document splits as Lu and Roth (2015), published on their website.[6] For GENIA, we use GENIAcorpus3.02p,[7] in which sentences are already tokenized (Tateisi and Tsujii, 2004). Following previous work (Finkel and Manning, 2009; Lu and Roth, 2015), we first split off the last 10% of sentences as the test set. Next, we use the first 81% and the subsequent 9% as the training and development sets, respectively. We make the same modifications as described by Finkel and Manning (2009): collapsing all DNA, RNA, and protein subtypes into DNA, RNA, and protein, keeping cell line and cell type, and removing other entity types, resulting in 5 entity types. The statistics of each dataset are shown in Table 1.
Table 1: Statistics of each dataset.

| | ACE-2005 Train | (%) | ACE-2005 Dev | (%) | ACE-2005 Test | (%) | GENIA Train | (%) | GENIA Dev | (%) | GENIA Test | (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| # documents | 370 | | 43 | | 51 | | – | | – | | – | |
| # sentences | (7,285) | | (968) | | (1,058) | | 15,022 | | 1,669 | | 1,855 | |
| # mentions | 24,827 | | 3,234 | | 3,041 | | 47,027 | | 4,469 | | 5,600 | |
| - 1st level | 21,966 | (88) | 2,900 | (90) | 2,686 | (88) | 44,611 | (95) | 4,239 | (95) | 5,273 | (94) |
| - 2nd level | 2,635 | (11) | 316 | (10) | 323 | (11) | 2,393 | (5) | 230 | (5) | 327 | (6) |
| - 3rd level | 215 | (1) | 18 | (1) | 30 | (1) | 23 | (0) | 0 | (0) | 0 | (0) |
| - 4th level | 9 | (0) | 0 | (0) | 2 | (0) | 0 | (0) | 0 | (0) | 0 | (0) |
| # labels per token (d) | 1.06 | | 1.05 | | 1.05 | | 1.05 | | 1.05 | | 1.05 | |
3.2 Model and Training
In this study, we adopt a BiLSTM-CRF model as our baseline, which is widely used for NER tasks (Lample et al., 2016; Ma and Hovy, 2016; Chiu and Nichols, 2016; Reimers and Gurevych, 2017). We apply our usage of CRFs to this baseline. We prepare three types of models for fair comparisons with existing methods. The first is a model fed with conventional word embeddings and a CNN-based character-level representation (Ma and Hovy, 2016; Chiu and Nichols, 2016; Reimers and Gurevych, 2017).[8] We initialize the word embeddings with the pretrained 100-dimensional GloVe embeddings (Pennington et al., 2014) for ACE-2005. For GENIA, we instead adopt the pretrained embeddings trained on MEDLINE abstracts (Chiu et al., 2016). The initialized word embeddings are fixed during training. The word embeddings and the character-level representation are concatenated and then fed into the BiLSTM layer. The second is the model combined with the pretrained BERT model (Devlin et al., 2019).[9] We use the uncased version of the BERT large model as a contextual word embedding generator without fine-tuning and stack the BiLSTM layers on top of it. The third is the BiLSTM-CRF model fed with word embeddings, character-level representation, BERT embeddings, and FLAIR embeddings (Akbik et al., 2018), using the FLAIR framework (Akbik et al., 2019).[10] All our models have 2 BiLSTM hidden layers, and the dimensionality of each hidden unit is 256 in all our experiments. Table 2 lists the hyperparameters used in our experimental evaluations. We adopt AdaBound (Luo et al., 2019) as the optimizer. Early stopping is used based on performance on the development set. We repeat each experiment 5 times with different random seeds and report the average and standard deviation of F1-scores on the test set as the final performance.
Table 2: Hyperparameters used in our experiments.

| Hyperparameter | Value |
|---|---|
| word dropout rate | 0.05 |
| character embedding dimension | 128 |
| CNN window size | 3 |
| CNN filter number | 256 |
| batch size | 32 |
| LSTM hidden size | 256 |
| LSTM dropout rate | 0.2 (w/o BERT), 0.5 (w/ BERT) |
| gradient clipping | 5.0 |
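For concreteness, here is a minimal PyTorch sketch of the first baseline encoder described above, using the dimensions from Table 2. This is our illustration under the assumption of a PyTorch implementation; dropout, pretrained-embedding loading, and the per-type CRF layers of Section 2 are omitted.

```python
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    """Word embeddings + CNN character features -> 2-layer BiLSTM, emitting
    per-token scores for each entity type's 5 IOBES tags."""
    def __init__(self, n_words, n_chars, n_types,
                 word_dim=100, char_dim=128, filters=256, hidden=256):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, word_dim)
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.char_cnn = nn.Conv1d(char_dim, filters, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(word_dim + filters, hidden, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_types * 5)

    def forward(self, words, chars):
        # words: (batch, seq); chars: (batch, seq, max_word_len)
        B, S, L = chars.shape
        c = self.char_emb(chars).view(B * S, L, -1).transpose(1, 2)
        c = torch.relu(self.char_cnn(c)).max(dim=2).values.view(B, S, -1)
        h, _ = self.lstm(torch.cat([self.word_emb(words), c], dim=-1))
        return self.out(h).view(B, S, -1, 5)   # emission scores per type
```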
4 Experimental Results
4.1 Comparison with Existing Methods
Table 3 presents comparisons of our model with existing methods. Note that some existing methods use embeddings of POS tags as an additional input feature, whereas our method does not. When using only word embeddings and character-level representation, our method outperforms the existing methods, with F1-scores of 76.83% on ACE-2005 and 77.19% on GENIA. In particular, our method yields much higher recall than the other methods: the recall scores improve by 3.1% and 2.4% on the ACE-2005 and GENIA datasets, respectively. These results demonstrate that our training and decoding algorithms are quite effective for extracting nested entities. Moreover, when we use BERT and FLAIR as contextual word embeddings, we achieve an F1-score of 83.99% with BERT, and 84.34% with BERT and FLAIR, on ACE-2005. On the other hand, BERT does not perform well on GENIA. We assume that this is because the domain of GENIA is quite different from that of the corpus used for training the BERT model. Regardless, our method is demonstrated to perform better than or at least as well as existing methods.
Table 3: Comparison with existing methods on ACE-2005 and GENIA.

| Method | ACE-2005 Precision (%) | ACE-2005 Recall (%) | ACE-2005 F1 (%) | GENIA Precision (%) | GENIA Recall (%) | GENIA F1 (%) |
|---|---|---|---|---|---|---|
| Katiyar and Cardie (2018) | 70.6 | 70.4 | 70.5 | 79.8 | 68.2 | 73.6 |
| Ju et al. (2018)[11] | 74.2 | 70.3 | 72.2 | 78.5 | 71.3 | 74.7 |
| Wang et al. (2018)†[12] | 74.5 | 71.5 | 73.0 | 78.0 | 70.2 | 73.9 |
| Wang and Lu (2018)† | 76.8 | 72.3 | 74.5 | 77.0 | 73.3 | 75.1 |
| Sohrab and Miwa (2018) | – | – | – | 93.2 | 64.0 | 77.1 |
| Zheng et al. (2019) | – | – | – | 75.9 | 73.6 | 74.7 |
| Fisher and Vlachos (2019) | 75.1 | 74.1 | 74.6 | – | – | – |
| Lin et al. (2019)† | 76.2 | 73.6 | 74.9 | 75.8 | 73.9 | 74.8 |
| Straková et al. (2019)†[13] | 76.35 | 74.39 | 75.36 | 79.60 | 73.53 | 76.44 |
| This work | 78.27 ± 0.81 | 75.44 ± 0.37 | 76.83 ± 0.36 | 78.70 ± 0.69 | 75.74 ± 0.64 | 77.19 ± 0.10 |
| Fisher and Vlachos (2019) [BERT] | 82.7 | 82.1 | 82.4 | – | – | – |
| Straková et al. (2019) [BERT]† | 82.58 | 84.29 | 83.42 | 79.92 | 76.55 | 78.20 |
| This work [BERT] | 83.30 ± 0.22 | 84.69 ± 0.37 | 83.99 ± 0.27 | 77.46 ± 0.65 | 76.65 ± 0.58 | 77.05 ± 0.12 |
| Straková et al. (2019) [BERT+FLAIR]† | 83.48 | 85.21 | 84.33 | 80.11 | 76.60 | 78.31 |
| This work [BERT+FLAIR] | 83.83 ± 0.39 | 84.87 ± 0.09 | 84.34 ± 0.20 | 77.81 ± 0.69 | 76.94 ± 1.12 | 77.36 ± 0.26 |
4.2 Ablation Study
We conduct an ablation study to verify the effectiveness of our learning and decoding methods. We first replace our objective function for training with the standard objective function of the linear-chain CRF. Methods for decoding N-best paths have been well studied because such algorithms are required in many domains (Soong and Huang, 1990; Kaji et al., 2010; Huang et al., 2012), but we hypothesize that our learning method, as well as our decoding method, helps to improve performance. That is why we first remove only our learning method. Then, we also replace our decoding algorithm with the standard decoding algorithm of the linear-chain CRF, which is equivalent to preparing a conventional CRF for each entity type separately.
The results are shown in Table 4. They demonstrate that introducing only our decoding algorithm yields high recall but hurts precision, which suggests that our learning method is necessary for achieving high precision. Conversely, removing the decoding algorithm lowers recall. This is natural because the model then makes no attempt to find nested entities after extracting the outermost ones. Thus, both our learning and decoding algorithms contribute substantially to the overall performance.
Table 4: Ablation study. “– L” removes our learning method; “– L&D” removes both our learning and decoding methods.

| | ACE-2005 Precision (%) | ACE-2005 Recall (%) | ACE-2005 F1 (%) | GENIA Precision (%) | GENIA Recall (%) | GENIA F1 (%) |
|---|---|---|---|---|---|---|
| This work | 78.27 ± 0.81 | 75.44 ± 0.37 | 76.83 ± 0.36 | 78.70 ± 0.69 | 75.74 ± 0.64 | 77.19 ± 0.10 |
| – L | 60.89 ± 1.30 | 75.38 ± 1.27 | 67.34 ± 0.37 | 70.72 ± 0.39 | 79.20 ± 1.27 | 74.71 ± 0.18 |
| – L&D | 77.77 ± 0.31 | 67.42 ± 0.29 | 72.22 ± 0.13 | 79.70 ± 0.56 | 73.41 ± 0.35 | 76.43 ± 0.28 |
4.3 Analysis of Behavior
To further understand how our method handles nested entities, we investigate the performance for entities at each level. Table 5 shows the recall scores for gold entities at each level when using conventional word embeddings. Among all levels, our model performs best at the 1st level, which consists of only gold outermost entities; the deeper the level, the lower the recall. Table 6, in turn, shows the precision scores for predicted entities at each level from one trial on each dataset. Because the number of levels in the predictions varies between trials, taking the macro average of precision scores over multiple trials is not representative, so we report only the precision scores from one trial. Precision is less dependent on the level: the precision at the 5th level on ACE-2005 is as high as or higher than those at the other levels. This tendency also appears in the other trials.
Table 5: Recall for gold entities at each level (conventional word embeddings).

| Level | ACE-2005 Recall (%) | Num. | GENIA Recall (%) | Num. |
|---|---|---|---|---|
| 1st | 76.10 ± 0.50 | 2,686 | 77.92 ± 0.72 | 5,273 |
| 2nd | 71.70 ± 0.70 | 323 | 40.61 ± 1.74 | 327 |
| 3rd | 58.00 ± 5.42 | 30 | – | 0 |
| 4th | 50.00 ± 0.00 | 2 | – | 0 |
Table 6: Precision for predicted entities at each level (one trial on each dataset).

| Level | ACE-2005 Precision (%) | Num. | GENIA Precision (%) | Num. |
|---|---|---|---|---|
| 1st | 80.36 | 2,500 | 80.29 | 5,038 |
| 2nd | 72.35 | 311 | 57.06 | 326 |
| 3rd | 79.07 | 43 | 66.67 | 3 |
| 4th | 66.67 | 9 | – | 0 |
| 5th | 83.33 | 6 | – | 0 |
In addition, we compare the tendency of our method with that of an existing method. We select the method of Wang and Lu (2018) for comparison.[14] We train their model on the ACE-2005 dataset using their original implementation and repeat this 5 times. The recall scores from the 1st level to the 4th level are 66.52%, 65.34%, 42.14%, and 50.00%, respectively. The tendency across levels is common to their method and ours, but the scores from our method (Table 5) are consistently higher than theirs, demonstrating that our method extracts both outer and inner entities better. On the other hand, their method can extract crossing entities (two entities overlap but neither is contained in the other), whereas ours cannot; indeed, their model outputs some crossing spans in our experiments. In this case, we cannot analyze the results with respect to precision in the same manner as Table 6, since one cannot always uniquely decide the level of a span nested within multiple crossing spans. Regardless, our method cannot handle crossing entities. However, crossing entities are very rare (Lu and Roth, 2015; Wang et al., 2018); the test sets of ACE-2005 and GENIA contain no crossing entities. This property of our method thus has no negative impact on performance, at least on the ACE-2005 and GENIA datasets.
4.4 Error Analysis
We manually scan the test set predictions on ACE-2005. We find that many of the errors can be classified into two types.
The first type is the partial prediction error. Consider the following sentence: “Let me set aside the hypocrisy of a man who became president because of a lawsuit trying to eliminate everybody else’s lawsuits, but instead focus on his own experience”. The annotation marks “a man who became president because of a lawsuit”, but our model extracts a shorter or longer span. It is difficult to extract the proper spans of clauses that contain numerous modifiers.
The second type is the error derived from pronominal mentions. Consider the following example: “They roar, they screech.”. These “They”s refer to “tanks” in another sentence of the same document and are indeed annotated as VEH (Vehicle). Our model fails to detect these pronominal mentions or wrongly labels them as PER (Person). Document context would need to be taken into consideration to solve this problem.
4.5 Running Time
We investigate how our recursive decoding method impacts decoding speed in terms of the number of words processed per second. We use the model trained on ACE-2005 used for Table 6 and vary the maximal depth of decoding over 1, 2, 3, 4, 5, and ∞. When the maximal depth is n, our decoder Viterbi-decodes only from the 1st level to the n-th level. Note that, when the maximal depth is 1, the decoding process is exactly the Viterbi decoding of the standard CRF. We run these settings on an Intel i7 (2.7 GHz) CPU.
Results are listed in Table 7. The number of words processed per second decreases by 38% when the maximal depth goes from 1 to 2. There are two main reasons for this. First, our decoder needs extra processing for moving across different levels, which is unnecessary when the maximal depth is 1. Second, the number of extracted spans at the 2nd level is not negligible (12.5% of the number of extracted spans at the 1st level, as shown in Table 6). The numbers of extracted spans at the 3rd and deeper levels are small, so the throughput does not decrease much when the maximal depth increases beyond 2. Even so, our decoder takes less than twice as long as the standard CRF on ACE-2005.
4.6 Comparison on ACE-2004
We also compare our method with existing methods on the ACE-2004 dataset. We use the same splits as Lu and Roth (2015). The setups are the same as those of our experiment on ACE-2005. Table 8 shows the results: our method significantly outperforms existing methods. Note that most of them use POS tags as an additional input feature whereas our method does not.
Table 8: Comparison with existing methods on ACE-2004.

| Method | P (%) | R (%) | F1 (%) |
|---|---|---|---|
| Katiyar and Cardie (2018) | 72.3 | 66.8 | 69.7 |
| Wang et al. (2018)†[15] | 74.9 | 71.8 | 73.3 |
| Wang and Lu (2018)† | 78.0 | 72.4 | 75.1 |
| Straková et al. (2019)†[16] | 78.92 | 75.33 | 77.08 |
| This work | 79.93 | 75.10 | 77.44 |
| Straková et al. (2019) [BERT]† | 84.71 | 83.96 | 84.33 |
| This work [BERT] | 85.23 | 84.72 | 84.97 |
| Straková et al. (2019) [BERT+FLAIR]† | 84.51 | 84.29 | 84.40 |
| This work [BERT+FLAIR] | 85.94 | 85.69 | 85.82 |
4.7 Flat NER
To assess how our model works on the flat NER task, we additionally evaluate it on CoNLL-2003 (Tjong Kim Sang and De Meulder, 2003), which is annotated with outermost entities only. The setups here are the same as those of our experiment on ACE-2005. We prepare not only our proposed model but also the ablated model without our training and decoding methods, as in Section 4.2. The former can extract spans nested within other extracted spans regardless of the properties of the dataset, whereas the latter never extracts spans within other extracted spans. We use the 100-dimensional GloVe embeddings for both models, as in our previous experiments.
The results are in Table 9. We compare our method with existing methods that do not adopt any contextual word embeddings (the upper part of Table 9), although we also show results from recent work with contextual word embeddings for reference. First, in comparison with the methods designed for nested NER (Wang and Lu, 2018; Straková et al., 2019), our method performs better even on CoNLL-2003. This means that our method works well not only on nested NER but also on flat NER. Next, we compare with methods that can handle only flat NER. Table 9 shows that our method is comparable to the standard BiLSTM-CRF models (Lample et al., 2016; Ma and Hovy, 2016) on CoNLL-2003, although there are some differences between the experiments of those studies and ours; for example, different word embeddings are used, and the LSTM hidden sizes are not aligned. Nevertheless, we can directly compare our proposed model with the ablated model. As shown in Table 9, there is a significant gap (p < 0.005 with the permutation test) between the two scores, 91.14 (±0.04)% and 90.84 (±0.10)%. Analyzing this gap in detail, we find that our proposed model performs well especially in cases where it is difficult to decide which span is suitable, an inner one or an outer one. Consider the following sentence: “An assessment group made up of the State Council’s Port Office, the Civil Aviation Administration of China, the General Administration of Customs and other authorities had granted the airport permission to handle foreign aircraft, Xinhua said .”. In the CoNLL-2003 dataset, the four spans “State Council”, “Civil Aviation Administration of China”, “General Administration of Customs”, and “Xinhua” are annotated as ORG (Organization). Both models correctly detect the latter three entities in most trials, but the ablated model tends to extract “State Council ’s Port Office” instead of “State Council”. Our proposed model, in contrast, tends to extract both “State Council ’s Port Office” and “State Council”. “State Council ’s Port Office” is indeed a false positive, but our model detects the correct entity span “State Council” more reliably than the ablated model. Thus, our proposed model achieves the higher F1-score.
Table 9: F1-scores on CoNLL-2003.

| Method | F1 (%) |
|---|---|
| Wang and Lu (2018)† | 90.5 |
| Straková et al. (2019)† | 90.77 |
| This work | 91.14 ± 0.04 |
| Lample et al. (2016)‡ | 90.94 |
| Ma and Hovy (2016)‡ | 91.21 |
| Liu et al. (2019)‡ | 91.96 ± 0.04 |
| This work − L&D‡ | 90.84 ± 0.10 |
| Devlin et al. (2019)‡ | 92.80 |
| Akbik et al. (2018)‡ | 93.09 ± 0.12 |
| Liu et al. (2019)‡ | 93.47 ± 0.03 |
| Jiang et al. (2019)‡ | 93.47 |
| Baevski et al. (2019)‡ | 93.5 |
Recently, Liu et al. (2019) proposed a new architecture for sequence labeling that can capture global information at the sentence level better than a BiLSTM, reporting an F1-score of 91.96% with conventional word embeddings (93.47% with BERT). Our BiLSTM-based model does not perform as well as their model, but our decoder can be combined with their encoder. We leave this for future work.
5 Related Work
Alex et al. (2007) proposed several ways to combine multiple CRFs for nested NER. They found that cascading separate CRFs for each entity type, using the output of the previous CRF as input features of the current one, yielded the best performance. However, their method could not handle nested entities of the same entity type. In contrast, Ju et al. (2018) dynamically stacked multiple layers that recognize entities sequentially from innermost ones to outermost ones; their method can deal with nested entities of the same entity type.
Finkel and Manning (2009) proposed a CRF-based constituency parser for this task, in which each named entity is a node in the parse tree. However, its time complexity is cubic in the length of a given sentence, making it not scalable to large datasets involving long sentences. Later, Wang et al. (2018) proposed a scalable transition-based approach that constructs a constituency forest (a collection of constituency trees); its time complexity is linear in the sentence length.
Lu and Roth (2015) introduced a mention hypergraph representation for capturing nested entities as well as crossing entities (two entities overlap but neither is contained in the other). One issue in their approach is the spurious structures of the representation. Muis and Lu (2017) incorporated mention separators to address the spurious-structures issue, but their approach still suffers from structural ambiguity. Wang and Lu (2018) proposed a hypergraph representation free of structural ambiguity. However, they introduced a hyperparameter, the maximal length of an entity, to reduce the time complexity; setting it to a small number speeds up the model but causes longer entity mentions to be ignored.
Katiyar and Cardie (2018) proposed another hypergraph-based approach that learns the structure using an LSTM network in a greedy manner. However, their method has a hyperparameter that sets a threshold for selecting multiple candidate mentions, which must be carefully tuned to adjust the trade-off between recall and precision.
Sohrab and Miwa (2018) proposed a neural exhaustive model that enumerates all possible spans as potential entity mentions and classifies them. However, they also use the maximal-length hyperparameter to reduce time complexity.
Fisher and Vlachos (2019) proposed a novel neural network architecture that merges tokens or entities into entities forming nested structures and then labels each of them. Their architecture, however, requires a maximal-nesting-level hyperparameter. Lin et al. (2019) proposed a sequence-to-nuggets architecture that first identifies anchor words of all mentions and then recognizes the mention boundaries for each anchor word. Their method also uses the maximal-length hyperparameter to reduce time complexity.
Straková et al. (2019) proposed an encoding algorithm that allows modeling multiple named entity labels in a linearized scheme, together with a neural model that predicts sequential labels for each token. Zheng et al. (2019) proposed a method that detects entity boundaries with sequence labeling models. These two methods do not require special hyperparameters, and in contrast to our method they can also deal with crossing entities in addition to nested entities. However, our experiments demonstrate that our method performs well because crossing entities are very rare (Lu and Roth, 2015; Wang et al., 2018).
6 Conclusion
We propose learning and decoding methods for extracting nested entities. Our decoding method iteratively recognizes entities from outermost ones to inner ones in an outside-to-inside way. It recursively searches the span of each extracted entity for nested entities with second-best sequence decoding. We also design an objective function for training that ensures this decoding works as intended. Our method has no hyperparameters beyond those of conventional CRF-based models. It achieves F1-scores of 85.82%, 84.34%, and 77.36% on the ACE-2004, ACE-2005, and GENIA datasets, respectively.
For future work, one interesting direction is joint modeling of NER with entity linking or coreference resolution. Previous studies (Durrett and Klein, 2014; Luo et al., 2015; Nguyen et al., 2016; Martins et al., 2019) demonstrated that leveraging the mutual dependencies of the NER, linking, and coreference tasks could boost the performance of each. We would like to address this issue while taking nested entities into account.
Acknowledgments
We thank Aldrian Obaja Muis for helpful comments, and many anonymous reviewers and the action editor for helpful feedback on various drafts of the paper. We are also grateful to Jana Straková for sharing experimental results. Eduard Hovy was supported in part by DARPA grant FA8750-18-2-0018 funded under the AIDA program.
Notes
[3] Our usage of inside/outside is different from the inside-outside algorithm in dynamic programming.

[4] Without our restriction on the transition matrix of the CRF, we would have to watch both the best path and the 2nd best path. Besides, if a single CRF were used for all entity types, the decoder could not always narrow down spans with the 2nd best path: the 2nd best path in a single CRF could result in the same span tagged with a different entity type, so we would have to watch lower-ranked paths.

[5] We do not need to recursively decode the span of each extracted single-token entity because a single-token entity cannot contain another entity of the same entity type.

[13] Straková et al. (2019) did not report precision and recall scores in their paper. We requested this information from the authors, and they provided their score data.

[14] We do not use POS tags as input features, for a fair comparison with our method.

[16] Straková et al. (2019) did not report precision and recall scores in their paper. We requested this information from the authors, and they provided their score data.