Nested Named Entity Recognition via Second-best Sequence Learning and Decoding

When an entity name contains other names within it, the identification of all combinations of names can become difficult and expensive. We propose a new method to recognize not only outermost named entities but also inner nested ones. We design an objective function for training a neural model that treats the tag sequence for nested entities as the second best path within the span of their parent entity. In addition, we provide the decoding method for inference that extracts entities iteratively from outermost ones to inner ones in an outside-to-inside way. Our method has no additional hyperparameters to the conditional random field based model widely used for flat named entity recognition tasks. Experiments demonstrate that our method outperforms existing methods capable of handling nested entities, achieving the F1-scores of 82.81%, 82.70%, and 77.25% on ACE-2004, ACE-2005, and GENIA datasets, respectively.


Introduction
Named entity recognition (NER) is the task of identifying text spans associated with proper names and classifying them according to their semantic class such as person or organization. NER, or in general the task of recognizing entity mentions, is one of the first stages in deep language understanding and its importance has been well recognized in the NLP community (Nadeau and Sekine, 2007).
One popular approach to the task of NER is to regard it as a sequence labeling problem. In this case, it is implicitly assumed that mentions are not nested in texts. However, names often contain entities nested within themselves, as illustrated in Fig. 1, which contains 3 mentions of the same type (PROTEIN) in the span "... in Ca2+ -dependent PKC isoforms in ...", taken from the GENIA dataset (Kim et al., 2003). Name nesting is common, especially in technical domains (Alex et al., 2007;Byrne, 2007;Wang, 2009). The assumption of no nesting leads to loss of potentially important information and may negatively impact subsequent downstream tasks. For instance, a downstream entity linking system that relies on NER may fail to link the correct entity if the entity mention is nested.
Various approaches to recognizing nested entities have been proposed. Many of them rely on producing and rating all possible (sub)spans, which can be computationally expensive.  provided a hypergraph-based approach to consider all possible spans. Sohrab and Miwa (2018) proposed a neural exhaustive model that enumerates and classifies all possible spans. These methods, however, achieve high performance at the cost of time complexity. To reduce the running time, they set a threshold to discard longer entity mentions. If the hyperparameter is set low, running time is reduced but longer mentions are missed. In contrast, Muis and Lu (2017) proposed a sequence labeling approach that assigns tags to gaps between words, which efficiently handles sequences using Viterbi decoding. However, this approach suffers from structural ambiguity issues during inference as explained by . Katiyar and Cardie (2018) proposed another hypergraph-based approach that learns the structure in a greedy manner. However, their method uses an additional hyperparameter as the threshold for selecting multiple mention can-didates. This hyperparameter affects the trade-off between recall and precision.
In this paper, we propose new learning and decoding methods to extract nested entities without any additional hyperparameters. We summarize our contributions as: • We describe a decoding method that iteratively recognizes entities from outermost ones to inner ones without structural ambiguity. It recursively searches a span of each extracted entity for inner nested entities using the Viterbi algorithm. This algorithm does not require hyperparameters for the maximal length or number of mentions considered.
• We also provide a novel learning method that ensures the aforementioned decoding. Models are optimized based on an objective function designed according to the decoding procedure.

Related Work
The success of neural networks has boosted the performance of NER (Strubell et al., 2017;Akbik et al., 2018). However, few attempts have explored nested entity recognition. Alex et al. (2007) proposed several ways to combine multiple CRFs for such tasks. They separately built inside-out and outside-in layered CRFs that used the current guesses as the input for the next layer. They also cascaded separate CRFs of each entity type by using the output from the previous CRF as the input features of the current CRF, yielding best performance in their work. However, their method could not handle nested entities of the same entity type. In contrast, Ju et al. (2018) dynamically stacked multiple layers that recognize entities sequentially from innermost ones to outermost ones. Their method can deal with nested entities of the same entity type.  Finkel and Manning (2009) proposed a CRFbased constituency parser for this task such that each named entity is a node in the parse tree. Their model achieved state-of-the-art performance on the GENIA dataset of the time. However, its time complexity is the cube of the length of a given sentence, making it not scalable to large datasets involving long sentences. Later on,  proposed a scalable transition-based approach, a constituency forest (a collection of constituency trees). Its time complexity is linear in the sentence length. Lu and Roth (2015) introduced a mention hypergraph representation for capturing nested entities as well as crossing entities (two entities overlap but neither is contained in the other). One issue in their approach is the spurious structures of the representation. To address the spurious structures issue, Muis and Lu (2017) incorporated mention separators to yield better performance on nested entities. On the other hand, it still suffers from the structural ambiguity issue.  proposed a hypergraph representation free of structural ambiguity, and their method achieved state-of-the-art performance on the ACE-2004 and ACE-2005 datasets. However, they introduced a hyperparameter, the maximal length of an entity, to reduce the time complexity. Setting the hyperparameter to a small number results in speeding up but ignoring longer entity segments. Katiyar and Cardie (2018) proposed another hypergraph-based approach that learns the structure using an LSTM network in a greedy manner. However, their method has a hyperparameter that sets a threshold for selecting multiple candidate mentions. It must be carefully tuned for adjusting the trade-off between recall and precision.
Sohrab and Miwa (2018) proposed a neural exhaustive model that enumerates all possible spans as potential entity mentions and classifies them. Their model achieved state-of-the-art performance on the GENIA dataset in terms of F1-score. However, they also use the maximal-length hyperparameter to reduce time complexity.
We show that our model outperforms all these methods on three datasets: ACE-2004, ACE-2005, and GENIA. In addition, our model has no additional hyperparameters to the conventional CRFbased model commonly used for flat NER. Also, our decoding algorithm is fast in practice as shown in Subsection 5.4. Figure 2: Overview of our second-best path decoding algorithm to iteratively find nested entities.

Method
We first explain our usage of CRF, which is the base of our decoding and training methods. Then, we introduce our decoding and training methods. Our decoding and training methods focus on the output layer of neural architectures and therefore can be combined with any neural model.

Usage of CRF
We first explain two key points about our usage of CRF. The first key point is that we prepare a separate CRF for each named entity type. This enables our method to handle the situation where the same mention span is assigned multiple entity types. The second one is that each element of the transition matrix of each CRF has a fixed value according to whether it corresponds to a legal transition (e.g., B-XXX to I-XXX, where XXX is the name of entity type) or an illegal one (e.g., O to I-XXX). This is helpful for keeping the scores for tag sequences including outer entities higher than those of tag sequences including inner entities.
Formally, we use Z = {z 1 , . . . , z n } to represent a sequence output from the last hidden layer of a neural model, where z i is the vector for the i-th word, and n is the number of tokens. y (k) = {y (k) 1 , . . . , y (k) n } represents a sequence of IOBES tags of entity type k for Z. Here, we define the score function to be denote the weight matrix and the bias vector corresponding to y (k) i , respectively. A (k) stands for the transition matrix from the previous token to the current token, and A

Decoding
We employ three strategies for decoding. First, we consider each entity type separately using multiple CRFs in decoding, which makes it possible to handle the situation that the same mention span is assigned multiple entity types. Second, our decoder searches nested entities in an outside-to-inside way 3 , which realizes efficient processing by eliminating the spans of non-entity at an early stage. More specifically, our method recursively narrows down the spans to Viterbi-decode. The spans to Viterbi-decode are dynamically decided according to the preceding Viterbi-decoding result. Only the spans that have just been recognized as entity mentions are Viterbi-decoded again. Third, we use the same scores φ k y i , z i of Eq. 1 to extract outermost entities and even inner entities without re-encoding, which makes inference more efficient and faster. These three strategies are deployed and completed only in the output layer of neural architectures.
We describe the pseudo-code of our decoding method in Algorithm 1. Also, we depict the overview of our decoding method with an example in Fig. 2 . We use the term level in the sense of the depth of entity nesting.
[S] and [E] in Fig. 2 stand for the START and END tags respectively. We always attach these tags to both ends of every Algorithm 1: Nested NER via 2nd-best sequence decoding K = the set of the entity types; Function main(z i ) M = {}; # the set of detected mentions. Each element of M is a tuple (s, e, k) regarding a mention. # s, e, and k are the start position, the end position, and the entity type of the mention, respectively. foreach k ∈ K do calculate CRF scores Φ for entity type k with the score function φ k y sequence of IOBES tags in Viterbi-decoding. We explain the decoding procedure and mechanism in detail below. We consider each entity type separately and iterate the same decoding process regarding distinct entity types as described in Algorithm 1. In the decoding process for each entity type k, we first calculate the CRF scores i , z i over the entire sentence. Next, we decode a sequence with the standard 1-best Viterbi decoding as with the conventional linearchain CRF. "Ca2+ -dependent PKC isoforms" is extracted at the 1st level with regard to the example of Fig. 2.
Then, we start our recursive decoding to extract nested entities within previously extracted entity spans by finding the 2nd best path. In Fig. 2, the span "Ca2+ -dependent PKC isoforms" is processed at the 2nd level. Here, if we search for the best path within each span, the same tag sequence will be obtained, even though the processed span is different. This is because we continue using the same scores φ k y i , z i and because all the values of A (k) corresponding to legal transitions are equal to 0. Regarding the example of Fig. 2, the score of the transition from [S] to B-P at the 2nd level is equal to the score of the transition from O to B-P at the 1st level. This is true for the transition from E-P to [E] at the 2nd level and the one from E-P to O at the 1st level. The best path between the [S] and [E] tags is identical to the best path between the two O tags under our restriction about the transition matrix of CRF. Therefore, we search for the 2nd best path within the span by utilizing the N -best Viterbi A* algorithm (Seshadri and Sundberg, 1994;Huang et al., 2012). 4 Note that our situation is different from normal situations where N -best decoding is needed. We already know the best path within the span and want to find only the 2nd best path. Thus, we can extract nested entities by finding the 2nd best path within each extracted entity. Regarding the example of Fig. 2, "PKC isoforms" is extracted from the span "Ca2+ -dependent PKC isoforms" at the 2nd level.
We continue this recursive decoding until no entities are predicted or when only single-token entities are detected within a span. In Fig. 2, the span "PKC isoforms" is processed at the 3rd level. At the 3rd or deeper levels, the tag sequence of its grandparent level is no longer either the best path or the 2nd best path because the start or end po-all possible paths C = {B-XXX, I-XXX, E-XXX, S-XXX, O}; s = 1; # the start position e = n; # the end position sition of the current span is in the middle of the entity mention at the grandparent level. As for the example shown in Fig. 2, "PKC" is tagged I-P at the 1st level, and the transition from [S] to I-P is illegal. The scores of the paths that includes illegal transitions cannot be larger than those of the paths that consist of only legal transitions because the elements of the transition matrix A (k) corresponding to illegal transitions are set to −∞. That is why at all levels below the 1st level we only need to find the 2nd best path. This recursive processing is stopped when no entities are predicted or when only single-token entities are detected within a span. In Fig. 2, the span "PKC" is not processed any more because it is a single-token entity. The aforementioned processing is executed on all entity types, and all detected entities are returned as an output result.

Training
To extract entities from outside to inside successfully, a model has to be trained in a way that the scores for the paths including outer entities will be higher than those for the paths including inner entities. We propose a new objective function to achieve this requirement.
We maximize the log-likelihood of the correct tag sequence as with the conventional CRF-based model. Considering that our model has a separate CRF for each entity type, the log-likelihood for one training data, L (θ), is as follows: where θ is the set of parameters of a neural model, and Y (k) denotes the collection of the gold IOBES tags for all levels regarding the entity type k. Let s (k) l,j and e (k) l,j denote the start and end positions of the j-th span at the l-th level. With regard to the 1st level, s 1,1 = 1 and e (k) 1,1 = n because we consider the whole span of a sentence. The spans considered at each deeper level, l > 1, are determined according to the spans of multitoken entities at its immediate parent level. As for the example of Fig. 2, only the span of "Ca2+dependent PKC isoforms" is considered at the 2nd level. Here, the log-likelihood for each entity type can be expressed as follows: where L 1st (. . . ) and L 2nd (. . . ) are the loglikelihoods of the (1st) best and 2nd best paths for each span, respectively. y s:e denotes the set of all possible tag sequences from position s to position e of the entity type k. It is well known that the second term of Eq. 4 can be efficiently calculated by the algorithm shown in Algorithm 2. 2nd best path. L 2nd (. . . ) given the best path can be calculated by excluding the best path from all  possible paths. This concept is also adopted by ListNet (Cao et al., 2007), which is used for ranking tasks such as document retrieval or recommendation. L 2nd (. . . ) can be expressed by the following equation: whereỸ (k) s:e denotes the set of all possible tag sequences except the best path within the span from position s to position e of the entity type k.
However, to the best of our knowledge, the way of efficiently computing the second term of Eq. 5 has not been proposed yet in the literature. Simply subtracting the exponential score of the best path from the summation of the exponential scores of all possible paths causes underflow, overflow, or loss of significant digits. We introduce the way of accurately computing it with the same time complexity as Eq. 4. For explanation, we use the simplified example of the lattice depicted in Fig. 4, in which the span length is 4 and the number of states is 3. The special nodes for start and end states are attached to the both ends of the span. There are 81(= 3 4 ) paths in this lattice. We assume that the path that consists of top nodes of all time steps are the best path as shown in Fig. 4. No generality is lost by making this assumption. To calculate the second term of Eq. 5, we have to consider the exponential scores for all the possible paths except the best path, 80(= 81 − 1) paths.
We first give a way of thinking, which is not our algorithm itself but helpful to understand it. In the example, we can further group these 80 paths according to the steps where the best path is not taken. In this way, we have 4 spaces in total as illustrated in Fig. 3. In Space 1, the top node of time step 4 is excluded from consideration. 54(= 3 3 × 2) paths are taken into account here. Since this space covers all paths that do not go through the top node of time step 4, we only have to consider the paths that go through this node in other spaces. In Space 2, this node is always passed through, and instead the top node of time step 3 is excluded. 18(= 3 2 × 2) paths are considered in this space. Similarly, 6(= 3 1 × 2) paths and 2(= 3 0 × 2) paths are taken into consideration in Space 3 and Space 4, respectively. Thus, we can consider all the possible paths except the best path, 80(= 54 + 18 + 6 + 2) paths. However, this is not our algorithm itself as we mentioned.
We introduce two tricks for making the calculation more efficient. We explain them with Fig. 5, in which Spaces 2 and 3 are picked up. The first trick is that the separated two spaces can be merged at time step 4 because the paths later than time step 3 are identical. When we reach time step 4 in the forward iteration in each of the two spaces, we can merge them using the calculation results at time step 3, as shown with the red edges in Fig. 5. The second trick is that the blue nodes in Fig. 5 can be copied from Space 2 to Space 3 at time step 2 since the considered paths until that time step are also the same. These two tricks can be applied to other pairs of two adjacent spaces, which relieves the need to separately calculate the summation of the exponential scores for each space. Therefore, the second term of Eq. 5 can be calculated as shown in Algorithm 3.
Thus, we can train a model using the objective function of Eqs. 2, 3, 4, and 5.

Characteristics
We explain some theoretical characteristics of our method here. Regarding the time complexity of decoder, the worst case for our method is when our decoder narrows down the spans one by one, from n tokens (a whole sentence) to 2 tokens. The time complexity for the worst case is therefore O (n + · · · + 2) = O n 2 for each entity type, O mn 2 in total, where m denotes the number of entity types. However, this rarely happens. The average processing time in the case where our de-coding method narrows down spans successfully according to gold labels is O (dmn), where d is the average number of gold IOBES tags of each entity type assigned to a word. The average numbers calculated from the gold labels of ACE-2004, ACE-2005, and GENIA are 1.06, 1.06, and 1.05, respectively. For these datasets, the expected processing time is nearly equal to O (mn).
Further, our method focuses on the output layer of neural architectures; therefore our method can be combined with any neural model. While some existing methods have hyperparameters beyond those of the conventional CRF-based model used for flat NER tasks, our method does not. 's method and Sohrab and Miwa (2018)'s method use a hyperparameter for the maximal length of considered entities. Katiyar and Cardie (2018)'s method uses a threshold, which affects the number of detected entities. These hyperparameters must be tuned depending on datasets. Our method is easy to use from this viewpoint.
We verify the empirical performances of our methods in the successive sections.

Datasets
We perform nested entity extraction experiments on ACE-2005(Doddington et al., 2004 and GE-NIA (Kim et al., 2003). For ACE-2005, we use the same splits of documents as Lu and Roth (2015), published on their website 5 . For GENIA, we use GENIAcorpus3.02p 6 in which sentences are already tokenized (Tateisi and Tsujii, 2004). Following previous works (Finkel and Manning, 2009;Lu and Roth, 2015), we first split the last 10% of sentences as the test set. Next, we use the first 81% and the subsequent 9% for training and development sets, respectively. We make the same modifications as described by Finkel and Manning (2009) by collapsing all DNA, RNA, and protein subtypes into DNA, RNA, and protein, keeping cell line and cell type, and removing other entity types, resulting in 5 entity types. The statistics of each dataset are shown in Table 1.

Model and Training
In this study, we adopt as baseline a BiLSTM-CRF model, which is widely used for NER tasks (Lample et al., 2016;Chiu and Nichols, 2016;Ma and Hovy, 2016). We apply our usage of CRF to this baseline. We prepare two types of models for our experiments. The first one is the model combined with conventional word embeddings, 7 which is prepared for fairly comparing our method with existing methods. We initialize word embeddings with the pretrained embeddings GloVe (Pennington et al., 2014) of dimension 100 in ACE-2005. For GENIA, we initialize each word with the pretrained embeddings trained on MEDLINE abstracts . The initialized word embeddings are fixed during training. The words that do not appear in embedding vocabulary but appear in the training set are initialized with zero vectors and tuned during training. We also adopt a CNN-based character-level representation (Chiu and Nichols, 2016;Ma and Hovy, 2016). The vectors of the word embeddings and the characterlevel representation are concatenated and then input into the BiLSTM layer. The second model is the model combined with the pretrained BERT 7 https://github.com/yahshibu/ nested-ner-2019 model (Devlin et al., 2018). 8 We use the cased version of BERT large model as a contextual word embeddings generator without fine-tuning and stack the BiLSTM layers on top of the BERT model. Both models have 2 BiLSTM hidden layers, and the dimensionality of each hidden unit is 200 in all our experiments. Table 2 lists the hyperparameters used for our experimental evaluations. We adopt AdaBound (Luo et al., 2019) as an optimizer. Early stopping is used based on the performance of development set. We repeat the experiment 5 times with different random seeds and report average and standard deviation of F1-scores on a test set as the final performance.   GENIA is quite different from that of the corpus used for training the BERT model. Regardless, it is demonstrated that our method outperforms existing methods.

Ablation Study
We conduct an ablation study to verify the effectiveness of our learning and decoding methods. We first replace our objective function for training with the standard objective function of the liner-chain CRF. The methods for decoding N -best paths have been well studied because such algorithms have been required in many domains (Soong and Huang, 1990;Kaji et al., 2010;Huang et al., 2012). However, we hypothesize that our learning method, as well as our decoding method, helps to improve performance. That is why we first remove only our learning method. Then, we also replace our decoding algorithm with the standard decoding algorithm of the linearchain CRF. It is equivalent to preparing the conventional CRF for each entity type separately. The results are shown in Table 4. They demonstrate that introducing only our decoding algorithm surely brings high recall scores but hurts precision. This suggests that our learning method should be necessary for achieving high precision. Besides, removing the decoding algorithm results 9 Note that in ACE-2005, Ju et al. (2018 did their experiments with a different split from Lu and Roth (2015) that we follow. 10 Wang et al. (2018) did not report precision and recall scores. Instead of ,  reported the scores for the model of .

ACE-2005 GENIA Level
Recall ( 50.00 ± 0.00 2 -0  in lower recall. This is natural because it does not intend to find any nested entity after extracting outermost entities. Thus, it is demonstrated that both our learning and decoding algorithms contribute much to good performance across these datasets.

Analysis of Behavior
To further understand how our method handles nested entities, we investigate the performances for entities of each level. Table 5 shows the recall scores for gold entities of each level when using conventional word embeddings. Among all levels, our model results in the best performance at the 1st level that consists of only gold outermost entities. The deeper a level, the lower recall scores. On the other hard, Table 6 shows the precision scores Method # tokens per second  459  2,059 This work 5,594  for predicted entities in each level of one trial on each dataset. Because the number of levels in the predictions vary between trials, taking macro average of precision scores over multiple trials is not representative. Therefore, we show only the precision scores from one trial in Table 6. The precision score for the 4th level on ACE-2005 is as high as or higher than those of other levels. Precision scores are less dependent on level. This tendency is also shown in other trials.

Running Time
We examine the decoding speed in terms of the number of words processed per second. We compare our model with the models of  and  using their implementations. The used platform is the same (Py-Torch), and the machine on which we run them is also the same (CPU: Intel i7 2.7 GHz). Results on ACE-2005 are listed in Table 7. Our model turns out to be faster than theirs, showing its scalability.

Comparison on ACE-2004
We compare our method with existing methods also on the ACE-2004 dataset. We use the same splits as Lu and Roth (2015). The setups are the same as those of our experiment on ACE-2005. Table 8 shows the results. As shown, our method significantly outperforms existing methods. Note that most of them use embeddings of POS tags as an additional input feature whereas our method does not. 11   did not report precision and recall scores. Instead of ,  reported the scores for the model of .

Error Analysis
We manually scan the test set predictions on ACE-2005. We find out that many of the errors can be classified into two types.
The first type is partial prediction error. Given the following sentence: "Let me set aside the hypocrisy of a man who became president because of a lawsuit trying to eliminate everybody else's lawsuits, but instead focus on his own experience". Our model predicts "a man who became president" or "a man who became president because of a lawsuit trying to eliminate everybody else's lawsuits" as PER (Person) while the annotation marks "a man who became president because of a lawsuit". It is difficult for our model to extract the proper spans of clauses or independent sentences that contain numerous modifiers.
The second type is error derived from pronominal mention, as reported by Katiyar and Cardie (2018). Consider the following example: "They roar, they screech.". These "They"s refer to "tanks" in another sentence of the same document and are indeed annotated as VEH (Vehicle). Our model fails to detect these pronominal mentions or wrongly labels them as PER. Document context should be taken into consideration to solve this problem.