Abstract

When an entity name contains other names within it, the identification of all combinations of names can become difficult and expensive. We propose a new method to recognize not only outermost named entities but also inner nested ones. We design an objective function for training a neural model that treats the tag sequence for nested entities as the second best path within the span of their parent entity. In addition, we provide the decoding method for inference that extracts entities iteratively from outermost ones to inner ones in an outside-to-inside way. Our method has no additional hyperparameters to the conditional random field based model widely used for flat named entity recognition tasks. Experiments demonstrate that our method performs better than or at least as well as existing methods capable of handling nested entities, achieving F1-scores of 85.82%, 84.34%, and 77.36% on ACE-2004, ACE-2005, and GENIA datasets, respectively.

1 Introduction

Named entity recognition (NER) is the task of identifying text spans associated with proper names and classifying them according to their semantic class such as person or organization. NER, or in general the task of recognizing entity mentions, is one of the first stages in deep language understanding, and its importance has been well recognized in the NLP community (Nadeau and Sekine, 2007).

One popular approach to the NER task is to regard it as a sequence labeling problem. In this case, it is implicitly assumed that mentions are not nested in texts. However, names often contain entities nested within themselves, as illustrated in Figure 1, which contains 3 mentions of the same type (PROTEIN) in the span “… in Ca2+ -dependent PKC isoforms in …”, taken from the GENIA dataset (Kim et al., 2003). Name nesting is common, especially in technical domains (Alex et al., 2007; Byrne, 2007; Wang, 2009). The assumption of no nesting leads to loss of potentially important information and may negatively impact subsequent downstream tasks. For instance, a downstream entity linking system that relies on NER may fail to link the correct entity if the entity mention is nested.

Figure 1: Example of nested entities.

Various approaches to recognizing nested entities have been proposed. Many of them rely on producing and rating all possible (sub)spans, which can be computationally expensive. Wang and Lu (2018) provided a hypergraph-based approach to consider all possible spans. Sohrab and Miwa (2018) proposed a neural exhaustive model that enumerates and classifies all possible spans. These methods, however, achieve high performance at the cost of time complexity. To reduce the running time, they set a threshold to discard longer entity mentions. If the hyperparameter is set low, running time is reduced but longer mentions are missed. In contrast, Muis and Lu (2017) proposed a sequence labeling approach that assigns tags to gaps between words, which efficiently handles sequences using Viterbi decoding. However, this approach suffers from structural ambiguity issues during inference, as explained by Wang and Lu (2018). Katiyar and Cardie (2018) proposed another hypergraph-based approach that learns the structure in a greedy manner. However, their method uses an additional hyperparameter as the threshold for selecting multiple mention candidates. This hyperparameter affects the trade-off between recall and precision.

In this paper, we propose new learning and decoding methods to extract nested entities without any additional hyperparameters. We summarize our contributions as follows:

  • We describe a decoding method that iteratively recognizes entities from outermost ones to inner ones without structural ambiguity. It recursively searches a span of each extracted entity for inner nested entities using the Viterbi algorithm. This algorithm does not require hyperparameters for the maximal length or number of mentions considered.

  • We also provide a novel learning method that ensures the aforementioned decoding. Models are optimized based on an objective function designed according to the decoding procedure.

  • Empirically, we demonstrate that our method performs better than or at least as well as the current state-of-the-art methods with 85.82%, 84.34%, and 77.36% in F1-score on three standard datasets: ACE-2004,1 ACE-2005,2 and GENIA.

2 Method

In this study, we propose applying conditional random fields (CRFs; Lafferty et al., 2001), which are commonly used for flat NER (Lample et al., 2016; Ma and Hovy, 2016; Chiu and Nichols, 2016; Reimers and Gurevych, 2017; Strubell et al., 2017; Akbik et al., 2018), to nested NER. We first explain our usage of CRF, which is the basis of our decoding and training methods, and then introduce the two methods themselves. Both focus on the output layer of neural architectures and can therefore be combined with any neural model.

2.1 Usage of CRF

Our decoding and training methods are based on two key points about our usage of CRF. The first key point is that we prepare a separate CRF for each named entity type. This enables our method to handle the situation where the same mention span is assigned multiple entity types. The GENIA dataset indeed has such mention spans. Muis and Lu (2017) demonstrated that this approach of multiple CRFs performs better than the standard approach of a single CRF for all entity types on nested NER datasets and even on a flat NER dataset. The second key point is that each element of the transition matrix of each CRF has a fixed value according to whether it corresponds to a legal transition (e.g., B-X to I-X in the IOBES tagging scheme, where X is the name of the entity type) or an illegal one (e.g., O to I-X). This helps keep the scores for tag sequences including outer entities higher than those of tag sequences including inner entities.

Formally, we use $Z = \{z_1, \ldots, z_n\}$ to represent a sequence output from the last hidden layer of a neural model, where $z_i$ is the vector for the $i$-th word, and $n$ is the number of tokens. $y^{(k)} = \{y_1^{(k)}, \ldots, y_n^{(k)}\}$ represents a sequence of IOBES tags of entity type $k$ for $Z$. Here, we define the score function to be

$$\phi_k\left(y_{i-1}^{(k)}, y_i^{(k)}, z_i\right) = P^{(k)}_{y_i^{(k)},\, i} + A^{(k)}_{y_{i-1}^{(k)},\, y_i^{(k)}}, \tag{1}$$

where

$$P^{(k)}_{y_i^{(k)},\, i} = W^{(k)}_{y_i^{(k)}} z_i + b^{(k)}_{y_i^{(k)}}, \qquad A^{(k)}_{y_{i-1}^{(k)},\, y_i^{(k)}} = \begin{cases} -\infty, & \text{if } y_{i-1}^{(k)} \rightarrow y_i^{(k)} \text{ is illegal}, \\ 0, & \text{otherwise}. \end{cases}$$

$W^{(k)}_{y_i^{(k)}}$ and $b^{(k)}_{y_i^{(k)}}$ denote the weight matrix and the bias vector corresponding to $y_i^{(k)}$, respectively. $A^{(k)}$ stands for the transition matrix from the previous token to the current token, and $A^{(k)}_{y_{i-1}^{(k)},\, y_i^{(k)}}$ is the transition score from $y_{i-1}^{(k)}$ to $y_i^{(k)}$. $Z$ is shared between all of the multiple CRFs as their input.
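
As a rough illustration of Equation (1), the following Python sketch builds the fixed transition matrix for one entity type and evaluates the score function. The tag order, the helper names (`is_legal`, `transition_matrix`, `emission_scores`, `phi`), and the use of NumPy are assumptions made for this example only.

```python
import numpy as np

TAGS = ["O", "B", "I", "E", "S"]  # IOBES tags of one entity type (order is an assumption)


def is_legal(prev, cur):
    """Return True if the transition prev -> cur is legal under IOBES."""
    if prev in ("B", "I"):             # an entity is still open
        return cur in ("I", "E")       # it must continue or close
    return cur in ("O", "B", "S")      # otherwise stay outside or open a new entity


def transition_matrix(tags=TAGS):
    """A[p, c] is 0 for legal transitions and -inf for illegal ones (fixed, not learned)."""
    A = np.zeros((len(tags), len(tags)))
    for p, prev in enumerate(tags):
        for c, cur in enumerate(tags):
            if not is_legal(prev, cur):
                A[p, c] = -np.inf
    return A


def emission_scores(Z, W, b):
    """P[i, t] = W[t] . z_i + b[t] for every token i and tag t."""
    return Z @ W.T + b


def phi(P, A, i, prev, cur):
    """Score of Equation (1): emission of `cur` at token i plus the transition score."""
    return P[i, cur] + A[prev, cur]
```

Because legal transitions contribute 0, the score of any legal tag sequence is simply the sum of its emission scores, which is the property the decoding procedure relies on.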

2.2 Decoding

We use three strategies for decoding. First, we consider each entity type separately using multiple CRFs in decoding, which makes it possible to handle the situation where the same mention span is assigned multiple entity types. Second, our decoder searches nested entities in an outside-to-inside way,3 which realizes efficient processing by eliminating non-entity spans at an early stage. More specifically, our method recursively narrows down the spans to Viterbi-decode. The spans to Viterbi-decode are dynamically decided according to the preceding Viterbi-decoding result. Only the spans that have just been recognized as entity mentions are Viterbi-decoded again. Third, we use the same scores $\phi_k(y_{i-1}^{(k)}, y_i^{(k)}, z_i)$ of Equation (1) to extract outermost entities and even inner entities without re-encoding, which makes inference more efficient and faster. These three strategies are deployed and completed only in the output layer of neural architectures.

We describe the pseudo-code of our decoding method in Algorithm 1. Also, we depict the overview of our decoding method with an example in Figure 2. We use the term level in the sense of the depth of entity nesting. [S] and [E] in Figure 2 stand for the START and END tags, respectively. We always attach these tags to both ends of every sequence of IOBES tags in Viterbi-decoding.

Figure 2: Overview of our second-best path decoding algorithm to iteratively find nested entities.

We explain the decoding procedure and mechanism in detail below. We consider each entity type separately and iterate the same decoding process over the distinct entity types, as described in Algorithm 1. In the decoding process for each entity type $k$, we first calculate the CRF scores $\phi_k(y_{i-1}^{(k)}, y_i^{(k)}, z_i)$ over the entire sentence. Next, we decode a sequence with the standard 1-best Viterbi decoding as with the conventional linear-chain CRF. In the example of Figure 2, “Ca2+ -dependent PKC isoforms” is extracted at the 1st level.

Then, we start our recursive decoding to extract nested entities within previously extracted entity spans by finding the 2nd best path. In Figure 2, the span “Ca2+ -dependent PKC isoforms” is processed at the 2nd level. Here, if we search for the best path within each span, the same tag sequence will be obtained, even though the processed span is different. This is because we continue using the same scores $\phi_k(y_{i-1}^{(k)}, y_i^{(k)}, z_i)$ and because all the values of $A^{(k)}$ corresponding to legal transitions are equal to 0. Regarding the example of Figure 2, the score of the transition from [S] to B-P at the 2nd level is equal to the score of the transition from O to B-P at the 1st level. The same is true for the transition from E-P to [E] at the 2nd level and the one from E-P to O at the 1st level. The best path between the [S] and [E] tags is identical to the best path between the two O tags under our restriction on the transition matrix of the CRF. Therefore, we search for the 2nd best path within the span by utilizing the N-best Viterbi A* algorithm (Seshadri and Sundberg, 1994; Huang et al., 2012).4 Note that our situation is different from normal situations where N-best decoding is needed: we already know the best path within the span and want to find only the 2nd best path. Thus, we can extract nested entities by finding the 2nd best path within each extracted entity. Regarding the example of Figure 2, “PKC isoforms” is extracted from the span “Ca2+ -dependent PKC isoforms” at the 2nd level.

We continue this recursive decoding until no multi-token entities are detected within a span. In Figure 2, the span “PKC isoforms” is processed at the 3rd level. At the 3rd or deeper levels, the tag sequence of the grandparent level is no longer either the best path or the 2nd best path because the start or end position of the current span is in the middle of the entity mention span at the grandparent level. In the example shown in Figure 2, the word “PKC” is tagged I-P at the 1st level, and the transition from [S] to I-P is illegal. The scores of paths that include illegal transitions cannot be larger than those of paths that consist of only legal transitions because the elements of the transition matrix $A^{(k)}$ corresponding to illegal transitions are set to $-\infty$. That is why at all levels below the 1st level we only need to find the 2nd best path.

This recursive processing is stopped when no entities are predicted or when only single-token entities are detected within a span.5 In Figure 2, the span “PKC” is not processed any more because it is a single-token entity.

Only one nested entity is extracted within each decoded span in Figure 2, but there can be cases where multiple multi-token entities are detected within a decoded span. In such cases, our algorithm Viterbi-decodes each of their spans in depth-first order. The aforementioned processing is executed for all entity types, and all detected entities are returned as the output.
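
The recursion described above can be sketched in Python for a single entity type as follows. This is a minimal sketch under several assumptions: it replaces the N-best Viterbi A* search with a plain k-best Viterbi that keeps the top k partial paths per tag, it stores whole paths instead of back-pointers, and the names `k_best`, `spans_of`, and `decode` as well as the guard against re-adding the current span are introduced here for illustration. Since legal transitions score 0 and illegal ones $-\infty$, only the emission matrix `P` is needed.

```python
import numpy as np

TAGS = ["O", "B", "I", "E", "S"]  # column order of the emission matrix (assumption)


def legal(prev, cur):
    """IOBES legality, including the special [S] and [E] tags."""
    if prev in ("B", "I"):
        return cur in ("I", "E")
    return cur in ("O", "B", "S", "[E]")


def k_best(P, s, e, k):
    """Top-k legal tag sequences for tokens s..e (inclusive) as (score, tags) pairs."""
    beams = {t: ([(P[s, j], [t])] if legal("[S]", t) else [])
             for j, t in enumerate(TAGS)}
    for i in range(s + 1, e + 1):
        new = {}
        for j, t in enumerate(TAGS):
            cands = [(sc + P[i, j], path + [t])
                     for p in TAGS if legal(p, t) for sc, path in beams[p]]
            new[t] = sorted(cands, key=lambda c: -c[0])[:k]
        beams = new
    final = [c for t in TAGS if legal(t, "[E]") for c in beams[t]]
    return sorted(final, key=lambda c: -c[0])[:k]


def spans_of(tags, offset):
    """Entity spans (start, end) encoded by an IOBES sequence starting at `offset`."""
    out, start = [], None
    for i, t in enumerate(tags):
        if t == "S":
            out.append((offset + i, offset + i))
        elif t == "B":
            start = i
        elif t == "E":
            out.append((offset + start, offset + i))
    return out


def decode(P):
    """Outside-to-inside decoding: best path first, then 2nd best path inside each entity."""
    found = set()
    stack = spans_of(k_best(P, 0, len(P) - 1, 1)[0][1], 0)  # 1st level
    while stack:                                            # depth-first over spans
        s, e = stack.pop()
        found.add((s, e))
        if e > s:                                           # only multi-token entities
            paths = k_best(P, s, e, 2)
            if len(paths) == 2:
                inner = spans_of(paths[1][1], s)
                stack.extend(sp for sp in inner if sp != (s, e))  # guard (assumption)
    return sorted(found)
```

For several entity types, `decode` would be called once per type with that type's emission matrix, and the results would be pooled.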

2.3 Training

To extract entities from outside to inside successfully, a model has to be trained so that the scores for the paths including outer entities are higher than those for the paths including inner entities. We propose a new objective function to meet this requirement.

We maximize the log-likelihood of the correct tag sequence as with the conventional CRF-based model. Considering that our model has a separate CRF for each entity type, the log-likelihood for one training example, $\mathcal{L}(\theta)$, is as follows:

$$\mathcal{L}(\theta) = \sum_k \log p\left(Y^{(k)} \mid Z; \theta\right), \tag{2}$$

where $\theta$ is the set of parameters of a neural model, and $Y^{(k)}$ denotes the collection of the gold IOBES tags for all levels regarding the entity type $k$. As we mentioned in Section 2.1, $Z$ is a sequence output from the last hidden layer of a neural model and is shared between all of the multiple CRFs. Therefore, $\theta$ is updated through backpropagation so that $Z$ can represent information about all entity types.

In the following, we decompose the log-likelihood for all levels into the ones for each level. Let $s_{l,j}^{(k)}$ and $e_{l,j}^{(k)}$ denote the start and end positions of the $j$-th span at the $l$-th level. With regard to the 1st level, $s_{1,1}^{(k)} = 1$ and $e_{1,1}^{(k)} = n$ because we consider the whole span of a sentence. The spans considered at each deeper level, $l > 1$, are determined according to the spans of multi-token entities at its immediate parent level. In the example of Figure 2, only the span of “Ca2+ -dependent PKC isoforms” is considered at the 2nd level. Here, the log-likelihood for each entity type can be expressed as follows:

$$\log p\left(Y^{(k)} \mid Z; \theta\right) = \mathcal{L}_{1st}\left(y_{1,1}^{(k)}, \ldots, y_{1,n}^{(k)} \mid Z; \theta\right) + \sum_{l>1} \sum_j \mathcal{L}_{2nd}\left(y_{l,\, s_{l,j}^{(k)}}^{(k)}, \ldots, y_{l,\, e_{l,j}^{(k)}}^{(k)} \mid Z; \theta\right), \tag{3}$$

where $\mathcal{L}_{1st}$ and $\mathcal{L}_{2nd}$ are the log-likelihoods of the (1st) best and 2nd best paths for each span, respectively. $y_{l,i}^{(k)}$ denotes the correct IOBES tag of position $i$ at the $l$-th level of the entity type $k$.
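
To make the notion of levels concrete, the sketch below derives the nesting level of each gold span and the gold IOBES tags inside a parent span for one entity type. The function names and the 0-based inclusive `(start, end)` span representation are assumptions of this example, and the level computation assumes that spans are properly nested (no crossing spans, no duplicates).

```python
def nesting_levels(spans):
    """Map each span to its level: 1 for outermost, +1 for every enclosing span."""
    levels = {}
    for s, e in spans:
        enclosing = [(a, b) for a, b in spans
                     if (a, b) != (s, e) and a <= s and e <= b]
        levels[(s, e)] = 1 + len(enclosing)
    return levels


def iobes_tags(parent, children):
    """Gold IOBES tags inside `parent` that mark its direct child spans."""
    s, e = parent
    tags = ["O"] * (e - s + 1)
    for a, b in children:
        if a == b:
            tags[a - s] = "S"
        else:
            tags[a - s], tags[b - s] = "B", "E"
            for i in range(a + 1, b):
                tags[i - s] = "I"
    return tags


# Tokens of the running example: in / Ca2+ / -dependent / PKC / isoforms / in
spans = [(1, 4), (3, 4), (3, 3)]
levels = nesting_levels(spans)
level1 = iobes_tags((0, 5), [(1, 4)])   # whole sentence, used by the best-path term
level2 = iobes_tags((1, 4), [(3, 4)])   # inside the outer entity, used by the 2nd-best term
```

Here `level1` is the gold sequence scored by the first term of Equation (3), and `level2` is one of the sequences scored by the second term.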
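Best path. $\mathcal{L}_{1st}$ can be calculated in the same manner as the conventional linear-chain CRF:

$$\mathcal{L}_{1st}\left(y_{1,1}^{(k)}, \ldots, y_{1,n}^{(k)} \mid Z; \theta\right) = \psi_{1:n}^{(k)}\left(\mathbf{y}_{1,1}^{(k)}, Z\right) - \log \sum_{\mathbf{y} \in \mathcal{Y}_{1:n}^{(k)}} \exp \psi_{1:n}^{(k)}\left(\mathbf{y}, Z\right), \tag{4}$$

where

$$\psi_{s:e}^{(k)}\left(\mathbf{y}, Z\right) = \sum_{i=s}^{e} \phi_k\left(y_{i-1}, y_i, z_i\right) + A^{(k)}_{y_e,\, y_{e+1}}, \qquad y_{s-1} = \text{[S]}, \quad y_{e+1} = \text{[E]}.$$

Here $\mathbf{y}_{l,j}^{(k)}$ denotes the gold tag sequence of the $j$-th span at the $l$-th level, so $\mathbf{y}_{1,1}^{(k)}$ is the gold sequence for the whole sentence. $\mathcal{Y}_{s:e}^{(k)}$ denotes the set of all possible tag sequences from position $s$ to position $e$ of the entity type $k$. The first term of Equation (4) is the score of the gold tag sequence, and the second term is the logarithm of the summation of the exponential scores of all possible tag sequences. It is well known that the second term of Equation (4) can be efficiently calculated by the algorithm shown in Algorithm 2.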
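
The second term of Equation (4) is the standard forward recursion of a linear-chain CRF. The following is a minimal NumPy sketch of that recursion, not a transcription of Algorithm 2. For simplicity it passes the scores for [S] and [E] as separate vectors `a_start` and `a_end`; this interface, like the function names, is an assumption of the example.

```python
import numpy as np


def logsumexp(x):
    """Numerically stable log(sum(exp(x))) that also accepts -inf entries."""
    m = np.max(x)
    if np.isneginf(m):
        return -np.inf
    return m + np.log(np.sum(np.exp(x - m)))


def path_score(y, P, A, a_start, a_end):
    """Score of one tag sequence y: emissions plus transitions, including [S] and [E]."""
    score = a_start[y[0]] + P[0, y[0]]
    for i in range(1, len(y)):
        score += A[y[i - 1], y[i]] + P[i, y[i]]
    return score + a_end[y[-1]]


def log_partition(P, A, a_start, a_end):
    """log of the sum of exp(path_score) over all tag sequences, in O(n * T^2) time.

    P: (n, T) emission scores, A: (T, T) transition scores,
    a_start / a_end: (T,) scores for [S] -> tag and tag -> [E].
    """
    n, T = P.shape
    alpha = a_start + P[0]                      # log-sums of all prefixes of length 1
    for i in range(1, n):
        alpha = np.array([logsumexp(alpha + A[:, t]) for t in range(T)]) + P[i]
    return logsumexp(alpha + a_end)
```

The log-likelihood of Equation (4) is then `path_score(gold, ...) - log_partition(...)`.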

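2nd best path. $\mathcal{L}_{2nd}$ given the best path can be calculated by excluding the best path from all possible paths. This concept is also adopted by ListNet (Cao et al., 2007), which is used for ranking tasks such as document retrieval or recommendation. $\mathcal{L}_{2nd}$ can be expressed by the following equation:

$$\mathcal{L}_{2nd}\left(y_{l,\, s_{l,j}^{(k)}}^{(k)}, \ldots, y_{l,\, e_{l,j}^{(k)}}^{(k)} \mid Z; \theta\right) = \psi_{s_{l,j}^{(k)}:e_{l,j}^{(k)}}^{(k)}\left(\mathbf{y}_{l,j}^{(k)}, Z\right) - \log \sum_{\mathbf{y} \in \tilde{\mathcal{Y}}_{s_{l,j}^{(k)}:e_{l,j}^{(k)}}^{(k)}} \exp \psi_{s_{l,j}^{(k)}:e_{l,j}^{(k)}}^{(k)}\left(\mathbf{y}, Z\right), \tag{5}$$

where $\tilde{\mathcal{Y}}_{s:e}^{(k)}$ denotes the set of all possible tag sequences except the best path within the span from position $s$ to position $e$ of the entity type $k$.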

However, to the best of our knowledge, an efficient way of computing the second term of Equation (5) has not yet been proposed in the literature. Simply subtracting the exponential score of the best path from the summation of the exponential scores of all possible paths causes underflow, overflow, or loss of significant digits.
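
The loss of significant digits can be reproduced with two hypothetical path scores, 50 and 0, chosen purely for illustration. In double precision the smaller exponential is absorbed by the larger one, so the naive subtraction leaves nothing, whereas summing the non-best paths directly gives the correct value.

```python
import math

scores = [50.0, 0.0]            # hypothetical path scores; 50.0 is the best path
best = max(scores)

# Naive approach: sum all exponentials, then subtract the best one.
total = sum(math.exp(s) for s in scores)      # exp(0) is lost when added to exp(50)
diff = total - math.exp(best)
naive = math.log(diff) if diff > 0 else float("-inf")

# Direct approach: sum only the non-best paths (stable log-sum-exp).
rest = [s for s in scores if s != best]
m = max(rest)
direct = m + math.log(sum(math.exp(s - m) for s in rest))
```

With these values `naive` degenerates to negative infinity although the true answer, `direct`, is 0.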

We introduce a way of accurately computing it with the same time complexity as Algorithm 2 for Equation (4). For explanation, we use the simplified example of the lattice depicted in Figure 3, in which the span length is 4 and the number of states is 3. The special nodes for the start and end states are attached to both ends of the span. There are 81 (= 3^4) paths in this lattice. We assume that the path consisting of the top nodes of all time steps is the best path, as shown in Figure 3. No generality is lost by making this assumption. To calculate the second term of Equation (5), we have to consider the exponential scores for all the possible paths except the best path, that is, 80 (= 81 − 1) paths.

Figure 3: Lattice and best path.

We first describe an intuition that is not our algorithm itself but helps in understanding it. In the example, we can group these 80 paths according to the steps where the best path is not taken. In this way, we have 4 spaces in total, as illustrated in Figure 4. In Space 1, the top node of time step 4 is excluded from consideration, so 54 (= 3^3 × 2) paths are taken into account. Since this space covers all paths that do not go through the top node of time step 4, we only have to consider the paths that go through this node in the other spaces. In Space 2, this node is always passed through, and instead the top node of time step 3 is excluded; 18 (= 3^2 × 2) paths are considered in this space. Similarly, 6 (= 3^1 × 2) paths and 2 (= 3^0 × 2) paths are taken into consideration in Space 3 and Space 4, respectively. Thus, we can consider all the possible paths except the best path, that is, 80 (= 54 + 18 + 6 + 2) paths.
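
The count generalizes beyond the example. Writing $S$ for the number of states and $T$ for the span length (symbols introduced here for illustration), Space $j$ fixes the top node at the last $j - 1$ time steps, excludes it at time step $T - j + 1$, and leaves the earlier steps free, which gives a geometric series:

```latex
\sum_{j=1}^{T} (S-1)\,S^{T-j}
  \;=\; (S-1)\,\frac{S^{T}-1}{S-1}
  \;=\; S^{T}-1,
\qquad
\text{e.g. } S=3,\ T=4:\quad 54+18+6+2 \;=\; 80 \;=\; 3^{4}-1 .
```

Every path other than the best one is therefore counted exactly once.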

Figure 4: Divided search spaces.

We introduce two tricks for making the calculation more efficient. We explain them with Figure 5, which shows Spaces 2 and 3. The first trick is that the two separated spaces can be merged at time step 4 because the paths after time step 3 are identical. When we reach time step 4 in the forward iteration in each of the two spaces, we can merge them using the calculation results at time step 3, as shown with the red edges in Figure 5. The second trick is that the blue nodes in Figure 5 can be copied from Space 2 to Space 3 at time step 2 since the paths considered up to that time step are also the same. These two tricks can be applied to other pairs of adjacent spaces, which removes the need to separately calculate the summation of the exponential scores for each space. Therefore, the second term of Equation (5) can be calculated as shown in Algorithm 3.

Figure 5: Merge of search spaces.

Thus, we can train a model using the objective function of Equations 2, 3, 4, and 5.
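
One way to realize the idea of summing over all paths except the best one, without ever subtracting, is sketched below in NumPy. It is not a transcription of Algorithm 3. It keeps a scalar `on` for the prefix that still coincides with the best path and a vector `off` for the log-sums of prefixes that have already left it, which gives the same O(n T^2) cost as the ordinary forward recursion. The function name, the `a_start` / `a_end` interface, and this particular bookkeeping are assumptions of the example.

```python
import numpy as np


def logsumexp(x):
    """Numerically stable log(sum(exp(x))) that also accepts -inf entries."""
    m = np.max(x)
    if np.isneginf(m):
        return -np.inf
    return m + np.log(np.sum(np.exp(x - m)))


def log_partition_excluding(P, A, a_start, a_end, best):
    """log-sum-exp of the scores of all tag sequences except `best`.

    P: (n, T) emission scores, A: (T, T) transition scores,
    a_start / a_end: (T,) scores for [S] -> tag and tag -> [E],
    best: list of n tag indices (the path to exclude).
    """
    n, T = P.shape
    on = a_start[best[0]] + P[0, best[0]]   # prefix identical to the best path
    off = a_start + P[0]                    # prefixes that differ from the best path
    off[best[0]] = -np.inf                  # the best prefix is tracked by `on` only
    for i in range(1, n):
        new_off = np.empty(T)
        for t in range(T):
            cands = off + A[:, t]           # already deviated earlier
            if t != best[i]:                # deviate from the best path right now
                cands = np.append(cands, on + A[best[i - 1], t])
            new_off[t] = logsumexp(cands) + P[i, t]
        on = on + A[best[i - 1], best[i]] + P[i, best[i]]
        off = new_off
    return logsumexp(off + a_end)           # `on` (the best path itself) is left out
```

The value of Equation (5) for a span is then the score of the gold sequence minus the result of this function evaluated on that span.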

2.4 Characteristics

Time complexity. Regarding the time complexity of the decoder, the worst case for our method is when our decoder narrows down the spans one by one, from n tokens (a whole sentence) to 2 tokens. The time complexity for the worst case is therefore $O(n + \cdots + 2) = O(n^2)$ for each entity type and $O(mn^2)$ in total, where $m$ denotes the number of entity types. However, this rarely happens. The ideal average processing time in the case where our decoding method narrows down spans successfully according to gold labels is $O(dmn)$, where $d$ is the average number of gold IOBES tags of each entity type assigned to a word. The average numbers calculated from the gold labels of ACE-2004, ACE-2005, and GENIA are 1.06, 1.06, and 1.05, respectively.
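
The worst-case bound follows from an arithmetic series, treating one Viterbi pass over a span as linear in the span length for a fixed tag set:

```latex
\underbrace{n + (n-1) + \cdots + 2}_{\text{one Viterbi pass per span length}}
  \;=\; \frac{n(n+1)}{2} - 1 \;=\; O(n^{2}),
\qquad
m \cdot O(n^{2}) \;=\; O(mn^{2}).
```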

Usability. Some existing methods have hyperparameters, such as the maximal length of considered entities or the threshold that affects the number of detected entities, beyond those of the conventional CRF-based model used for flat NER tasks. These hyperparameters must be tuned depending on datasets. On the other hand, our method does not have such hyperparameters and is easy to use from this viewpoint. In addition, our method focuses on the output layer of neural architectures; therefore our method can be combined with any neural model.

We verify the empirical performance of our method in the following sections.

3 Experimental Settings

3.1 Datasets

We perform nested entity extraction experiments intensively on ACE-2005 (Doddington et al., 2004) and GENIA (Kim et al., 2003). For ACE-2005, we use the same splits of documents as Lu and Roth (2015), published on their website.6 For GENIA, we use GENIAcorpus3.02p,7 in which sentences are already tokenized (Tateisi and Tsujii, 2004). Following previous work (Finkel and Manning, 2009; Lu and Roth, 2015), we first split the last 10% of sentences as the test set. Next, we use the first 81% and the subsequent 9% for training and development sets, respectively. We make the same modifications as described by Finkel and Manning (2009) by collapsing all DNA, RNA, and protein subtypes into DNA, RNA, and protein, keeping cell line and cell type, and removing other entity types, resulting in 5 entity types. The statistics of each dataset are shown in Table 1.
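
The GENIA split by sentence position can be sketched as follows. The rounding rule (integer floor at the 81% and 90% boundaries) and the function name are assumptions of this example; with the sentence total implied by Table 1 (15,022 + 1,669 + 1,855 = 18,546), flooring happens to reproduce the reported split sizes.

```python
def split_by_position(sentences):
    """Return (train, dev, test): first 81%, next 9%, last 10% of the sentences."""
    n = len(sentences)
    train_end = n * 81 // 100      # assumption: boundaries are floored
    dev_end = n * 90 // 100
    return sentences[:train_end], sentences[train_end:dev_end], sentences[dev_end:]


train, dev, test = split_by_position(list(range(18546)))
sizes = (len(train), len(dev), len(test))
```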

Table 1: Statistics of the datasets used in the experiments. Note that in ACE-2005, sentences are not originally split. We report the numbers of sentences based on the preprocessing with Stanford CoreNLP (Manning et al., 2014).

| | ACE-2005 Train (%) | ACE-2005 Dev (%) | ACE-2005 Test (%) | GENIA Train (%) | GENIA Dev (%) | GENIA Test (%) |
|---|---|---|---|---|---|---|
| # documents | 370 | 43 | 51 | – | – | – |
| # sentences | (7,285) | (968) | (1,058) | 15,022 | 1,669 | 1,855 |
| # mentions | 24,827 | 3,234 | 3,041 | 47,027 | 4,469 | 5,600 |
| - 1st level | 21,966 (88) | 2,900 (90) | 2,686 (88) | 44,611 (95) | 4,239 (95) | 5,273 (94) |
| - 2nd level | 2,635 (11) | 316 (10) | 323 (11) | 2,393 (5) | 230 (5) | 327 (6) |
| - 3rd level | 215 (1) | 18 (1) | 30 (1) | 23 (0) | (0) | (0) |
| - 4th level | (0) | (0) | (0) | (0) | (0) | (0) |
| # labels per token (d) | 1.06 | 1.05 | 1.05 | 1.05 | 1.05 | 1.05 |

3.2 Model and Training

In this study, we adopt as our baseline a BiLSTM-CRF model, which is widely used for NER tasks (Lample et al., 2016; Ma and Hovy, 2016; Chiu and Nichols, 2016; Reimers and Gurevych, 2017). We apply our usage of CRF to this baseline. We prepare three types of models for fair comparisons with existing methods. The first model takes conventional word embeddings and a CNN-based character-level representation as input (Ma and Hovy, 2016; Chiu and Nichols, 2016; Reimers and Gurevych, 2017).8 We initialize word embeddings with the 100-dimensional pretrained GloVe embeddings (Pennington et al., 2014) for ACE-2005. For GENIA, we instead adopt the pretrained embeddings trained on MEDLINE abstracts (Chiu et al., 2016). The initialized word embeddings are fixed during training. The vectors of the word embeddings and the character-level representation are concatenated and then input into the BiLSTM layer. The second model is combined with the pretrained BERT model (Devlin et al., 2019).9 We use the uncased version of the BERT large model as a generator of contextual word embeddings without fine-tuning and stack the BiLSTM layers on top of the BERT model. The third model is the BiLSTM-CRF model that takes word embeddings, the character-level representation, BERT embeddings, and FLAIR embeddings (Akbik et al., 2018) as input, using the FLAIR framework (Akbik et al., 2019).10 All our models have 2 BiLSTM hidden layers, and the dimensionality of each hidden unit is 256 in all our experiments. Table 2 lists the hyperparameters used for our experimental evaluations. We adopt AdaBound (Luo et al., 2019) as the optimizer. Early stopping is based on performance on the development set. We repeat each experiment 5 times with different random seeds and report the average and standard deviation of F1-scores on the test set as the final performance.

Table 2: Hyperparameters in our experiments.

| Hyperparameter | Value |
|---|---|
| word dropout rate | 0.05 |
| character embedding dimension | 128 |
| CNN window size | |
| CNN filter number | 256 |
| batch size | 32 |
| LSTM hidden size | 256 |
| LSTM dropout rate | 0.2 (w/o BERT), 0.5 (w/ BERT) |
| gradient clipping | 5.0 |

4 Experimental Results

4.1 Comparison with Existing Methods

Table 3 presents comparisons of our model with existing methods. Note that some existing methods use embeddings of POS tags as an additional input feature, whereas our method does not. When using only word embeddings and the character-level representation, our method outperforms the existing methods, with F1-scores of 76.83% on ACE-2005 and 77.19% on GENIA. In particular, our method yields much higher recall than the other methods: the recall scores are improved by 3.1% and 2.4% on the ACE-2005 and GENIA datasets, respectively. These results demonstrate that our training and decoding algorithms are quite effective for extracting nested entities. Moreover, when we use BERT and FLAIR as contextual word embeddings, we achieve an F1-score of 83.99% with BERT and 84.34% with BERT and FLAIR on ACE-2005. On the other hand, BERT does not perform well on GENIA. We assume that this is because the domain of GENIA is quite different from that of the corpus used for training the BERT model. Overall, the results show that our method performs better than or at least as well as existing methods.

Table 3: Main results. We group methods into three types. The first group consists of the methods that do not use any contextual word embeddings. The second group consists of the methods that use BERT but do not use any other contextual word embeddings. The third group consists of the methods that use both BERT and FLAIR. “†” indicates the methods using POS tags.

| Method | ACE-2005 Precision (%) | ACE-2005 Recall (%) | ACE-2005 F1 (%) | GENIA Precision (%) | GENIA Recall (%) | GENIA F1 (%) |
|---|---|---|---|---|---|---|
| Katiyar and Cardie (2018) | 70.6 | 70.4 | 70.5 | 79.8 | 68.2 | 73.6 |
| Ju et al. (2018)11 | 74.2 | 70.3 | 72.2 | 78.5 | 71.3 | 74.7 |
| Wang et al. (2018)12 | 74.5 | 71.5 | 73.0 | 78.0 | 70.2 | 73.9 |
| Wang and Lu (2018) | 76.8 | 72.3 | 74.5 | 77.0 | 73.3 | 75.1 |
| Sohrab and Miwa (2018) | – | – | – | 93.2 | 64.0 | 77.1 |
| Zheng et al. (2019) | – | – | – | 75.9 | 73.6 | 74.7 |
| Fisher and Vlachos (2019) | 75.1 | 74.1 | 74.6 | – | – | – |
| Lin et al. (2019) | 76.2 | 73.6 | 74.9 | 75.8 | 73.9 | 74.8 |
| Straková et al. (2019)13 | 76.35 | 74.39 | 75.36 | 79.60 | 73.53 | 76.44 |
| This work | 78.27 ± 0.81 | 75.44 ± 0.37 | 76.83 ± 0.36 | 78.70 ± 0.69 | 75.74 ± 0.64 | 77.19 ± 0.10 |
| Fisher and Vlachos (2019) [BERT] | 82.7 | 82.1 | 82.4 | – | – | – |
| Straková et al. (2019) [BERT] | 82.58 | 84.29 | 83.42 | 79.92 | 76.55 | 78.20 |
| This work [BERT] | 83.30 ± 0.22 | 84.69 ± 0.37 | 83.99 ± 0.27 | 77.46 ± 0.65 | 76.65 ± 0.58 | 77.05 ± 0.12 |
| Straková et al. (2019) [BERT+FLAIR] | 83.48 | 85.21 | 84.33 | 80.11 | 76.60 | 78.31 |
| This work [BERT+FLAIR] | 83.83 ± 0.39 | 84.87 ± 0.09 | 84.34 ± 0.20 | 77.81 ± 0.69 | 76.94 ± 1.12 | 77.36 ± 0.26 |

4.2 Ablation Study

We conduct an ablation study to verify the effectiveness of our learning and decoding methods. We first replace our objective function for training with the standard objective function of the linear-chain CRF. The methods for decoding N-best paths have been well studied because such algorithms have been required in many domains (Soong and Huang, 1990; Kaji et al., 2010; Huang et al., 2012). However, we hypothesize that our learning method, as well as our decoding method, helps to improve performance. That is why we first remove only our learning method. Then, we also replace our decoding algorithm with the standard decoding algorithm of the linear-chain CRF. It is equivalent to preparing the conventional CRF for each entity type separately.

The results are shown in Table 4. They demonstrate that introducing only our decoding algorithm results in high recall but hurts precision. This suggests that our learning method is necessary for achieving high precision. In addition, removing the decoding algorithm results in lower recall. This is natural because the standard decoder does not search for nested entities after extracting the outermost ones. Thus, both our learning and decoding algorithms contribute substantially to the performance.

Table 4: Results when ablating away the learning (L) and decoding (D) components of our proposed method.

| | ACE-2005 Precision (%) | ACE-2005 Recall (%) | ACE-2005 F1 (%) | GENIA Precision (%) | GENIA Recall (%) | GENIA F1 (%) |
|---|---|---|---|---|---|---|
| This work | 78.27 ± 0.81 | 75.44 ± 0.37 | 76.83 ± 0.36 | 78.70 ± 0.69 | 75.74 ± 0.64 | 77.19 ± 0.10 |
| – L | 60.89 ± 1.30 | 75.38 ± 1.27 | 67.34 ± 0.37 | 70.72 ± 0.39 | 79.20 ± 1.27 | 74.71 ± 0.18 |
| – L&D | 77.77 ± 0.31 | 67.42 ± 0.29 | 72.22 ± 0.13 | 79.70 ± 0.56 | 73.41 ± 0.35 | 76.43 ± 0.28 |

4.3 Analysis of Behavior

To further understand how our method handles nested entities, we investigate the performance for entities of each level. Table 5 shows the recall scores for gold entities of each level when using conventional word embeddings. Among all levels, our model achieves the best performance at the 1st level, which consists of only gold outermost entities. The deeper the level, the lower the recall. On the other hand, Table 6 shows the precision scores for predicted entities at each level for one trial on each dataset. Because the number of levels in the predictions varies between trials, taking the macro average of precision scores over multiple trials is not representative. Therefore, we show only the precision scores from one trial in Table 6. The precision for the 5th level on ACE-2005 is as high as or higher than that of the other levels; precision thus depends less on the level than recall does. This tendency is also observed in other trials.

Table 5: Recall scores for gold annotations of each level.

| Level | ACE-2005 Recall (%) | ACE-2005 Num. | GENIA Recall (%) | GENIA Num. |
|---|---|---|---|---|
| 1st | 76.10 ± 0.50 | 2,686 | 77.92 ± 0.72 | 5,273 |
| 2nd | 71.70 ± 0.70 | 323 | 40.61 ± 1.74 | 327 |
| 3rd | 58.00 ± 5.42 | 30 | – | |
| 4th | 50.00 ± 0.00 | | – | |

Table 6: Precision scores for predictions of each level of one trial.

| Level | ACE-2005 Precision (%) | ACE-2005 Num. | GENIA Precision (%) | GENIA Num. |
|---|---|---|---|---|
| 1st | 80.36 | 2,500 | 80.29 | 5,038 |
| 2nd | 72.35 | 311 | 57.06 | 326 |
| 3rd | 79.07 | 43 | 66.67 | |
| 4th | 66.67 | | – | |
| 5th | 83.33 | | – | |

In addition, we compare the tendency of our method with that of an existing method. We select the method of Wang and Lu (2018) for comparison.14 We train their model on the ACE-2005 dataset using their original implementation and repeat this 5 times. The recall scores from the 1st level to the 4th level are 66.52%, 65.34%, 42.14%, and 50.00%, respectively. The tendency across levels is common to their method and ours, and the scores of our method (Table 5) are equal to or higher than theirs at every level. This shows that our method extracts both outer and inner entities at least as well. On the other hand, their method can extract crossing entities (two entities that overlap but neither of which is contained in the other), whereas our method cannot. Indeed, their model outputs some crossing spans in our experiments. In this case, we cannot analyze the results regarding precision scores in the same manner as in Table 6, because there are cases where one cannot uniquely decide the level of a span nested within multiple crossing spans. However, crossing entities are very rare (Lu and Roth, 2015; Wang et al., 2018), and the test sets of ACE-2005 and GENIA have no crossing entities. The inability of our method to handle crossing entities therefore does not have a negative impact on performance, at least on the ACE-2005 and GENIA datasets.
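
The definition of crossing entities given above can be written as a small predicate. The function name and the inclusive `(start, end)` span convention are assumptions of this sketch.

```python
def crosses(a, b):
    """True if spans a and b overlap but neither contains the other."""
    (s1, e1), (s2, e2) = a, b
    overlap = s1 <= e2 and s2 <= e1
    nested = (s1 <= s2 and e2 <= e1) or (s2 <= s1 and e1 <= e2)
    return overlap and not nested


examples = {
    "crossing": crosses((0, 3), (2, 5)),   # partial overlap
    "nested": crosses((0, 5), (2, 3)),     # one inside the other
    "disjoint": crosses((0, 1), (2, 3)),   # no shared token
}
```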

4.4 Error Analysis

We manually scan the test set predictions on ACE-2005. We find that many of the errors can be classified into two types.

The first type is partial prediction error. Consider the following sentence: “Let me set aside the hypocrisy of a man who became president because of a lawsuit trying to eliminate everybody else’s lawsuits, but instead focus on his own experience”. The annotation marks “a man who became president because of a lawsuit”, but our model extracts a shorter or longer span. It is difficult to extract the proper spans of clauses that contain numerous modifiers.

The second type is error derived from pronominal mentions. Consider the following example: “They roar, they screech.” Both occurrences of “they” refer to “tanks” in another sentence of the same document and are indeed annotated as VEH (Vehicle). Our model fails to detect these pronominal mentions or wrongly labels them as PER (Person). Document context should be taken into consideration to solve this problem.

These types of errors have been reported by Katiyar and Cardie (2018), Ju et al. (2018), and Lin et al. (2019) and still remain as challenges.

4.5 Running Time

We investigate how our recursive decoding method affects decoding speed, measured as the number of words processed per second. We use the model trained on ACE-2005 that was used for Table 6 and set the maximal depth of decoding to 1, 2, 3, 4, 5, and ∞ (no restriction). When the maximal depth is n, our decoder Viterbi-decodes only from the 1st level to the n-th level. Note that, when the maximal depth is 1, the decoding process is exactly the same as the Viterbi decoding of the standard CRF. We run all settings on an Intel i7 (2.7 GHz) CPU.
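The effect of the maximal depth can be pictured with the following minimal sketch of depth-limited, outside-to-inside recursion. It is not our actual decoder: the callable `find_spans` is a hypothetical stand-in for the CRF decoding step (Viterbi decoding at the 1st level, second-best path decoding inside a parent span at deeper levels), and `toy_find_spans` simply looks spans up in a hard-coded list so that the control flow can be run.

```python
from typing import Callable, List, Optional, Tuple

Span = Tuple[int, int]                          # (start, end), end exclusive
Finder = Callable[[int, int, int], List[Span]]  # (start, end, depth) -> spans


def decode_nested(start: int, end: int, find_spans: Finder,
                  max_depth: Optional[int] = None,
                  depth: int = 1) -> List[Tuple[int, int, int]]:
    """Collect (start, end, level) triples from outermost spans to inner ones."""
    if max_depth is not None and depth > max_depth:
        return []  # deeper levels are not decoded
    found = []
    for s, e in find_spans(start, end, depth):
        found.append((s, e, depth))
        # A single-token entity cannot contain another entity of the same type,
        # so only multi-token spans are searched recursively.
        if e - s > 1:
            found.extend(decode_nested(s, e, find_spans, max_depth, depth + 1))
    return found


# Hypothetical entity spans, used only to make the sketch executable.
TOY_ENTITIES = [(0, 5), (1, 3), (1, 2), (6, 8)]


def toy_find_spans(start: int, end: int, depth: int) -> List[Span]:
    """Toy stand-in for the decoder: return the maximal spans inside [start, end)."""
    inside = [(s, e) for s, e in TOY_ENTITIES
              if start <= s and e <= end
              # below the 1st level, the parent span itself is excluded
              and (depth == 1 or (s, e) != (start, end))]
    return [x for x in inside
            if not any(y != x and y[0] <= x[0] and x[1] <= y[1] for y in inside)]


if __name__ == "__main__":
    for limit in (1, 2, None):
        print(limit, decode_nested(0, 8, toy_find_spans, max_depth=limit))
```

With `max_depth=1` the recursion never starts, so the procedure reduces to a single decoding pass over the sentence, which corresponds to the standard CRF setting described above.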

Results are listed in Table 7. The number of words processed per second decreases by 38% when the maximal depth increases from 1 to 2. There are two main reasons for this. First, our decoder needs extra processing to move across levels, which is unnecessary when the maximal depth is 1. Second, the number of spans extracted at the 2nd level is not negligible (12.5% of the number of spans extracted at the 1st level, as shown in Table 6). The numbers of spans extracted at the 3rd and deeper levels are small, so the speed changes little when the maximal depth increases beyond 2. Overall, our decoder takes less than twice as long as the standard CRF on ACE-2005.
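The two figures quoted above follow directly from the speeds in Table 7, as the short calculation below shows. It assumes that the rows of Table 7 correspond, in order, to maximal depths 1 through 5 followed by the unrestricted setting; the dictionary keys and function names are ours.

```python
# Tokens per second from Table 7, keyed by maximal depth (row order is an assumption).
TOKENS_PER_SECOND = {
    "1": 6083,
    "2": 3761,
    "3": 3655,
    "4": 3742,
    "5": 3723,
    "unrestricted": 3701,
}


def relative_drop(baseline: float, speed: float) -> float:
    """Fraction by which `speed` is lower than `baseline`."""
    return (baseline - speed) / baseline


def slowdown_factor(baseline: float, speed: float) -> float:
    """How many times longer decoding takes compared with the baseline."""
    return baseline / speed


if __name__ == "__main__":
    base = TOKENS_PER_SECOND["1"]
    for depth, speed in TOKENS_PER_SECOND.items():
        print(f"depth {depth}: drop {100 * relative_drop(base, speed):.1f}%, "
              f"slowdown x{slowdown_factor(base, speed):.2f}")
```

The drop from depth 1 to depth 2 rounds to 38%, and the slowdown factor stays below 2 for every setting, including unrestricted decoding.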

Table 7: Decoding speed on ACE-2005.

| Maximal depth | # tokens per second |
|---|---|
| 1 | 6,083 |
| 2 | 3,761 |
| 3 | 3,655 |
| 4 | 3,742 |
| 5 | 3,723 |
| ∞ (no restriction) | 3,701 |

4.6 Comparison on ACE-2004

We also compare our method with existing methods on the ACE-2004 dataset. We use the same splits as Lu and Roth (2015). The setups are the same as those of our experiment on ACE-2005. Table 8 shows the results. As shown, our method significantly outperforms existing methods. Note that most of them use POS tags as an additional input feature whereas our method does not.

Table 8: Comparison on ACE-2004. “†” indicates the methods using POS tags.

| Method | P (%) | R (%) | F1 (%) |
|---|---|---|---|
| Katiyar and Cardie (2018) | 72.3 | 66.8 | 69.7 |
| Wang et al. (2018)¹⁵ | 74.9 | 71.8 | 73.3 |
| Wang and Lu (2018) | 78.0 | 72.4 | 75.1 |
| Straková et al. (2019)¹⁶ | 78.92 | 75.33 | 77.08 |
| This work | 79.93 | 75.10 | 77.44 |
| | | | |
| Straková et al. (2019) [BERT] | 84.71 | 83.96 | 84.33 |
| This work [BERT] | 85.23 | 84.72 | 84.97 |
| | | | |
| Straková et al. (2019) [BERT+FLAIR] | 84.51 | 84.29 | 84.40 |
| This work [BERT+FLAIR] | 85.94 | 85.69 | 85.82 |

4.7 Flat NER

To assess how our model works on the flat NER task, we additionally evaluate it on CoNLL-2003 (Tjong Kim Sang and De Meulder, 2003), which is annotated with outermost entities only. The setups here are the same as those of our experiment on ACE-2005. We prepare not only our proposed model but also the ablated model that uses neither our training method nor our decoding method, as in Section 4.2. The former can extract spans nested within other extracted spans regardless of the properties of the dataset, whereas the latter never extracts spans within other extracted spans. We use the 100-dimensional GloVe embeddings for both models, as in our previous experiments.

The results are in Table 9. Here we compare our method with existing methods that do not adopt any contextual word embeddings (the upper part of Table 9), although we also show results from recent work with contextual word embeddings for reference. First, in comparison with the methods designed for nested NER (Wang and Lu, 2018; Straková et al., 2019), our method performs better even on CoNLL-2003. This means that our method works well on not only nested NER but also flat NER. Next, we compare with methods that can handle only flat NER. Table 9 shows that our method is comparable to the standard BiLSTM-CRF models (Lample et al., 2016; Ma and Hovy, 2016) on CoNLL-2003. Note, however, that the experimental settings of these previous studies differ from ours; for example, different word embeddings are used and the hidden sizes of the LSTMs are not aligned.

Nevertheless, we can directly compare our proposed model to the ablated model. As shown in Table 9, there is a significant gap (p < 0.005 with the permutation test) between the two scores, 91.14(±0.04)% and 90.84(±0.10)%. Analyzing this gap in detail, we find that our proposed model performs well especially in cases where it is difficult to decide whether an inner span or an outer span is the suitable one. Consider the following sentence: “An assessment group made up of the State Council’s Port Office, the Civil Aviation Administration of China, the General Administration of Customs and other authorities had granted the airport permission to handle foreign aircraft, Xinhua said .” In the CoNLL-2003 dataset, the four spans “State Council”, “Civil Aviation Administration of China”, “General Administration of Customs”, and “Xinhua” are annotated as ORG (Organization). Both models correctly detect the latter three entities in most trials, but the ablated model tends to extract “State Council ’s Port Office” instead of “State Council”. On the other hand, our proposed model tends to extract both “State Council ’s Port Office” and “State Council”. “State Council ’s Port Office” is indeed a false positive, but our model detects the correct entity span “State Council” more reliably than the ablated model. Thus, our proposed model achieves the higher F1-score.
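For readers unfamiliar with the permutation test mentioned above, the following is a generic sketch of an exact two-sample permutation test on the difference of means. It is not the script used for the reported p-value; the function name `permutation_test` is ours, and the example scores in the usage section are made-up numbers chosen only to exercise the code.

```python
from itertools import combinations
from typing import Sequence


def permutation_test(a: Sequence[float], b: Sequence[float]) -> float:
    """Exact two-sided permutation test on the difference of group means.

    Returns the fraction of relabelings whose absolute mean difference is
    at least as large as the observed one.
    """
    pooled = list(a) + list(b)
    n_a = len(a)

    def mean_diff(idx_a) -> float:
        chosen = set(idx_a)
        group_a = [pooled[i] for i in idx_a]
        group_b = [x for i, x in enumerate(pooled) if i not in chosen]
        return sum(group_a) / len(group_a) - sum(group_b) / len(group_b)

    observed = abs(mean_diff(range(n_a)))
    total = extreme = 0
    # Enumerate every way of assigning n_a of the pooled scores to group a.
    for idx_a in combinations(range(len(pooled)), n_a):
        total += 1
        if abs(mean_diff(idx_a)) >= observed - 1e-12:  # small tolerance for float ties
            extreme += 1
    return extreme / total


if __name__ == "__main__":
    # Made-up scores for two hypothetical systems, not results from this paper.
    system_a = [5.0, 5.2, 5.1, 5.3]
    system_b = [4.0, 4.1, 3.9, 4.2]
    print(permutation_test(system_a, system_b))
```

Because the test enumerates all relabelings, the smallest attainable p-value depends on the number of trials per system, so more trials are needed to certify smaller p-values.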

Table 9: Comparison on CoNLL-2003. We group methods into two types. The first group consists of the methods that do not use any contextual word embeddings. The second one consists of the methods that use contextual word embeddings such as BERT and FLAIR. “†” indicates the methods using POS tags. “‡” indicates the methods not designed to extract nested entities.

| Method | F1 (%) |
|---|---|
| Wang and Lu (2018) | 90.5 |
| Straková et al. (2019) | 90.77 |
| This work | 91.14 ± 0.04 |
| | |
| Lample et al. (2016) | 90.94 |
| Ma and Hovy (2016) | 91.21 |
| Liu et al. (2019) | 91.96 ± 0.04 |
| This work − L&D | 90.84 ± 0.10 |
| | |
| Devlin et al. (2019) | 92.80 |
| Akbik et al. (2018) | 93.09 ± 0.12 |
| Liu et al. (2019) | 93.47 ± 0.03 |
| Jiang et al. (2019) | 93.47 |
| Baevski et al. (2019) | 93.5 |

Recently, Liu et al. (2019) proposed a new architecture for sequence labeling, which can capture global information at the sentence level better than BiLSTM, and reported an F1-score of 91.96% when using conventional word embeddings (93.47% when using BERT). It is true that our model based on BiLSTM does not perform as well as their model, but our decoder can be combined with their proposed encoder. We leave it for future work.

5 Related Work

Alex et al. (2007) proposed several ways to combine multiple CRFs for such tasks. They found that cascading separate CRFs for each entity type, with the output of the previous CRF used as input features of the current CRF, yielded the best performance. However, their method could not handle nested entities of the same entity type. In contrast, Ju et al. (2018) dynamically stacked multiple layers that recognize entities sequentially from innermost ones to outermost ones. Their method can deal with nested entities of the same entity type.

Finkel and Manning (2009) proposed a CRF-based constituency parser for this task such that each named entity is a node in the parse tree. However, its time complexity is cubic in the length of a given sentence, so it does not scale to large datasets involving long sentences. Later on, Wang et al. (2018) proposed a scalable transition-based approach that builds a constituency forest (a collection of constituency trees). Its time complexity is linear in the sentence length.

Lu and Roth (2015) introduced a mention hypergraph representation for capturing nested entities as well as crossing entities (two entities overlap but neither is contained in the other). One issue in their approach is the spurious structures of the representation. Muis and Lu (2017) incorporated mention separators to address the spurious structures issue, but their approach still suffers from the structural ambiguity issue. Wang and Lu (2018) proposed a hypergraph representation free of structural ambiguity. However, they introduced a hyperparameter, the maximal length of an entity, to reduce the time complexity. Setting this hyperparameter to a small number speeds up the model but causes it to ignore longer entity segments.

Katiyar and Cardie (2018) proposed another hypergraph-based approach that learns the structure using an LSTM network in a greedy manner. However, their method has a hyperparameter that sets a threshold for selecting multiple candidate mentions. It must be carefully tuned for adjusting the trade-off between recall and precision.

Sohrab and Miwa (2018) proposed a neural exhaustive model that enumerates all possible spans as potential entity mentions and classifies them. However, they also use the maximal-length hyperparameter to reduce time complexity.
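The role of a maximal-length hyperparameter in exhaustive span-based approaches can be seen from a generic span enumerator such as the one below. This is an illustration only, not the implementation of Sohrab and Miwa (2018); the function name `enumerate_spans` and the `(start, end)` representation with an exclusive end are assumptions made for the example.

```python
from typing import List, Optional, Tuple


def enumerate_spans(n_tokens: int, max_len: Optional[int] = None) -> List[Tuple[int, int]]:
    """List all candidate spans (start, end) of a sentence, end exclusive.

    Without `max_len` there are n(n+1)/2 candidates; with it, the count grows
    only linearly in the sentence length, but longer spans are never proposed.
    """
    limit = n_tokens if max_len is None else min(max_len, n_tokens)
    return [(start, start + length)
            for start in range(n_tokens)
            for length in range(1, limit + 1)
            if start + length <= n_tokens]


if __name__ == "__main__":
    print(len(enumerate_spans(5)))             # all spans of a 5-token sentence
    print(len(enumerate_spans(5, max_len=2)))  # only spans of at most 2 tokens
```

Any entity longer than `max_len` tokens is absent from the candidate list and therefore cannot be recognized, which is the trade-off this kind of hyperparameter introduces.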

Fisher and Vlachos (2019) proposed a novel neural network architecture that merges tokens or entities into entities forming nested structures and then labels each of them. Their architecture, however, needs the maximal nesting level hyperparameter. Lin et al. (2019) proposed a sequence-to-nuggets architecture that first identifies anchor words of all mentions and then recognizes the mention boundaries for each anchor word. Their method also uses the maximal-length hyperparameter to reduce time complexity.

Straková et al. (2019) proposed an encoding algorithm to allow the modeling of multiple named entity labels in a linearized scheme and proposed a neural model that predicts sequential labels for each token. Zheng et al. (2019) proposed a method that can detect entity boundaries with sequence labeling models. These two methods do not require special hyperparameters. In contrast to our method, they can also deal with crossing entities as well as nested entities, but our experiments demonstrate that our method can perform well because crossing entities are very rare (Lu and Roth, 2015; Wang et al., 2018).

6 Conclusion

We propose learning and decoding methods for extracting nested entities. Our decoding method iteratively recognizes entities from outermost ones to inner ones in an outside-to-inside way. It recursively searches the span of each extracted entity for nested entities with second-best sequence decoding. We also design a training objective under which the tag sequence for nested entities is the second-best path within the span of their parent entity, as our decoding algorithm requires. Our method has no hyperparameters beyond those of conventional CRF-based models. Our method achieves 85.82%, 84.34%, and 77.36% F1-scores on ACE-2004, ACE-2005, and GENIA datasets, respectively.

For future work, one interesting direction is joint modeling of NER with entity linking or coreference resolution. Previous studies (Durrett and Klein, 2014; Luo et al., 2015; Nguyen et al., 2016; Martins et al., 2019) demonstrated that leveraging the mutual dependency of the NER, linking, and coreference tasks could improve performance on each task. We would like to address this issue while taking nested entities into account.

Acknowledgments

We thank Aldrian Obaja Muis for helpful comments, and many anonymous reviewers and the action editor for helpful feedback on various drafts of the paper. We are also grateful to Jana Straková for sharing experimental results. Eduard Hovy was supported in part by DARPA grant FA8750-18-2-0018 funded under the AIDA program.

Notes

3

Our usage of inside/outside is different from the inside-outside algorithm in dynamic programming.

4

Without our restriction on the transition matrix of the CRF, we would have to track both the best path and the 2nd best path. Besides, if a single CRF were used for all entity types, the decoder could not always narrow down spans with the 2nd best path. The 2nd best path in a single CRF could result in the same span tagged with a different entity type, so we would have to track lower-ranked paths as well.

5

We do not need to recursively decode the span of each extracted single-token entity because a single-token entity cannot contain another entity of the same entity type.

11

Note that in ACE-2005, Ju et al. (2018) did their experiments with a different split from Lu and Roth (2015) that we follow.

12

Wang et al. (2018) did not report precision and recall scores. Instead of Wang et al. (2018), Wang and Lu (2018) reported the scores for the model of Wang et al. (2018).

13

Straková et al. (2019) did not report precision and recall scores in their paper. We requested this information from the authors, and they provided their score data.

14

For a fair comparison with our method, we do not use POS tags as input features.

15

Wang et al. (2018) did not report precision and recall scores. Instead of Wang et al. (2018), Wang and Lu (2018) reported the scores for the model of Wang et al. (2018).

16

Straková et al. (2019) did not report precision and recall scores in their paper. We requested this information from the authors, and they provided their score data.

References

Alan Akbik, Tanja Bergmann, Duncan Blythe, Kashif Rasul, Stefan Schweter, and Roland Vollgraf. 2019. FLAIR: An easy-to-use framework for state-of-the-art NLP. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 54–59, Minneapolis, Minnesota. Association for Computational Linguistics.

Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1638–1649, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Beatrice Alex, Barry Haddow, and Claire Grover. 2007. Recognising nested named entities in biomedical text. In Biological, Translational, and Clinical Language Processing, pages 65–72, Prague, Czech Republic. Association for Computational Linguistics.

Alexei Baevski, Sergey Edunov, Yinhan Liu, Luke Zettlemoyer, and Michael Auli. 2019. Cloze-driven pretraining of self-attention networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5360–5369, Hong Kong, China. Association for Computational Linguistics.

Kate Byrne. 2007. Nested named entity recognition in historical archive text. In International Conference on Semantic Computing (ICSC 2007), pages 589–596.

Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. 2007. Learning to rank: From pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning, pages 129–136.

Billy Chiu, Gamal Crichton, Anna Korhonen, and Sampo Pyysalo. 2016. How to train good word embeddings for biomedical NLP. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing, pages 166–174, Berlin, Germany. Association for Computational Linguistics.

Jason P. C. Chiu and Eric Nichols. 2016. Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics, 4:357–370.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

George Doddington, Alexis Mitchell, Mark Przybocki, Lance Ramshaw, Stephanie Strassel, and Ralph Weischedel. 2004. The automatic content extraction (ACE) program – tasks, data, and evaluation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04), Lisbon, Portugal. European Language Resources Association (ELRA).

Greg Durrett and Dan Klein. 2014. A joint model for entity analysis: Coreference, typing, and linking. Transactions of the Association for Computational Linguistics, 2:477–490.

Jenny Rose Finkel and Christopher D. Manning. 2009. Nested named entity recognition. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 141–150, Singapore. Association for Computational Linguistics.

Joseph Fisher and Andreas Vlachos. 2019. Merge and label: A novel neural network architecture for nested NER. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5840–5850, Florence, Italy. Association for Computational Linguistics.

Zhiheng Huang, Yi Chang, Bo Long, Jean-Francois Crespo, Anlei Dong, Sathiya Keerthi, and Su-Lin Wu. 2012. Iterative Viterbi A* algorithm for k-best sequential decoding. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 611–619, Jeju Island, Korea. Association for Computational Linguistics.

Yufan Jiang, Chi Hu, Tong Xiao, Chunliang Zhang, and Jingbo Zhu. 2019. Improved differentiable architecture search for language modeling and named entity recognition. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3585–3590, Hong Kong, China. Association for Computational Linguistics.

Meizhi Ju, Makoto Miwa, and Sophia Ananiadou. 2018. A neural layered model for nested named entity recognition. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1446–1459, New Orleans, Louisiana. Association for Computational Linguistics.

Nobuhiro Kaji, Yasuhiro Fujiwara, Naoki Yoshinaga, and Masaru Kitsuregawa. 2010. Efficient staggered decoding for sequence labeling. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 485–494, Uppsala, Sweden. Association for Computational Linguistics.

Arzoo Katiyar and Claire Cardie. 2018. Nested named entity recognition revisited. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 861–871, New Orleans, Louisiana. Association for Computational Linguistics.

J.-D. Kim, T. Ohta, Y. Tateisi, and J. Tsujii. 2003. GENIA corpus—a semantically annotated corpus for bio-textmining. Bioinformatics, 19(Suppl_1):i180–i182.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 282–289.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 260–270, San Diego, California. Association for Computational Linguistics.

Hongyu Lin, Yaojie Lu, Xianpei Han, and Le Sun. 2019. Sequence-to-nuggets: Nested entity mention detection via anchor-region networks. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5182–5192, Florence, Italy. Association for Computational Linguistics.

Yijin Liu, Fandong Meng, Jinchao Zhang, Jinan Xu, Yufeng Chen, and Jie Zhou. 2019. GCDT: A global context enhanced deep transition architecture for sequence labeling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2431–2441, Florence, Italy. Association for Computational Linguistics.

Wei Lu and Dan Roth. 2015. Joint mention extraction and classification with mention hypergraphs. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 857–867, Lisbon, Portugal. Association for Computational Linguistics.

Gang Luo, Xiaojiang Huang, Chin-Yew Lin, and Zaiqing Nie. 2015. Joint entity recognition and disambiguation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 879–888, Lisbon, Portugal. Association for Computational Linguistics.

Liangchen Luo, Yuanhao Xiong, Yan Liu, and Xu Sun. 2019. Adaptive gradient methods with dynamic bound of learning rate. CoRR, abs/1902.09843. Version 1.

Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1064–1074, Berlin, Germany. Association for Computational Linguistics.

Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60, Baltimore, Maryland. Association for Computational Linguistics.

Pedro Henrique Martins, Zita Marinho, and André F. T. Martins. 2019. Joint learning of named entity recognition and entity linking. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pages 190–196, Florence, Italy. Association for Computational Linguistics.

Aldrian Obaja Muis and Wei Lu. 2017. Labeling gaps between words: Recognizing overlapping mentions with mention separators. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2608–2618, Copenhagen, Denmark. Association for Computational Linguistics.

David Nadeau and Satoshi Sekine. 2007. A survey of named entity recognition and classification. Lingvisticæ Investigationes, 30(1):3–26.

Dat Ba Nguyen, Martin Theobald, and Gerhard Weikum. 2016. J-NERD: Joint named entity recognition and disambiguation with rich linguistic features. Transactions of the Association for Computational Linguistics, 4:215–229.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.

Nils Reimers and Iryna Gurevych. 2017. Reporting score distributions makes a difference: Performance study of LSTM-networks for sequence tagging. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 338–348, Copenhagen, Denmark. Association for Computational Linguistics.

Nambirajan Seshadri and Carl-Erik W. Sundberg. 1994. List Viterbi decoding algorithms with applications. IEEE Transactions on Communications, 42(234):313–323.

Mohammad Golam Sohrab and Makoto Miwa. 2018. Deep exhaustive model for nested named entity recognition. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2843–2849, Brussels, Belgium. Association for Computational Linguistics.

Frank K. Soong and Eng-Fong Huang. 1990. A Tree.Trellis based fast search for finding the n best sentence hypotheses in continuous speech recognition. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24–27, 1990.

Jana Straková, Milan Straka, and Jan Hajic. 2019. Neural architectures for nested NER through linearization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5326–5331, Florence, Italy. Association for Computational Linguistics.

Emma Strubell, Patrick Verga, David Belanger, and Andrew McCallum. 2017. Fast and accurate entity recognition with iterated dilated convolutions. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2670–2680, Copenhagen, Denmark. Association for Computational Linguistics.

Yuka Tateisi and Jun-ichi Tsujii. 2004. Part-of-speech annotation of biology research abstracts. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04), Lisbon, Portugal. European Language Resources Association (ELRA).

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142–147.

Bailin Wang and Wei Lu. 2018. Neural segmental hypergraphs for overlapping mention recognition. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 204–214, Brussels, Belgium. Association for Computational Linguistics.

Bailin Wang, Wei Lu, Yu Wang, and Hongxia Jin. 2018. A neural transition-based model for nested mention recognition. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1011–1017, Brussels, Belgium. Association for Computational Linguistics.

Yefeng Wang. 2009. Annotating and recognising named entities in clinical notes. In Proceedings of the ACL-IJCNLP 2009 Student Research Workshop, pages 18–26, Suntec, Singapore. Association for Computational Linguistics.

Changmeng Zheng, Yi Cai, Jingyun Xu, Ho-fung Leung, and Guandong Xu. 2019. A boundary-aware neural model for nested named entity recognition. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 357–366, Hong Kong, China. Association for Computational Linguistics.

This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. For a full description of the license, please visit https://creativecommons.org/licenses/by/4.0/legalcode