Abstract
Named Entity Recognition (NER) has so far evolved from traditional flat NER to overlapped and discontinuous NER. These tasks have mostly been solved separately, with only a few exceptions that tackle all three concurrently with a single model. The current best-performing method formalizes unified NER as word-word relation classification, which barely focuses on mention content learning and fails to detect entity mentions comprising a single word. In this paper, we propose a two-stage span-based framework with templates, namely, T2-NER, to resolve the unified NER task. The first stage extracts entity spans, where flat and overlapped entities can be recognized. The second stage classifies over all entity span pairs, where discontinuous entities can be recognized. Finally, multi-task learning is used to jointly train the two stages. To improve the efficiency of the span-based model, we design grouped templates and typed templates for the two stages to realize batch computations. We also apply an adjacent packing strategy and a latter packing strategy to model discriminative boundary information and learn better span (pair) representations. Moreover, we introduce syntax information to enhance our span representation. We perform extensive experiments on eight benchmark datasets for flat, overlapped, and discontinuous NER, where our model beats all the current competitive baselines, obtaining the best performance on unified NER.
1 Introduction
Named entity recognition (NER) is the task of recognizing mentions that represent entities in text. It has been a fundamental task in natural language processing (NLP), due to its wide application in various knowledge-based tasks like entity linking and data mining (Le and Titov, 2018; Cao et al., 2019).
Research on NER has evolved from flat NER (Sang and Meulder, 2003; Pradhan et al., 2013b), to overlapped NER (Doddington et al., 2004; Walker et al., 2006), and recently to discontinuous NER (Karimi et al., 2015; Pradhan et al., 2013a). As shown in Figure 1, flat NER simply detects the entity mentions and their types, while the problems of overlapped and discontinuous NER are more complicated, i.e., overlapped entities contain overlapping fragments, and discontinuous entities may contain several nonadjacent fragments. Unified NER refers to recognizing all types of named entities in the input text, regardless of whether they are flat, overlapping, or discontinuous.
Many methods have been developed to solve the three NER tasks (Lu and Roth, 2015; Wang and Lu, 2018; Ju et al., 2018; Wang et al., 2018). The majority of them focus on flat and overlapped NER, with only a few exceptions that center on unified NER. Yan et al. (2021) adopt a generative method to obtain position indexes of entity spans. Yet generative models potentially suffer from the exposure bias issue. Li et al. (2022) achieve the current best performance. They use convolutional neural networks to obtain two types of word pair relations, and fill them into the upper and lower triangular regions of a word-pair grid, respectively, for decoding. However, their method decodes entities word by word, without integrally modeling the entity content, which is important for entity classification. It also fails to detect one-word entities as they are only assigned one relation in the grid diagonal. Moreover, it needs to classify over all word pairs, including redundant non-entity ones. Therefore, designing an effective unified NER model is still challenging.
By contrast, span-based methods (Luan et al., 2019; Sohrab and Miwa, 2018) directly model the span content for entity classification and naturally recognize one-word entities. In light of this, we investigate an alternative unified NER formalism with a two-stage span-based framework. This framework resolves unified NER by modeling it as span pair relation classification. Such relations are pivotal for recognizing overlapped and discontinuous entities, as they describe the semantic relations between entity fragments. To illustrate: In Figure 1(d), in order to recognize the discontinuous entity “aching in shoulders”, effectively capturing the discontinuous relation between the spans “aching in” and “shoulders” is indispensable.
Specifically, the proposed framework works as follows. In the first stage, Span Extraction classifies the enumerated spans to find all entity spans, which are defined as text spans that either form entity mentions on their own or present as fragments of discontinuous entity mentions. In the second stage, Span Pair Classification classifies entity span pair relation to merge spans into entities. We define two types of relations for this goal: Next-Fragment and Overlapped, which are used for discontinuous and overlapped mentions respectively, as shown in Figure 1(d). We adopt multi-task learning to jointly train these two stages. This framework naturally solves the problems in Li et al. (2022).
However, it is still a toy framework: we find three key issues that, once addressed, may greatly facilitate unified NER.
Improving the efficiency: This toy framework needs to classify over all candidate spans and entity span pairs, which inevitably suffers from the inefficiency issue and considerable model complexity.
Modeling the discriminative span boundary: Span boundary is essential to discriminate different entity spans. However, this toy framework uses embeddings learned from span content to classify spans, which is inadequate in learning from other spans to acquire discriminative span boundary information.
Learning the discriminative span pair representation: Span pair representation directly affects the results of span pair classification. However, this toy framework reuses representations of candidate span pairs from the first stage for relation classification, failing to learn from other pairs to get discriminative pair representation.
Based on the above observations, we introduce the full model T2-NER, which addresses these issues from the following aspects:
To speed up the inference process, we are inspired by Zhong and Chen (2021) and propose to equip the toy model with templates, which are packs of markers highlighting the corresponding spans. Templates enable the model to re-use the computations of text tokens and realize an efficient batch computation.
To model the discriminative span boundary, we design an adjacent packing strategy in the first stage. This strategy integrally models adjacent spans (i.e., spans with the same start tokens) by packing adjacent markers into a training instance. As the model learns from adjacent spans, more precise span boundary information can be learned.
To learn the discriminative span pair representation, we design a latter packing strategy in the second stage. This strategy integrally models the interrelation between span pairs by packing the former markers with the related latter ones into a training instance. It enables the model to compare same-former spans to learn discriminative span pair representation. Moreover, we utilize syntax information to enhance the span (pair) representation.
Our main contributions are as follows:
We identify the deficiencies of existing NER methods in solving the unified NER task, and propose a new solution, T2-NER, to boost performance. This is done by (1) modeling unified NER as span pair relation classification, and offering a two-stage span-based framework to realize it; (2) exploiting templates to accelerate computation; and (3) designing an adjacent packing strategy and a latter packing strategy to obtain better span (pair) representations.
We empirically evaluate our proposal on flat, overlapped, and discontinuous NER tasks against 13 baselines. The comparative results demonstrate the superiority of T2-NER.
2 Related Work
Sequential Labeling-based Methods
These solve NER tasks through various tagging schemes. These studies usually use neural models such as CNN (Collobert et al., 2011) and Transformer (Yan et al., 2019) for representation, followed by a CRF layer (Lafferty et al., 2001) for classification. However, it is hard for them to directly detect overlapped or discontinuous entity mentions. Shibuya and Hovy (2020) try to decode the tags in a layered manner for overlapped entity mentions. Tang et al. (2018) adopt a BIOHD tagging scheme to resolve the discontinuous NER task. Despite the fact that sequence labeling is reconciled with various NER tasks, it fails to solve these tasks with a unified scheme.
Hypergraph-based Methods These are first introduced into the NER task in Lu and Roth (2015), where they construct hypergraphs by the structure of overlapped mentions and exponentially represent them. Muis and Lu (2016) further explore the application of hypergraphs on discontinuous NER. Wang and Lu (2018) utilize deep neural networks to enhance the hypergraph representation, and decode overlapped entity mentions with hypergraphs. Although hypergraphs can represent all types of entity mentions, they require careful manual design of graph nodes and edges to avoid the structural ambiguity issue. Moreover, these models gradually generate graphs along the words, which may lead to error propagation issue.
Transition-based Methods These are first proposed for nested NER in Wang et al. (2018). They design transition actions, and maintain a stack as well as a buffer to store entities, enabling the representation of nested mentions. Follow-up work includes Dai et al. (2020), which further extends this model for discontinuous NER, through using multiplicative attention to capture the discontinuous dependency. However, these methods need manual intervention for the design of transition actions. And they also face the error propagation issue as transitions are conducted word by word along the sentence.
Generative Methods These adopt generative models such as BART (Lewis et al., 2020) and pointer networks to directly get structured results. Cui et al. (2021) design templates and employ BART for pre-training and fine-tuning on templates to obtain entity types of given spans. Fei et al. (2021) use a generative model with pointer network for discontinuous NER, directly getting a list of entity mentions. Yan et al. (2021) solve the unified NER with BART and pointer network, aiming to generate a sequence of entity start-end indexes and types. Generative methods unavoidably face the decoding efficiency issue as well as the exposure bias issue.
Span-based Methods Recently, NER has been frequently formulated as a span enumeration and classification task. To overcome the enumeration inefficiency issue, Sohrab and Miwa (2018) propose to set the maximum span length. Li et al. (2020) convert NER to a machine reading comprehension (MRC) task and extract entity spans with a MRC model. Shen et al. (2021) design a filter and a regressor to select span proposals. Li et al. (2021) formalize discontinuous NER as a subgraph finding task. Yu et al. (2020) use dependent embeddings as input to a multi-layer BiLSTM and a biaffine model to score spans in a sentence. Although the span-based framework has the innate ability to cope with the overlapped NER, the above methods are subject to enumeration nature, failing to balance the performance and efficiency well.
Other Methods Wang et al. (2021) formulate discontinuous NER as a task of discovering maximal cliques in a segment graph. Li et al. (2022) model unified NER as word-word relation classification. They achieve the current state-of-the-art (SOTA) performance on unified NER. However, this work fails to learn the mention content integrally and suffers from the efficiency issue as it needs to classify over all word pairs, including redundant ones.
To sum up, existing work mainly focuses on flat and overlapped NER, and only a few studies seek to resolve unified NER (Yan et al., 2021; Li et al., 2022). We adapt the span-based framework to unified NER by formalizing it as span pair relation classification. Our model substantially avoids the drawbacks of previous baselines by effectively modeling span boundaries and learning better span pair representations.
3 Preliminaries
In this section we first introduce the definition of entity span pair relation and the marker. Then we formalize the unified NER problem.
Definition 1 (Entity Span Pair Relation). We define the following three kinds of entity span pair relation and we also give an example as demonstrated in Figure 1(d) for better understanding.
OTHER, indicating that the entity span pair has an other relation or does not have any relation defined in this paper.
Next-Fragment, indicating that two entity spans belong to a single entity mention and they are successive.
Overlapped, indicating that two entity spans are overlapped.
Definition 2 (Span Marker). We explicitly insert a pair of span markers before and after the candidate span to highlight it. We define those of the ith span as <Mi > and </Mi >.
Definition 3 (Span Pair Marker). We explicitly insert two pairs of typed span markers before and after two candidate spans to highlight them. We define these markers as <F : ex >, </F : ex > and <Li : ei >, </Li : ei >, where ex and ei are entity types drawn from a pre-defined entity type set.
Problem 1 (T2-NER: Unified NER as a Two-Stage Span-Based Framework). Unified NER can be formalized as follows: Given an input text of N tokens X = {x1, x2,…, xN}, the first stage aims to perform span extraction, i.e., representing and classifying the span markers to detect all entity spans in X of length up to L, e.g., s(a, b) = {xa, xa+1,…, xb}, together with the entity type of each span. After this stage, both flat and overlapped entity spans can be recognized.
Then, the second stage aims to perform span pair classification by taking s(a, b) and s(c, d) as input and utilizing span pair markers to find their relation, which is drawn from a predefined set comprising Next-Fragment, Overlapped, and OTHER. Through this stage, discontinuous entity mentions can be detected. In addition, flat and overlapped entity mentions are double-checked by recognizing the Overlapped and OTHER relations between span pairs.
4 T2-NER Model
As shown in Figure 2, T2-NER contains four main components, which are elaborated by the following subsections.
4.1 Span Extraction with Grouped Templates
Span extraction aims to find all text spans and determine whether these spans are entity spans. Different from prior span-based work, which simply follows the line of enumerating and classifying (Lee et al., 2017; Luan et al., 2019), we resort to constructing templates for each input text to realize an approximate operation, thus accelerating the inference process. In form, these templates are composed of span markers.
Specifically, given an input text with N tokens, X = {x1, x2,…, xN}, and a maximum span length L, we first enumerate all text spans in X, obtaining a candidate span set as S(X) = {s(1,1),…, s(1, L),…, s(N−L +1, N),…, s(N, N)}.
Training and running inference on every candidate span separately is computationally expensive (Luan et al., 2019). To alleviate this issue, we first pack these spans into multiple instances. To fully utilize the boundary characteristics of spans, we propose an adjacent packing strategy. It clusters adjacent spans with the same start token, in order, into a group. For instance, we cluster the spans {s(1,1), s(1,2),…, s(1, L)} into the group S1. As shown in Figure 2(a), for the sample sentence “aching in legs and shoulders” with a maximum span length of 5, the group S1 = {aching, aching in, aching in legs, aching in legs and, aching in legs and shoulders}.
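The enumeration and adjacent packing steps can be sketched as follows. This is a minimal illustration rather than the authors' implementation; the function names `enumerate_spans` and `adjacent_packing` are hypothetical, and spans are written as 1-indexed, inclusive `(start, end)` pairs to match the s(a, b) notation.

```python
from collections import defaultdict


def enumerate_spans(n_tokens, max_len):
    """List every (start, end) span, 1-indexed and inclusive, with at most max_len tokens."""
    return [
        (a, b)
        for a in range(1, n_tokens + 1)
        for b in range(a, min(a + max_len - 1, n_tokens) + 1)
    ]


def adjacent_packing(spans):
    """Group spans that share a start token, ordered by end token within each group."""
    groups = defaultdict(list)
    for a, b in sorted(spans):
        groups[a].append((a, b))
    return dict(groups)


# Sample sentence from the text, with maximum span length 5.
tokens = "aching in legs and shoulders".split()
groups = adjacent_packing(enumerate_spans(len(tokens), max_len=5))
first_group = [" ".join(tokens[a - 1:b]) for a, b in groups[1]]
```

Here `first_group` holds the surface strings of the group S1 for the sample sentence, and each later group shrinks because spans cannot run past the end of the text.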
Then we construct a template for each group, which is the sequential concatenation of the marker pairs of all spans in this group. Specifically, for the candidate span si, we make the corresponding marker pair share the position embeddings of the start and end tokens of this span, i.e., p(<Mi >) := p(xstart(i)) and p(</Mi >) := p(xend(i)). In this way, the position embeddings of the original tokens remain unchanged after the template insertion. As shown in Figure 2(a), the template for group S1 is <M1 > </M1 > <M2 > </M2 > <M3 > </M3 > <M4 > </M4 > <M5 > </M5 >.
Finally, we separately append each template to the input text and feed the sequence into a pretrained BERT module. In order to re-use the representations of text tokens, we harness a directional attention mask matrix in the attention layer (Zhong and Chen, 2021). Specifically, the text tokens attend only to text tokens and not to span markers, while a span marker can attend to all the text tokens and to its partner marker associated with the same span. As a result, we can process the groups in separate runs while batching all spans of a group in a single run.
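As a rough sketch of how such a template and its shared position ids could be assembled, consider the snippet below. The helper name `build_group_template` and the choice of plain integer position ids (1..N for the text tokens) are assumptions made for illustration only.

```python
def build_group_template(group, n_tokens):
    """Build marker tokens and position ids for one start-token group.

    group: list of (start, end) spans, 1-indexed and inclusive.
    Text tokens keep positions 1..n_tokens; each opening marker copies the
    position of its span's start token and each closing marker copies the
    position of its span's end token.
    """
    markers, marker_positions = [], []
    for i, (start, end) in enumerate(group, 1):
        markers += [f"<M{i}>", f"</M{i}>"]
        marker_positions += [start, end]
    position_ids = list(range(1, n_tokens + 1)) + marker_positions
    return markers, position_ids


# Group S1 of "aching in legs and shoulders" with maximum span length 5.
group_s1 = [(1, b) for b in range(1, 6)]
markers, position_ids = build_group_template(group_s1, n_tokens=5)
```

Because the markers only reuse existing positions, the position ids of the five text tokens are the same with or without the appended template.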
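The directional attention mask described above can be illustrated with a small boolean matrix. This is a sketch under stated assumptions: the function name is hypothetical, the sequence layout is taken to be N text tokens followed by consecutive (opening, closing) marker pairs, and each marker is additionally allowed to attend to itself, which the text does not state explicitly.

```python
def directional_attention_mask(n_tokens, n_spans):
    """Return mask[q][k] == True when query position q may attend to key position k.

    Assumed layout: n_tokens text tokens, then one (open, close) marker pair per span.
    """
    size = n_tokens + 2 * n_spans
    mask = [[False] * size for _ in range(size)]
    # Every position (text token or marker) can attend to all text tokens.
    for q in range(size):
        for k in range(n_tokens):
            mask[q][k] = True
    # A marker can also attend to itself and to its partner marker (assumption: self included).
    for s in range(n_spans):
        open_idx = n_tokens + 2 * s
        close_idx = open_idx + 1
        for q in (open_idx, close_idx):
            mask[q][open_idx] = True
            mask[q][close_idx] = True
    return mask


mask = directional_attention_mask(n_tokens=3, n_spans=2)
```

Since no text token ever attends to a marker, the text-token representations do not depend on which template is appended, which is what allows them to be reused across groups.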
4.2 Span Pair Classification with Typed Templates
Span pair classification takes candidate span pairs as input and determines their relations. Prior work simply re-uses and shares span representations for classifying span pair relations (Luan et al., 2019; Li et al., 2021), failing to learn discriminative representations of different span pairs. Considering this, we propose to compensate by constructing typed templates for each input sequence and learning span pair representations from these typed templates.
Similarly, we propose to pack all candidate span pairs into multiple groups. To learn the discriminant span pair representation, we propose a latter packing strategy. As shown in Figure 2(b), it clusters span pairs with the same former span into a group. This strategy enables an integral modeling for the same-former spans. Thus, these same-former spans can be compared to obtain discriminative representations.
Specifically, given all of the recognized entity spans in the input text, we sequentially take one of them as the former span, and the others that appear after the former one as the latter spans. Then, we cluster the former span s(a, b) and the corresponding latter spans into a group. As shown in Figure 2(b), the former span “aching in” and the latter spans {“aching in legs”, “shoulders”} are packed into a group.
We then construct a typed template for each group, which is the sequential concatenation of the span pair markers of all pairs in this group. Specifically, for each group, we keep the markers <F : ex >, </F : ex > of the former span s(a, b) fixed before and after it in the original sentence. We then concatenate the marker pairs of the latter spans, i.e., <L1 : e1 > </L1 : e1 > <L2 : e2 > </L2 : e2 >… <Lm : em > </Lm : em >. These marker pairs also share the position embeddings of the start and end tokens of the corresponding entity spans. The typed template is composed of the fixed markers and the concatenated markers. We argue that these particular templates can capture the dependencies between candidate span pairs, bridging the gap left by prior work, which only captures contextual information around each individual candidate span.
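A minimal sketch of the latter packing step is given below. It assumes that "appearing after" means coming later when spans are sorted by (start, end), which is consistent with the Figure 2(b) example but is not spelled out in the text; the function name and the span offsets are illustrative.

```python
def latter_packing(entity_spans):
    """Map each former span to the list of spans that come after it.

    entity_spans: iterable of (start, end) pairs, 1-indexed and inclusive.
    Assumption: spans are ordered by (start, end); the last span has no latter
    spans and therefore forms no group.
    """
    ordered = sorted(entity_spans)
    return {former: ordered[i + 1:] for i, former in enumerate(ordered[:-1])}


# Entity spans of "aching in legs and shoulders":
# "aching in" = (1, 2), "aching in legs" = (1, 3), "shoulders" = (5, 5).
pair_groups = latter_packing([(5, 5), (1, 2), (1, 3)])
```

For the sample sentence, the group of the former span (1, 2) contains (1, 3) and (5, 5), mirroring the group described for Figure 2(b).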
Then, we append each typed template to the input text. The whole sequence is denoted as:
x1,…, <F : ex >, xa,…, xb, </F : ex >,…, xN, <L1 : e1 >, </L1 : e1 >,…, <Lm : em >, </Lm : em >.
Correspondingly, the appended sequence for the sample sentence in Figure 2(b) is <F:Dis> aching in </F:Dis> legs and shoulders <L1:Dis> </L1:Dis> <L2:Dis> </L2:Dis>.
Moreover, we introduce a reverse setting from the latter to the former for a bidirectional prediction. For the candidate span pair s(a, b), s(c, d), we follow the above operation, obtaining the representations of the two former markers and the latter marker representations xc−1 and xd+1. These embeddings are added in (3) to get the final span pair representation. We argue that this reverse setting also provides supplemental information through the integral modeling of same-latter spans.
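The construction of this appended sequence can be sketched as follows. The helper `build_typed_sequence` is hypothetical; it only assembles the token sequence and leaves out the shared position embeddings of the latter markers.

```python
def build_typed_sequence(tokens, former, latters):
    """Insert the former markers around the former span and append the latter markers.

    former:  (start, end, entity_type), 1-indexed and inclusive.
    latters: list of (start, end, entity_type) for the latter spans.
    """
    a, b, e_x = former
    sequence = (
        tokens[:a - 1]
        + [f"<F:{e_x}>"] + tokens[a - 1:b] + [f"</F:{e_x}>"]
        + tokens[b:]
    )
    # Latter markers are appended after the text rather than inserted into it.
    for i, (_, _, e_i) in enumerate(latters, 1):
        sequence += [f"<L{i}:{e_i}>", f"</L{i}:{e_i}>"]
    return sequence


tokens = "aching in legs and shoulders".split()
sequence = build_typed_sequence(
    tokens,
    former=(1, 2, "Dis"),
    latters=[(1, 3, "Dis"), (5, 5, "Dis")],
)
```

Joining `sequence` with spaces gives the same string as the Figure 2(b) example, up to whitespace inside the markers.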
4.3 Enhancing Span Representation with Syntax Information
Dependency syntax information is commonly neglected in the unified NER work. It has been explored in flat NER (Finkel and Manning, 2009). In this paper, we use it to enhance our model. Specifically, we harness a dependency parser to transform the appended sequence into an adjacency matrix A, where Aij = 1 indicates that there is a dependency edge going from token xi to token xj, otherwise Aij = 0. Notably, each marker token shares the same syntax information with the corresponding text token.
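One plausible reading of this construction is sketched below: the adjacency matrix is built over text tokens from a list of dependency edges, and each marker copies the row and column behaviour of the text token it is tied to. The function name, the 0-indexed edge format, and the example edges are assumptions for illustration and are not the output of a real parser.

```python
def adjacency_with_markers(n_tokens, edges, marker_anchors):
    """Build A with A[i][j] = 1 when a dependency edge goes from position i to position j.

    edges: set of (head, dependent) pairs over 0-indexed text tokens.
    marker_anchors: for each appended marker, the index of the text token whose
        syntax information it shares (assumption about how markers are tied to tokens).
    """
    anchors = list(range(n_tokens)) + list(marker_anchors)
    edge_set = set(edges)
    size = len(anchors)
    return [
        [1 if (anchors[i], anchors[j]) in edge_set else 0 for j in range(size)]
        for i in range(size)
    ]


# Hypothetical edges for a three-token text: token 1 governs tokens 0 and 2.
# Two markers are appended, anchored to tokens 0 and 1 respectively.
A = adjacency_with_markers(3, {(1, 0), (1, 2)}, marker_anchors=[0, 1])
```

With this layout, a marker anchored to a head token inherits that token's outgoing edges, so the graph encoder sees markers and their anchor tokens in the same syntactic role.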
4.4 Joint Training and Decoding
In the training process, three types of relations are all needed. In the inference process, the flat and overlapped entities would be recognized in the first stage, and the discontinuous entities would be recognized through Next-Fragment relation in the second stage. Thus, we choose Next-Fragment relation for decoding.
The predictions of our model are the entity spans and their relations, which can be viewed as a directed span graph. The decoding objective is to find sub-graphs in which each entity span connects with any other span by the Next-Fragment relation. Each sub-graph corresponds to an entity mention, and entities that are made of more than two spans are also covered in the decoding process. A sub-graph containing a single entity span forms an entity mention by itself. Figure 3 gives cases for the inference and decoding of a sample sentence.
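A simplified version of this decoding step can be sketched with a union-find over the predicted spans. Note the assumption: the snippet treats every connected component of the Next-Fragment graph as one mention, which is a coarser criterion than the sub-graph search described above and would merge two mentions that share a fragment. The function name and the example spans are illustrative.

```python
def decode_mentions(spans, next_fragment_edges):
    """Group spans into mentions by connected components over Next-Fragment edges.

    spans: list of (start, end) entity spans.
    next_fragment_edges: list of (former_span, latter_span) pairs; both ends must be in spans.
    """
    parent = {s: s for s in spans}

    def find(s):
        # Follow parent links to the component root, compressing the path on the way.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for former, latter in next_fragment_edges:
        parent[find(former)] = find(latter)

    components = {}
    for s in spans:
        components.setdefault(find(s), []).append(s)
    return sorted(sorted(fragments) for fragments in components.values())


# "aching in" (1, 2) -> "shoulders" (5, 5) is a Next-Fragment pair;
# "aching in legs" (1, 3) is a mention on its own.
mentions = decode_mentions(
    spans=[(1, 2), (1, 3), (5, 5)],
    next_fragment_edges=[((1, 2), (5, 5))],
)
```

In this example the decoder yields the discontinuous mention made of (1, 2) and (5, 5) and the contiguous mention (1, 3).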
5 Experimental Setting
5.1 Datasets and Evaluations
To evaluate T2-NER for three NER tasks, we experiment on eight benchmark datasets. Statistics of these datasets are presented in Table 1.
Flat NER datasets include CoNLL2003 (Sang and Meulder, 2003) and OntoNotes 5.0 (Pradhan et al., 2013b). CoNLL2003 is an English dataset with four types of flat entities. We follow the data processing in Lin et al. (2019). For OntoNotes 5.0, we use the same dataset settings as Yan et al. (2021).
Overlapped NER datasets include ACE2004 (Doddington et al., 2004), ACE2005 (Walker et al., 2006), and GENIA (Kim et al., 2003). ACE2004 and ACE2005 are derived from various domains, such as newswire and online forums. We follow Lu and Roth (2015), splitting the train/dev/test as 8:1:1. For GENIA, we follow Yan et al. (2021), collapsing all entity subtypes into five types and splitting the train/dev/test as 8.1:0.9:1.
Discontinuous NER datasets include CADEC (Karimi et al., 2015), ShARe13 (Pradhan et al., 2013a), and ShARe14 (Mowery et al., 2014), all of which are collected from the biomedical or clinical domain. We use the same data processing as Dai et al. (2020).
Dataset | # Train sent. | # Dev sent. | # Test sent. | Avg. sent. len | # Mentions | # Ovlp. mentions | # Dis. mentions | Avg. mention len |
---|---|---|---|---|---|---|---|---|
CoNLL2003 | 17291 | − | 3453 | 14.38 | 35089 | − | − | 1.45 |
OntoNotes5.0 | 59924 | 8528 | 8262 | 18.11 | 104151 | − | − | 1.83 |
ACE2004 | 6802 | 813 | 897 | 20.12 | 27604 | 12626 | − | 2.50 |
ACE2005 | 7606 | 1002 | 1089 | 17.77 | 30711 | 12404 | − | 2.28 |
GENIA | 15023 | 1669 | 1854 | 25.41 | 56015 | 10263 | − | 1.97 |
CADEC | 5340 | 1097 | 1160 | 16.18 | 6316 | 920 | 679 | 2.72 |
ShARe13 | 8508 | 1250 | 9009 | 14.86 | 11148 | 663 | 1088 | 1.82 |
ShARe14 | 17404 | 1360 | 15850 | 15.06 | 19070 | 1058 | 1656 | 1.74 |
We use strict evaluation: a predicted entity is counted as a true positive if both its span and type match those of a gold entity. For a discontinuous entity, each span must match a span of the gold entity. We use span-level micro-averaged Precision (P), Recall (R), and F1 score (F1) as evaluation metrics.
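The strict matching criterion and the micro-averaged scores can be written down as a short sketch. The representation of a mention as a `(type, fragments)` pair, the function name `micro_prf`, and the type label "ADR" in the example are assumptions made for illustration.

```python
def micro_prf(gold, pred):
    """Micro-averaged precision, recall and F1 under strict matching.

    gold, pred: lists with one set of mentions per sentence. A mention is
    (entity_type, fragments), where fragments is a tuple of (start, end) spans,
    so a prediction only counts when its type and every fragment match exactly.
    """
    true_pos = sum(len(g & p) for g, p in zip(gold, pred))
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


gold = [{("ADR", ((1, 2), (5, 5))), ("ADR", ((1, 3),))}]
pred = [{("ADR", ((1, 2), (5, 5))), ("ADR", ((1, 2),))}]
precision, recall, f1 = micro_prf(gold, pred)
```

In the example, the discontinuous mention is matched exactly while the second prediction misses one token of the gold span, so it counts as both a false positive and a false negative.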
5.2 Implementation Details
Considering the dataset domains, we use BioBERT (Lee et al., 2020) for GENIA and CADEC, ClinicalBERT (Alsentzer et al., 2019) for ShARe13 and ShARe14, and vanilla BERT (Devlin et al., 2019) for the other datasets. To obtain the dependency syntax information, we harness the Stanford CoreNLP parser 4.5.4 (Manning et al., 2014), which is based on a neural shift-reduce parsing model (Zhu et al., 2013). This parser is trained with a collection of syntactically annotated data, i.e., the Penn Treebank corpus. We apply it directly, without any additional training data. The parser training data do not overlap with the evaluation datasets. Thus, there is no data contamination issue in our work.
We incorporate cross-sentence information by expanding the input to a fixed window size with its context and ensuring that each text is located in the middle of the expanded sequence. In practice, we use grid search to find the best experimental configuration. Concretely, the window size is chosen from [128, 256, 384], the size of span width feature w from [150, 200], the loss weight α and β from [0.6, 0.8, 1.0], the MLP layer from [1, 2, 3]. As for the AGGCN part, the GCN layer l is chosen from [1, 2], the GCN head Nhead from [2, 4]. Balancing effectiveness and efficiency, we choose the following configurations to produce the experimental results reported in the following sections. The window size is set to 256 for both stages, the span width feature size w to 150, the loss weight α and β to 1.0, the MLP layer to 2, the MLP size to 150, the GCN layer l to 2 and the GCN head Nhead to 4. These hyperparameters are the same for all datasets. In the enumeration of the span, we set the maximum span length L as 16 for OntoNotes5.0 and GENIA, and 8 for the other datasets.
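The cross-sentence context expansion can be sketched as below. This is a simplified illustration: the function name and the token-offset interface are hypothetical, and the sketch does not redistribute unused context budget when the sentence sits at a document boundary.

```python
def expand_with_context(doc_tokens, start, end, window):
    """Return a context window around doc_tokens[start:end] and the sentence offset in it.

    The sentence is kept roughly in the middle by splitting the remaining
    budget (window minus sentence length) evenly between left and right context.
    """
    length = end - start
    if length >= window:
        # Sentence already fills the window: truncate it and add no context.
        return doc_tokens[start:start + window], 0
    extra = window - length
    left = extra // 2
    right = extra - left
    lo = max(0, start - left)
    hi = min(len(doc_tokens), end + right)
    return doc_tokens[lo:hi], start - lo


doc = list(range(20))  # stand-in for a tokenized document
context, offset = expand_with_context(doc, start=8, end=12, window=10)
```

The returned offset locates the original sentence inside the expanded sequence, so span indices can be shifted accordingly before markers are built.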
During the training process, we adopt the Adam optimizer with decoupled weight decay (AdamW; Loshchilov and Hutter, 2019), with a learning rate of 2e-5 to finetune BERT and 1e-3 to train the other parts of the model. For the batch size, we search from [4, 8, 16, 32] and choose 8 for all datasets. For the training epochs, we search from [3, 5, 8, 10, 15, 50], and finally choose 8 for CoNLL2003, 5 for OntoNotes5.0, 15 for ACE2004, 10 for ACE2005, 50 for GENIA and CADEC, and 10 for ShARe13 and ShARe14. We run each experiment five times and report the averaged score.
5.3 Baselines
Sequential labeling-based methods assign a tag to each token with various tagging schemes, including Seq2Seq (Straková et al., 2019) and Second-Path (Shibuya and Hovy, 2020). Hypergraph-based methods construct hypergraphs to represent and extract entities, including Seg-Graph (Wang and Lu, 2018) and Two-Stage (Wang and Lu, 2019). Transition-based methods maintain a stack and a buffer to represent and infer entity mentions, including Transition (Dai et al., 2020) and ARN (Lin et al., 2019). Generative methods generate entity index or word sequences with the decoder, including BART-Large (Yan et al., 2021) and MAPtr (Fei et al., 2021). Span-based methods enumerate spans and combine them into entities, including BERT-MRC (Li et al., 2020), Locate-Label (Shen et al., 2021), and Extract-Select (Huang et al., 2022). Other methods include MAC (Wang et al., 2021) and W2NER (Li et al., 2022).
We directly adopt the best parameter setup reported in the papers that originally introduced the listed methods. The best results on each dataset are denoted in bold. We use the two-tailed t-test to measure statistical significance. Improvements of T2-NER over the second-best model that are significant at p < 0.05 are marked in the result tables.
6 Results and Analyses
6.1 Results for Flat NER
Table 2 presents the experimental results on two flat NER datasets. As seen: (1) T2-NER achieves the SOTA performance with at least 1.06% and 0.87% F1 score improvements on CoNLL2003 and OntoNotes 5.0 datasets, respectively. (2) T2-NER outperforms other unified NER models (i.e., BART-Large and W2NER) for the flat NER task. We attribute this to the two-stage architecture, which further double-checks the flat NER results in the second stage. (3) T2-NER outperforms other span-based models (i.e., BERT-MRC, Locate-Label and Extract-Select). The reason may be that we introduce the syntax information to enhance the span representation, and our templates are more advantageous in learning span-wise representation for span extraction.
Model | CoNLL2003 P | CoNLL2003 R | CoNLL2003 F1 | OntoNotes 5.0 P | OntoNotes 5.0 R | OntoNotes 5.0 F1 |
---|---|---|---|---|---|---|
Seq2Seq (Straková et al., 2019) | − | − | 92.98 | − | − | − |
Seg-Graph (Wang and Lu, 2018) | − | − | 90.50 | − | − | − |
BART-Large (Yan et al., 2021) | 92.56 | 93.56 | 93.05 | 89.62 | 90.92 | 90.27 |
BERT-MRC (Li et al., 2020) | 92.33 | 94.61 | 93.04 | 92.98 | 89.95 | 91.11 |
Locate-Label (Shen et al., 2021) | 92.13 | 93.73 | 92.94 | − | − | − |
Extract-Select (Huang et al., 2022) | 92.10 | 94.03 | 93.05 | − | − | − |
W2NER (Li et al., 2022) | 92.71 | 93.44 | 93.07 | 90.03 | 90.97 | 90.50 |
T2-NER | 93.78 | 94.48 | 94.13 | 91.83 | 92.13 | 91.98 |
6.2 Results for Overlapped NER
Table 3 presents the result comparisons on three overlapped NER datasets. As seen: (1) T2-NER can effectively deal with the nested NER task, achieving SOTA performance with 1.01%, 2.58%, and 0.43% F1 score improvements on the ACE2004, ACE2005, and GENIA datasets, respectively. (2) Different from other unified NER models, T2-NER achieves higher Precision but relatively lower Recall, since it double-checks the entity spans with the “Overlapped” relation and further filters out spans that it deems low-confidence. In contrast, the results of T2-NER on the flat NER task do not show a similar trend. We argue that this occurs because the second stage does not conduct a dedicated classification of flat entities. (3) Moreover, the superiority of T2-NER over other span-based models also supports the effectiveness of T2-NER in obtaining better span representations, which we attribute to the adjacent packing strategy and the utilization of syntax information.
Model | ACE2004 P | ACE2004 R | ACE2004 F1 | ACE2005 P | ACE2005 R | ACE2005 F1 | GENIA P | GENIA R | GENIA F1 |
---|---|---|---|---|---|---|---|---|---|
Seq2Seq (Straková et al., 2019) | − | − | 84.33 | − | − | 83.42 | − | − | 78.20 |
Second-Path (Shibuya and Hovy, 2020) | 83.73 | 81.91 | 82.81 | 82.98 | 82.42 | 82.70 | 78.07 | 76.45 | 77.25 |
Seg-Graph (Wang and Lu, 2018) | 78.00 | 72.40 | 75.10 | 76.80 | 72.30 | 74.50 | 77.00 | 73.30 | 75.10 |
ARN (Lin et al., 2019) | − | − | − | 76.20 | 73.60 | 74.90 | 75.80 | 73.90 | 74.80 |
BART-Large (Yan et al., 2021) | 87.27 | 86.41 | 86.84 | 83.16 | 86.38 | 84.74 | 78.87 | 79.60 | 79.23 |
BERT-MRC (Li et al., 2020) | 85.05 | 86.32 | 85.98 | 87.16 | 86.59 | 86.88 | 85.18 | 81.12 | 83.75 |
Locate-Label (Shen et al., 2021) | 87.44 | 87.38 | 87.41 | 86.09 | 87.27 | 86.67 | 80.19 | 80.89 | 80.54 |
W2NER (Li et al., 2022) | 87.33 | 87.71 | 87.52 | 85.03 | 88.62 | 86.79 | 83.10 | 79.76 | 81.39 |
Extract-Select (Huang et al., 2022) | 88.26 | 88.53 | 88.39 | 87.15 | 88.37 | 87.76 | 83.64 | 84.41 | 84.02 |
T2-NER | 90.88 | 87.97 | 89.40 | 90.53 | 90.15 | 90.34 | 85.09 | 83.82 | 84.45 |
6.3 Results for Discontinuous NER
Table 4 shows the performance comparison on three discontinuous NER datasets. As seen, T2-NER outperforms the previous best model, W2NER, by 2.48%, 1.28%, and 0.37% in F1 score on CADEC, ShARe13, and ShARe14, respectively, obtaining new SOTA results. Although some methods outperform T2-NER in Precision or Recall, they do so at the expense of the other metric, which results in a lower F1 score. The same observation holds for the flat and overlapped NER results.
Model | CADEC P | CADEC R | CADEC F1 | ShARe13 P | ShARe13 R | ShARe13 F1 | ShARe14 P | ShARe14 R | ShARe14 F1 |
---|---|---|---|---|---|---|---|---|---|
Two-Stage (Wang and Lu, 2019) | 72.10 | 48.40 | 58.00 | 83.80 | 60.40 | 70.30 | 79.10 | 70.70 | 74.70 |
Transition (Dai et al., 2020) | 68.90 | 69.00 | 69.00 | 80.50 | 75.00 | 77.70 | 78.10 | 81.20 | 79.60 |
BART-Large (Yan et al., 2021) | 70.08 | 71.21 | 70.64 | 82.09 | 77.42 | 79.69 | 77.20 | 83.75 | 80.34 |
MAPtr (Fei et al., 2021) | 75.50 | 71.80 | 72.40 | 87.90 | 77.20 | 80.30 | − | − | − |
MAC (Wang et al., 2021) | 70.50 | 72.50 | 71.50 | 84.30 | 78.20 | 81.20 | 78.20 | 84.70 | 81.30 |
W2NER (Li et al., 2022) | 74.09 | 72.35 | 73.21 | 85.57 | 79.68 | 82.52 | 79.88 | 83.71 | 81.75 |
T2-NER | 78.33 | 73.22 | 75.69 | 87.41 | 80.48 | 83.80 | 80.53 | 83.77 | 82.12 |
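The trade-off between Precision and Recall noted above follows from F1 being their harmonic mean. The snippet below is a minimal sketch, assuming the standard definition F1 = 2PR / (P + R); the function name `f1_score` is introduced here for illustration only, and the inputs are the CADEC rows of W2NER and T2-NER from Table 4.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# CADEC rows from Table 4: (Precision, Recall)
w2ner_f1 = f1_score(74.09, 72.35)
t2ner_f1 = f1_score(78.33, 73.22)

print(f"W2NER  F1 = {w2ner_f1:.2f}")
print(f"T2-NER F1 = {t2ner_f1:.2f}")
print(f"gain      = {round(t2ner_f1, 2) - round(w2ner_f1, 2):.2f}")
```

Because the harmonic mean is pulled toward the smaller of its two arguments, a model that gains Precision by giving up a larger amount of Recall (or vice versa) ends up with a lower F1 score.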
Recall that the three discontinuous NER datasets also contain flat and overlapped entities; only around 10% of the entity mentions in these datasets are discontinuous. To understand how T2-NER behaves on data dominated by discontinuous entities, we follow Dai et al. (2020) and experiment on a subset of the test set that includes only sentences with at least one discontinuous entity mention (Figure 4(a)). Since sentences in this subset sometimes contain flat and overlapped mentions as well, we further report the test results when only discontinuous entity mentions are considered (Figure 4(b)). We compare with several discontinuous NER models and report the results on this subset of the three datasets in Figure 4. Our model can predict the discontinuous entity mentions and consistently outperforms the baseline models in both settings.
6.4 Model Ablation Study
To elucidate the contribution of the main components, we design five internal baselines for comparison. We only report the results on CoNLL2003, ACE2005, and CADEC datasets as the findings on the other datasets are qualitatively similar.
w/o. Adjacent Packing: This variation removes the adjacent packing strategy and sequentially clusters spans into multiple groups of equal size K. For instance, the group S1 would be . In practice, we set K to 128. Accordingly, the grouped templates would also be different.
w/o. Latter Packing: This variation leaves out the latter packing strategy and clusters all of the span pairs into multiple groups. Specifically, we batch candidate pairs by appending 4 markers for each pair to the end of the sentence, until the total number of tokens exceeds 250. Accordingly, the appended markers form one template.
w/o. Reverse Setting: This variation adopts uni-directional prediction, leaving out the reverse span pair representation and using only equation (3) for span pair classification.
w/o. Syntax Information: This variation removes the syntax graph-guided AGGCN module. The syntax-enhanced token representation is removed accordingly, without influencing the major goal.
w. Untyped Span Pair Marker: This variation applies untyped span pair markers to construct templates in the second stage.
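To make the difference between the packing variants concrete, the following is a minimal sketch rather than the actual implementation. It assumes spans are (start, end) token-index pairs, that adjacent packing groups spans sharing a start token, and that the w/o. Latter Packing batching starts a new batch once adding another pair's 4 markers would pass the token budget. All function names and the `budget` handling are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start, end), inclusive token indices


def enumerate_spans(n_tokens: int, max_width: int) -> List[Span]:
    """All spans of at most max_width tokens, ordered by start then end."""
    return [(s, e)
            for s in range(n_tokens)
            for e in range(s, min(s + max_width, n_tokens))]


def pack_by_start(spans: List[Span]) -> Dict[int, List[Span]]:
    """Adjacent-packing-style grouping: spans with the same start token."""
    groups: Dict[int, List[Span]] = defaultdict(list)
    for start, end in spans:
        groups[start].append((start, end))
    return dict(groups)


def pack_equal_size(spans: List[Span], k: int) -> List[List[Span]]:
    """Ablation-style grouping: consecutive chunks of size k."""
    return [spans[i:i + k] for i in range(0, len(spans), k)]


def batch_pairs(pairs: List[Tuple[Span, Span]], sentence_len: int,
                budget: int = 250, markers_per_pair: int = 4
                ) -> List[List[Tuple[Span, Span]]]:
    """Append pairs to a batch until the next pair would pass the budget."""
    batches, current, used = [], [], sentence_len
    for pair in pairs:
        if current and used + markers_per_pair > budget:
            batches.append(current)
            current, used = [], sentence_len
        current.append(pair)
        used += markers_per_pair
    if current:
        batches.append(current)
    return batches


spans = enumerate_spans(n_tokens=4, max_width=3)
print(pack_by_start(spans)[0])       # spans that all start at token 0
print(pack_equal_size(spans, 4)[0])  # a chunk that mixes start tokens
```

With equal-size chunks, a group can mix spans that start at different tokens, which is consistent with the observation that such groups carry less boundary information than same-start-token groups.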
Results are shown in Table 5. As seen: (1) T2-NER matches or outperforms all five internal baselines on the test sets of the three datasets. Compared to T2-NER, w/o. Adjacent Packing drops 0.02%, 0.52%, and 0.94% F1 score on CoNLL2003, ACE2005, and CADEC, respectively. The results demonstrate that it is sub-optimal to simply pack spans equally into multiple groups. The reason may be that modeling such groups does not produce meaningful information, whereas the adjacent packing strategy can learn discriminative boundary information through the integral modeling of same-start-token spans. (2) w/o. Latter Packing shows a large performance drop on ACE2005 (1.98%) and CADEC (1.41%). This result demonstrates that the integral modeling of same-former spans contributes to discriminative representations of span pairs, enabling better Next-Fragment and Overlapped relation classification. Moreover, w/o. Latter Packing suffers a slight F1 score decrease (0.01%) on CoNLL2003, which suggests that the double check of the Other relation in the second stage is also effective. (3) When the reverse setting is removed, the F1 scores on all three datasets decrease, indicating the importance of modeling the information from the latter span to the former span. (4) The results of w/o. Syntax Information and w. Untyped Span Pair Marker show that removing either the AGGCN module or the entity type information lowers the F1 score in every case except w. Untyped Span Pair Marker on CoNLL2003, where it is unchanged. This observation suggests that syntax information and entity type information are both effective for our model. (5) After removing the syntax information, the performance drop on CoNLL2003 and ACE2005 is larger than on CADEC. The reason may be that the parser performs better in the news domain than in the biomedical domain, as it is trained on news text.
Model | CoNLL2003 | ACE2005 | CADEC |
---|---|---|---|
w/o. Adjacent Packing | 94.11 (−0.02) | 89.82 (−0.52) | 74.75 (−0.94) |
w/o. Latter Packing | 94.12 (−0.01) | 88.36 (−1.98) | 74.28 (−1.41) |
w/o. Reverse Setting | 94.08 (−0.05) | 90.17 (−0.17) | 75.37 (−0.32) |
w/o. Syntax Information | 94.00 (−0.13) | 89.84 (−0.50) | 75.65 (−0.04) |
w. Untyped Span Pair Marker | 94.13 (0) | 90.14 (−0.20) | 75.36 (−0.33) |
T2-NER | 94.13 | 90.34 | 75.69 |
6.5 Analysis of Joint Training
To explore the influence of jointly training the two stages, we present the performance change of the first stage (i.e., Span Extraction) before and after adding the second stage (i.e., Span Pair Classification). The performance comparisons on CADEC are shown in Table 6. We find that the F1 score of span extraction increases by 0.39% after adding the span pair classification task. This observation shows that the second stage can benefit the first stage, suggesting that joint training is preferable. Moreover, the double-checking in the second stage may benefit the extraction of flat and overlapped spans in the first stage.
6.6 Analysis of Complexity
Along with their significant performance improvements, pre-trained language models (e.g., BERT) usually face the issue of high computational cost. Even worse, this issue becomes more severe as the sequence length increases. In this paper, we insert markers and concatenate them with the original texts, which necessarily extends the length of the input text.
For both stages, we group these markers into several batches, which controls the length of the appended sequences. For the first stage, we enumerate spans in a small-length text and then use its contexts to expand the text to 256 tokens, for which the number of candidate spans in a text is usually less than the context length. Hence, with grouped templates consisting of a small number of span markers, the complexity of T2-NER remains nearly linear in that of the model without templates. For the second stage, after non-entity text spans are filtered out in the first stage, the number of candidate spans is relatively small, so the increase in computation is limited.
We further compare the inference speed of the baselines and T2-NER in Table 7. For a fair comparison, all of these models are implemented in PyTorch and run on an NVIDIA RTX 3090 GPU. As we can see, T2-NER is around 3 times faster than Transition and 10 times faster than BART-Large. Compared with W2NER, our model achieves a 2.48% F1 improvement on CADEC but is about 48% slower. Considering both the effectiveness and the efficiency of our model, we expect it to be effective in practice.
Model | F1 | Speed (sentences/s) |
---|---|---|
Transition (Dai et al., 2020) | 69.00 | 66.5 |
BART-Large (Yan et al., 2021) | 70.64 | 19.2 |
MAC (Wang et al., 2021) | 71.50 | 109.7 |
W2NER (Li et al., 2022) | 73.21 | 365.7 |
T2-NER | 75.69 | 190.3 |
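The speed ratios quoted in the text can be rechecked directly from Table 7. The snippet below is a small sketch of that arithmetic; the dictionary `speed` and the variable names are introduced here for illustration only.

```python
# Inference speed in sentences per second, copied from Table 7.
speed = {
    "Transition": 66.5,
    "BART-Large": 19.2,
    "MAC": 109.7,
    "W2NER": 365.7,
    "T2-NER": 190.3,
}

# How many times faster T2-NER is than the slower baselines.
speedup_vs_transition = speed["T2-NER"] / speed["Transition"]
speedup_vs_bart = speed["T2-NER"] / speed["BART-Large"]

# Fraction of W2NER's throughput that T2-NER gives up.
slowdown_vs_w2ner = 1 - speed["T2-NER"] / speed["W2NER"]

print(f"vs Transition: {speedup_vs_transition:.1f}x faster")
print(f"vs BART-Large: {speedup_vs_bart:.1f}x faster")
print(f"vs W2NER:      {slowdown_vs_w2ner:.0%} slower")
```

These ratios are consistent with the rounded figures in the text (around 3 times, around 10 times, and 48%).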
6.7 Case Study
We find that most mistakes are caused by incorrectly recalling some entities. As shown in Table 8, the span “laceration in esophagus” in case 3 is incorrectly recognized as an entity. Our model may be confused by the preposition “in”, and tends to classify phrases such as “blood in stomach” and “laceration in esophagus” as entities. Nonetheless, the three cases illustrate that T2-NER can recognize flat, overlapped, and discontinuous entities. From case 1, we find that the three overlapped entities, from the innermost to the outermost, are all accurately recognized, showing that T2-NER can recognize multi-level overlapped entities. From case 2, we find that T2-NER correctly recognizes the reference phrase “both side”, whereas W2NER incorrectly recognizes it as a PER entity, showing that T2-NER can resolve ambiguous entity references.
7 Conclusion
In this paper, we introduce a two-stage span-based framework with templates, named T2-NER, for the unified NER task. T2-NER formulates the NER task as span pair relation classification, thus naturally tackling overlapped and discontinuous NER. Thanks to this formulation, T2-NER is effective across various NER tasks, achieving SOTA performance on eight benchmark datasets. As future work, we plan to improve T2-NER by exploring more advanced span features and knowledge derived from external sources such as Wikipedia.
Acknowledgments
We would like to thank our action editor, Miguel Ballesteros, and the anonymous reviewers for their invaluable suggestions and feedback. This work is partially supported by the National Key R&D Program of China under grant No. 2022YFB3102600, NSFC under grants Nos. 62002373, 62006243, 62272469, and 71971212, and the Science and Technology Innovation Program of Hunan Province under grant No. 2020RC4046.
A Step-by-step Examples of Our Work
The detailed examples of our work step by step are shown in Figure 5.
Author notes
Action Editor: Miguel Ballesteros