Abstract
Named Entity Recognition (NER) is a fundamental NLP task, commonly formulated as classification over a sequence of tokens. Morphologically rich languages (MRLs) pose a challenge to this basic formulation, as the boundaries of named entities do not necessarily coincide with token boundaries, but rather respect morphological boundaries. To address NER in MRLs we thus need to answer two fundamental questions, namely, what are the basic units to be labeled, and how can these units be detected and classified in realistic settings (i.e., where no gold morphology is available). We empirically investigate these questions on a novel NER benchmark, with parallel token-level and morpheme-level NER annotations, which we develop for Modern Hebrew, a morphologically rich-and-ambiguous language. Our results show that explicitly modeling morphological boundaries leads to improved NER performance, and that a novel hybrid architecture, in which NER precedes and prunes morphological decomposition, greatly outperforms the standard pipeline, where morphological decomposition strictly precedes NER, setting a new performance bar for both Hebrew NER and Hebrew morphological decomposition tasks.
1 Introduction
Named Entity Recognition (NER) is a fundamental task in the area of Information Extraction (IE), in which mentions of Named Entities (NEs) are extracted and classified in naturally occurring texts. The task is most commonly formulated as sequence labeling, where extraction takes the form of assigning each input token a label that marks the boundaries of the NE (e.g., B, I, O), and classification takes the form of assigning labels that indicate the entity type (Per, Org, Loc, etc.).
Despite the common impression that NER is close to being solved, brought about by the performance of neural models on the main English NER benchmarks—CoNLL 2003 (Tjong Kim Sang, 2003) and OntoNotes (Weischedel et al., 2013)—the NER task in real-world settings is far from solved. Specifically, NER performance is shown to greatly diminish when moving to other domains (Luan et al., 2018; Song et al., 2018), when addressing the long tail of rare, unseen, and new user-generated entities (Derczynski et al., 2017), and when handling languages with fundamentally different structure from English. In particular, there is no readily available and empirically verified modeling strategy for neural NER in languages with complex word-internal structure, also known as morphologically rich languages.
Morphologically rich languages (MRLs) (Tsarfaty et al., 2010; Seddah et al., 2013; Tsarfaty et al., 2020) are languages in which substantial information concerning the arrangement of words into phrases and the relations between them is expressed at the word level, rather than in a fixed word order or a rigid sentence structure. The wealth of information expressed at the word level, and the morpho-phonological processes creating these words, result in high token-internal complexity, which poses serious challenges to the basic formulation of NER as classification of raw, space-delimited tokens. Specifically, while NER in English is formulated as sequence labeling of space-delimited tokens, in MRLs a single token may include multiple meaning-bearing units, henceforth morphemes, only some of which are relevant for the entity mention at hand.
In this paper we formulate two questions concerning neural modeling strategies for NER in MRLs, namely: (i) What should be the granularity of the units to be labeled? Space-delimited tokens or finer-grain morphological segments? and, (ii) How can we effectively encode, and accurately detect, the morphological segments that are relevant to NER, and specifically in realistic settings, when gold morphological boundaries are not available?
To empirically investigate these questions we develop a novel parallel benchmark, containing parallel token-level and morpheme-level NER annotations for texts in Modern Hebrew—a morphologically rich and morphologically ambiguous language, which is known to be notoriously hard to parse (More et al., 2019; Tsarfaty et al., 2019).
Our results show that morpheme-based NER is superior to token-based NER, which encourages a segmentation-first pipeline. At the same time, we demonstrate that token-based NER improves morphological segmentation in realistic scenarios, encouraging a NER-first pipeline. While these two findings may appear contradictory, we aim here to offer a resolution: a hybrid architecture in which the token-based NER predictions precede and prune the space of morphological decomposition options, while the actual morpheme-based NER takes place only after the morphological decomposition. We empirically show that the hybrid architecture we propose outperforms all token-based and morpheme-based model variants of Hebrew NER on our benchmark, and it further outperforms all previously reported results on Hebrew NER and morphological decomposition. Our error analysis further demonstrates that morpheme-based models generalize better, that is, they contribute to recognizing the long tail of entities unseen during training (out-of-vocabulary, OOV), in particular those unseen entities that turn out to be composed of previously seen morphemes.
The contribution of this paper is thus manifold. First, we define key architectural questions for neural NER modeling in MRLs and chart the space of modeling options. Second, we deliver a novel parallel benchmark that allows one to empirically compare and contrast the morpheme vs. token modeling strategies. Third, we show consistent advantages for morpheme-based NER, demonstrating the importance of morphologically aware modeling. Next, we present a novel hybrid architecture which demonstrates even further improved performance on both NER and morphological decomposition tasks. Our results for Hebrew set a new bar on these tasks, outperforming the reported state-of-the-art results on various benchmarks.1
2 Research Questions: NER for MRLs
In MRLs, words are internally complex, and word boundaries do not generally coincide with the boundaries of more basic meaning-bearing units. This fact has critical ramifications for sequence labeling tasks in MRLs in general, and for NER in MRLs in particular. Consider, for instance, the three-token Hebrew phrase in (1):2
It is clear that תאילנד/thailand (Thailand) and סין/sin (China) are NEs, and in English, each NE is its own token. In the Hebrew phrase, though, neither NE constitutes a single token. In either case, the NE occupies only one of two morphemes in the token, the other being a case-assigning preposition. This simple example demonstrates an extremely frequent phenomenon in MRLs such as Hebrew, Arabic, or Turkish: the adequate boundaries for NEs do not coincide with token boundaries, and tokens must be segmented in order to obtain accurate NE boundaries.3
The segmentation of tokens and the identification of adequate NE boundaries is, however, far from trivial, due to complex morpho-phonological and orthographic processes in some MRLs (Vania et al., 2018; Klein and Tsarfaty, 2020). This means that the morphemes that compose NEs are not necessarily transparent in the character sequence of the raw tokens. Consider for example phrase (2):
Here, the full form of the NE הבית הלבן / habayit halavan (the White House) is not present in the utterance; only the sub-string בית הלבן / bayit halavan ((the) White House) is present in (2), due to phonetic and orthographic processes suppressing the definite article ה/ha in certain environments. In this and many other cases, it is not only that NE boundaries do not coincide with token boundaries; they do not coincide with characters or sub-strings of the token either. This calls for accessing the more basic meaning-bearing units of the token, that is, for decomposing the tokens into morphemes.
Unfortunately though, the morphological decomposition of surface tokens may be very challenging due to extreme morphological ambiguity. The sequence of morphemes composing a token is not always directly recoverable from its character sequence, and is not known in advance.4 This means that for every raw space-delimited token, there are many conceivable readings which impose different segmentations, yielding different sets of potential NE boundaries. Consider for example the token לבני (lbny) in different contexts:
In (3a) the token לבני is completely consumed as a labeled NE. In (3b) לבני is only partly consumed by an NE, and in (3c) and (3d) the token is entirely out of an NE context. In (3c) the token is composed of several morphemes, and in (3d) it consists of a single morpheme. These are only some of the possible decompositions of this surface token; other alternatives may still be available. As shown by Goldberg and Tsarfaty (2008), Green and Manning (2010), Seeker and Çetinoğlu (2015), Habash and Rambow (2005), More et al. (2019), and others, the correct morphological decomposition becomes apparent only in the larger (syntactic or semantic) context. The challenge, in a nutshell, is as follows: in order to accurately detect NE boundaries, we need to segment the raw token first; however, in order to segment tokens correctly, we need to know the greater semantic content, including, for example, the participating entities. How can we break out of this apparent loop?
Finally, MRLs are often characterized by an extremely sparse lexicon, consisting of a long tail of OOV entities unseen during training (Czarnowska et al., 2019). Even in cases where all morphemes are present in the training data, morphological compositions of seen morphemes may yield tokens and entities which were unseen during training. Take for example the utterance in (4), which may look familiar to the reader:
Example (4) is in fact example (1) with a switched flight direction. This subtle change creates two new surface tokens, לתאילנד and מסין, which might not have been seen during training, even if example (1) had been observed. Morphological compositions of an entity with prepositions, conjunctions, definite markers, possessive clitics, and more, cause mentions of seen entities to surface in unfamiliar forms, which often fail to be accurately detected and analyzed.
Given the aforementioned complexities, in order to solve NER for MRLs we ought to answer the following fundamental modeling questions:
Q1. Units: What are the discrete units upon which we need to set NE boundaries in MRLs? Are they tokens? characters? morphemes? a representation containing multiple levels of granularity?
Q2. Architecture: When employing morphemes in NER, the classical approach is “segmentation-first”. However, segmentation errors are detrimental and downstream NER cannot recover from them. How is it best to set up the pipeline so that segmentation and NER could interact?
Q3. Generalization: How do the different modeling choices affect NER generalization in MRLs? How can we address the long tail of OOV NEs in MRLs? Which modeling strategy best handles pseudo-OOV entities that result from a previously unseen composition of already seen morphemes?
3 Formalizing NER for MRLs
To answer the aforementioned questions, we chart and formalize the space of modeling options for neural NER in MRLs. We cast NER as a sequence labeling task and formalize it as a function f : 𝒳 → 𝒴, where an input x ∈ 𝒳 is a sequence x1, …, xn of n discrete strings from some vocabulary xi ∈ Σ, and an output y ∈ 𝒴 is a sequence y1, …, yn of the same length, where yi ∈ Labels, and Labels is a finite set of labels composed of the BIOSE tags (i.e., BIOLU as described in Ratinov and Roth, 2009). Every non-O label is also enriched with an entity type label. Our list of types is presented in Table 2.
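For concreteness, the following minimal sketch (ours, for illustration) enumerates the label space induced by crossing the BIOSE boundary tags with the entity-type mnemonics of Table 2:

```python
# Enumerate the BIOSE-x-type output label space (type mnemonics as in Table 2).
TYPES = ["Per", "Org", "Gpe", "Loc", "Fac", "Woa", "Eve", "Duc", "Ang"]
LABELS = ["O"] + [f"{b}-{t}" for b in ("B", "I", "E", "S") for t in TYPES]
print(len(LABELS))  # 1 + 4 * 9 = 37 labels
```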
3.1 Token-Based or Morpheme-Based?
Our first modeling question concerns the discrete units upon which to set the NE boundaries. That is, what is the formal definition of the input vocabulary Σ for the sequence labeling task?
3.2 Realistic Morphological Decomposition
A major caveat of morpheme-based modeling strategies is that they often assume an idealized scenario with gold morphological decomposition of the space-delimited tokens into morphological segments (cf. Nivre et al., 2007; Pradhan et al., 2012). But in reality, gold morphological decomposition is not known in advance; it has to be predicted automatically, and prediction errors may propagate to contaminate the downstream task.
Our second modeling question therefore concerns the interaction between the morphological decomposition and the NER tasks: How would it be best to set up the pipeline so that the prediction of the two tasks can interact?
Both the Standard MD and the Hybrid MD architectures are disambiguation architectures that result in a morpheme sequence M ∈ ℳ. The latter benefits from the NER signal, while the former does not. The sequence M ∈ ℳ can be used in one of two ways. We can use M as input to a morpheme model to output morpheme labels. Or, we can rely on the output of the token-multi model and align the token's multi-label with the segments in M.
In what follows, we want to empirically assess the effect of different modeling choices (token-single, token-multi, morpheme) and disambiguation architectures (Standard, Hybrid) on the performance of NER in MRLs. To this end, we need a corpus that allows training and evaluating NER at both token and morpheme-level granularity.
4 The Data: A Novel NER Corpus
This work empirically investigates NER modeling strategies in Hebrew, a Semitic language known for its complex and highly ambiguous morphology. Ben-Mordecai (2005), the only previous work on Hebrew NER to date, annotated space-delimited tokens, basing their guidelines on the MUC-7 annotation scheme (Chinchor et al., 1999).
Popular Arabic NER corpora also label space-delimited tokens (ANERcorp [Benajiba et al., 2007], AQMAR [Mohit et al., 2012], TWEETS [Darwish, 2013]), with the exception of the Arabic portion of OntoNotes (Weischedel et al., 2013) and ACE (LDC, 2008), which annotate NER labels on gold morphologically pre-segmented texts. However, these works do not provide a comprehensive analysis of the performance gaps between morpheme-based and token-based scenarios.
In agglutinative languages such as Turkish, token segmentation is always performed before NER (Tür et al., 2003; Küçük and Can, 2019), reinforcing the need to contrast the token-based scenario, widely adopted for Semitic languages, with the morpheme-based scenario adopted for other MRLs.
Our first contribution is thus a parallel corpus for Hebrew NER; one version consists of gold-labeled tokens and the other consists of gold-labeled morphemes, for the same text. For this, we performed gold NE annotation of the Hebrew Treebank (Sima’an et al., 2001), based on the 6,143 morpho-syntactically analyzed sentences of the HAARETZ corpus, to create both token-level and morpheme-level variants, as illustrated at the topmost and lowest rows of Table 1, respectively.
Annotation Scheme We started off with the guidelines of Ben-Mordecai (2005), from which we deviate in three main ways. First, we label NE boundaries and their types on sequences of morphemes, in addition to the space-delimited token annotations.6 Second, we use the finer-grained entity categories list of ACE (LDC, 2008).7 Finally, we allow nested entity mentions, as in Finkel and Manning (2009) and Benikova et al. (2014).8
Annotation Cycle As Fort et al. (2009) put it, examples and rules will never cover all possible cases, because of the specificity of natural language and the ambiguity of formulation. To address this, we employed the cyclic approach of agile annotation offered by Alex et al. (2010). Each cycle consisted of annotation, evaluation and curation, and clarification and refinement. We used WebAnno (Yimam et al., 2013) as our annotation interface.
The Initial Annotation Cycle was a two-stage pilot with 12 participants, divided into 2 teams of 6. The teams received the same guidelines, with the exception of the specifications of entity boundaries. One team was guided to annotate the minimal string that designates the entity. The other was guided to tag the maximal string which can still be considered as the entity. Our agreement analysis showed that the minimal guideline generally led to more consistent annotations. Based on this result (as well as low-level refinements) from the pilot, we devised the full version of the guidelines.9
Annotation, Evaluation, and Curation: Every annotation cycle was performed by two annotators (A, B) and an annotation manager/curator (C). We annotated the full corpus in 7 cycles. We evaluated the annotation in two ways: manual curation and automatic evaluation. After each annotation step, the curator manually reviewed every sentence in which disagreements arose, as well as specific points of difficulty pointed out by the annotators. The inter-annotator agreement metric described below was also used to quantitatively gauge the progress and quality of the annotation.
Clarifications and Refinements: At the end of each cycle we held a clarification talk between A, B, and C, in which issues that came up during the cycle were discussed. Following that talk we refined the guidelines and updated the annotators, who went on to the next cycle. After the last cycle, we performed a final curation run to make sentences from earlier cycles comply with later refinements.10
Inter-Annotator Agreement (IAA) IAA is commonly measured using the κ-statistic. However, Pyysalo et al. (2007) show that it is not suitable for evaluating inter-annotator agreement in NER. Instead, an F1 metric on entity mentions has in recent years been adopted for this purpose (Zhang, 2013). This metric allows for computing pair-wise IAA using standard F1 score by treating one annotator as gold and the other as the prediction.
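For concreteness, pairwise IAA can be computed with any standard entity-level F1 implementation; the following is a minimal sketch using the seqeval library (not necessarily the tooling used here), with hypothetical BIOSE annotations:

```python
# Pairwise IAA as strict entity-mention F1: treat annotator A as gold
# and annotator B as the prediction (F1 is symmetric under this swap).
from seqeval.metrics import f1_score

# Hypothetical BIOSE annotations of the same three-token sentence.
annotator_a = [["B-PER", "E-PER", "O"]]
annotator_b = [["B-PER", "E-PER", "S-GPE"]]

print(f"pairwise IAA (F1): {f1_score(annotator_a, annotator_b):.2f}")
```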
Our full corpus pair-wise F1 scores are: IAA(A,B)=89, IAA(B,C)=92, IAA(A,C)=96. Table 2 presents final corpus statistics.
Table 2: Corpus statistics and entity-mention counts per type.

| | train | dev | test |
|---|---|---|---|
| Sentences | 4,937 | 500 | 706 |
| Tokens | 93,504 | 8,531 | 12,619 |
| Morphemes | 127,031 | 11,301 | 16,828 |
| All mentions | 6,282 | 499 | 932 |
| Type: Person (Per) | 2,128 | 193 | 267 |
| Type: Organization (Org) | 2,043 | 119 | 408 |
| Type: Geo-Political (Gpe) | 1,377 | 121 | 195 |
| Type: Location (Loc) | 331 | 28 | 41 |
| Type: Facility (Fac) | 163 | 12 | 11 |
| Type: Work-of-Art (Woa) | 114 | 9 | 6 |
| Type: Event (Eve) | 57 | 12 | 0 |
| Type: Product (Duc) | 36 | 2 | 3 |
| Type: Language (Ang) | 33 | 3 | 1 |
Annotation Costs The annotation took on average about 35 seconds per sentence, and thus a total of 60 hours for all sentences in the corpus, for each annotator. Six clarification talks were held between the cycles, each lasting from thirty minutes to an hour, giving a total of about 130 work hours of expert annotators.11
5 Experimental Settings
Goal We set out to empirically evaluate the representation alternatives for the input/output sequences (token-single, token-multi, morpheme) and the effect of different architectures (Standard, Hybrid) on the performance of NER for Hebrew.
Modeling Variants All experiments use the corpus we just described and employ a standard Bi-LSTM-CRF architecture for implementing the neural sequence labeling task (Huang et al., 2015). Our basic architecture12 is composed of an embedding layer for the input and a 2-layer Bi-LSTM followed by a CRF inference layer—for which we test three modeling variants.
Figures 2–3 present the variants we employ. Figure 2 shows the token-based variants, token-single and token-multi. The former outputs a single BIOSE label per token, and the latter outputs a multi-label per token—a concatenation of BIOSE labels of the morphemes composing the token. Figure 3 shows the morpheme-based variant for the same input phrase. It has the same basic architecture, but now the input consists of morphological segments instead of tokens. The model outputs a single BIOSE label for each morphological segment in the input.
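To make the three output granularities concrete, the following schematic (ours; the multi-label delimiter is an assumption) shows the input/output formats of the three variants for the two tokens of example (4):

```python
# Example (4): "to Thailand", "from China"; each token = preposition + NE.
tokens = ["לתאילנד", "מסין"]

# token-single: one BIOSE label per space-delimited token.
token_single = ["S-GPE", "S-GPE"]

# token-multi: one multi-label per token, concatenating the BIOSE labels
# of its morphemes (delimiter assumed here to be "+").
token_multi = ["O+S-GPE", "O+S-GPE"]

# morpheme: segmented input, one BIOSE label per morphological segment.
morphemes    = ["ל", "תאילנד", "מ", "סין"]
morpheme_out = ["O", "S-GPE",  "O", "S-GPE"]
```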
In all modeling variants, the input may be encoded in two ways: (a) string-level embeddings (token-based or morpheme-based), optionally initialized with pre-trained embeddings; (b) char-level embeddings, trained simultaneously with the main task (cf. Ma and Hovy, 2016; Chiu and Nichols, 2015; Lample et al., 2016). For char-based encoding (of either tokens or morphemes) we experiment with CharLSTM, CharCNN, or NoChar, that is, no character embedding at all.
We pre-trained all token-based and morpheme-based embeddings on the Hebrew Wikipedia dump of Goldberg (2014). For morpheme-based embeddings, we decompose the input using More et al. (2019), and use the morphological segments as the embedding units.13 We compare GloVe (Pennington et al., 2014) and fastText (Bojanowski et al., 2017). We hypothesize that since fastText uses sub-string information, it will be more useful for analyzing OOVs.
Hyper-Parameters Following Reimers and Gurevych (2017) and Yang et al. (2018), we performed hyper-parameter tuning for each of our model variants. Tuning was done on the dev set in a number of rounds of random search, independently for every input/output and char-embedding architecture. Table 3 shows our selected hyper-parameters.14 The CharCNN window size is particularly interesting, as it was not treated as a hyper-parameter by Reimers and Gurevych (2017) or Yang et al. (2018). However, given the token-internal complexity in MRLs, we conjecture that the window size over characters might have a crucial effect. In our experiments we found that a larger window (7) improved performance. For MRLs, further research into this hyper-parameter might be of interest.
Table 3: Selected hyper-parameters.

| Parameter | Value | Parameter | Value |
|---|---|---|---|
| Optimizer | SGD | *LR (token-single) | 0.01 |
| *Batch Size | 8 | *LR (token-multi) | 0.005 |
| LR decay | 0.05 | *LR (morpheme) | 0.01 |
| Epochs | 200 | Dropout | 0.5 |
| Bi-LSTM layers | 2 | *CharCNN window | 7 |
| *Word Emb Dim | 300 | Char Emb dim | 30 |
| Word Hidden Dim | 200 | *Char Hidden Dim | 70 |
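To make the CharCNN and its window size concrete, here is a minimal PyTorch sketch of a character-CNN encoder (illustrative only; our experiments use the NCRF++ implementation):

```python
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    """Embed characters, convolve with a window over the character
    sequence, and max-pool into one fixed-size vector per token/segment."""
    def __init__(self, n_chars, emb_dim=30, hidden_dim=70, window=7):
        super().__init__()
        self.emb = nn.Embedding(n_chars, emb_dim, padding_idx=0)
        # padding keeps the output length equal to the input length
        self.conv = nn.Conv1d(emb_dim, hidden_dim,
                              kernel_size=window, padding=window // 2)

    def forward(self, char_ids):          # (batch, max_chars)
        x = self.emb(char_ids)            # (batch, max_chars, emb_dim)
        x = self.conv(x.transpose(1, 2))  # (batch, hidden_dim, max_chars)
        return torch.relu(x).max(dim=2).values  # (batch, hidden_dim)
```

A wider window (7 rather than the common 3) lets each convolution span a whole sequence of prefix morphemes plus part of the stem, which is plausibly why it helps on token-internally complex input.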
Evaluation Standard NER studies typically invoke the CoNLL evaluation script, which anchors NEs in token positions (Tjong Kim Sang, 2003). However, this script is inadequate for our purposes, because we want to compare entities across token-based vs. morpheme-based settings. To this end, we use a revised evaluation procedure that anchors each entity in its surface form rather than its token index. Specifically, we report F1 scores on strict, exact matches of the surface forms of entity mentions; that is, the gold and predicted NE spans must match exactly in form, boundaries, and entity type. In all experiments we report both token-level and morpheme-level F-scores, for all models.
Token-Level Evaluation. For the sake of backwards compatibility with previous work on Hebrew NER, we first define token-level evaluation. For token-single this is a straightforward F1 calculation against gold spans. For token-multi and morpheme, we map the predicted label sequence of each token to a single label, using linguistically informed rules we devise (as elaborated in Appendix A).15
Morpheme-Level Evaluation. Our ultimate goal is to obtain precise boundaries of the NEs. Thus, our main metric evaluates NEs against the gold morphological boundaries. For morpheme and token-single models, this is a straightforward F1 calculation against gold spans. Note that for token-single we expect to pay a price for boundary mismatches. For token-multi, we know the number and order of labels, so we align the labels in the multi-label of the token with the morphemes in its morphological decomposition.16
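A minimal sketch of the form-anchored strict matching (hypothetical helper, not our exact evaluation script): gold and predicted mentions are compared as multisets of (surface form, entity type) pairs, so token-based and morpheme-based outputs become directly comparable:

```python
from collections import Counter

def strict_f1(gold, pred):
    """gold, pred: lists of (surface_form, entity_type) mention pairs;
    Counter-based multisets keep repeated identical mentions distinct."""
    gold_m, pred_m = Counter(gold), Counter(pred)
    tp = sum((gold_m & pred_m).values())   # exact form+type matches
    p = tp / sum(pred_m.values()) if pred_m else 0.0
    r = tp / sum(gold_m.values()) if gold_m else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

gold = [("תאילנד", "GPE"), ("סין", "GPE")]
pred = [("לתאילנד", "GPE"), ("סין", "GPE")]  # boundary error on the first NE
print(strict_f1(gold, pred))  # 0.5: only one exact surface-form match
```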
For all experiments and metrics, we report mean and confidence interval (0.95) over ten runs.
Input-Output Scenarios We experiment with two kinds of input settings: token-based, where the input consists of the sequence of space-delimited tokens, and morpheme-based, where the input consists of morphological segments. For the morpheme input, there are three input variants:
- (i) Morph-gold: the morphological sequence is produced by an expert (idealistic).
- (ii) Morph-standard: the morphological sequence is produced by a standard segmentation-first pipeline (realistic).
- (iii) Morph-hybrid: the morphological sequence is produced by the hybrid architecture we propose (realistic).
In the token-multi case we can perform morpheme-based evaluation by aligning individual labels in the multi-label with the morpheme sequence of the respective token. Again we have three options as to which morphemes to use:
- (i) Tok-multi-gold: the multi-label is aligned with morphemes produced by an expert (idealistic).
- (ii) Tok-multi-standard: the multi-label is aligned with morphemes produced by a standard pipeline (realistic).
- (iii) Tok-multi-hybrid: the multi-label is aligned with morphemes produced by the hybrid architecture we propose (realistic).
Pipeline Scenarios Assume an input sentence x. In the Standard pipeline we use YAP,17 the current state-of-the-art morpho-syntactic parser for Hebrew (More et al., 2019), to obtain the predicted segmentation M = MD(MA(x)). In the Hybrid pipeline, we use YAP to first generate the complete morphological lattice MA(x). Then, to obtain a constrained lattice, we omit lattice paths in which the number of morphemes in a token's decomposition does not conform with the number of labels in the multi-label assigned to it by NERtoken-multi(x). Then, we apply YAP's MD to the constrained lattice to obtain M. In predicted morphology scenarios (either Standard or Hybrid), we use the same model weights as trained on the gold segments, but feed predicted morphemes as input.18
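A minimal sketch of the Hybrid pruning step (the lattice representation and the fallback to the unpruned lattice are simplifying assumptions of ours; YAP's actual interface differs):

```python
def prune_lattice(lattice_paths, multi_labels):
    """lattice_paths: candidate analyses of a sentence, each a per-token
    list of morpheme lists; multi_labels: per-token BIOSE label lists
    predicted by the token-multi NER model."""
    kept = [path for path in lattice_paths
            if all(len(morphs) == len(labels)
                   for morphs, labels in zip(path, multi_labels))]
    return kept or lattice_paths  # assumed fallback if pruning empties it

# Two candidate analyses of the single ambiguous token "לבני":
paths = [[["לבני"]], [["ל", "בני"]]]
labels = [["O", "B-Per"]]             # token-multi predicted two labels
print(prune_lattice(paths, labels))   # only the two-morpheme path survives
```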
6 Results
6.1 The Units: Tokens vs. Morphemes
Figure 4 shows the token-level evaluation for the different model variants we defined. We see that morpheme models perform significantly better than the token-single and token-multi variants. Interestingly, explicit modeling of morphemes leads to better NER performance even when evaluated against token-level boundaries. As expected, the performance gaps between variants are smaller with fastText than they are with embeddings that are unaware of characters (GloVe) or with no pre-training at all. We further pursue this in Section 6.3.
Figure 5 shows the morpheme-level evaluation for the same model variants as in Figure 4. The most obvious trend here is the drop in the performance of the token-single model. This is expected, reflecting the inadequacy of token boundaries for identifying accurate boundaries for NER. Interestingly, morpheme and token-multi models keep a similar level of performance as in token-level evaluation, only slightly lower. Their performance gap is also maintained, with morpheme performing better than token-multi. An obvious caveat is that these results are obtained with gold morphology. What happens in realistic scenarios?
6.2 The Architecture: Pipeline vs. Hybrid
Figure 6 shows the token-level evaluation results in realistic scenarios. We first observe a significant drop for morpheme models when Standard predicted segmentation is introduced instead of gold. This means that MD errors are indeed detrimental to the downstream task, at a non-negligible rate. Second, we observe that much of this performance gap is recovered with the Hybrid pipeline. It is noteworthy that while morph hybrid lags behind morph gold, it is still consistently better than the token-based models, token-single and token-multi.
Figure 7 shows morpheme-level evaluation results for the same scenarios as in Figure 6. All trends from the token-level evaluation persist, including a drop for all models with predicted segmentation relative to gold, with the hybrid variant recovering much of the gap. Again, morph gold outperforms token-multi, but morph hybrid shows great advantages over all token-multi variants. This performance gap between morph (gold or hybrid) and token-multi indicates that explicit morphological modeling is indeed crucial for accurate NER.
6.3 Morphologically Aware OOV Evaluation
As discussed in Section 2, morphological composition introduces an extremely sparse word-level “long-tail” in MRLs. In order to gauge this phenomenon and its effects on NER performance, we categorize unseen, out-of-training-vocabulary (OOTV) mentions into 3 categories:
Lexical: Unknown mentions caused by an unknown token which consists of a single morpheme. This is a strictly lexical unknown with no morphological composition (most English unknowns are in this category).
Compositional: Unknown mentions caused by an unknown token which consists of multiple known morphemes. These are unknowns introduced strictly by morphological composition, with no lexical unknowns.
LexComp: Unknown mentions caused by an unknown token consisting of multiple morphemes, of which (at least) one morpheme was not seen during training. In such cases, both unknown morphological composition and lexical unknowns are involved.
We group NEs based on these categories, and evaluate each group separately. We consider mentions that do not fall into any category as Known.
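A minimal sketch of this categorization (hypothetical helper; assumes access to the training vocabularies of tokens and morphemes):

```python
def ootv_category(token, morphs, seen_tokens, seen_morphs):
    if token in seen_tokens:
        return "Known"
    if len(morphs) == 1:
        return "Lexical"        # single-morpheme unknown token
    if all(m in seen_morphs for m in morphs):
        return "Compositional"  # unseen composition of seen morphemes
    return "LexComp"            # composition involving an unseen morpheme

seen_tokens, seen_morphs = {"תאילנד"}, {"תאילנד", "ל"}
print(ootv_category("לתאילנד", ["ל", "תאילנד"], seen_tokens, seen_morphs))
# -> "Compositional"
```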
Figure 8 shows the distribution of entity mentions in the dev set by entity type and OOTV category. OOTV categories that involve composition (Comp and LexComp) are spread across all entity types but one, and in some types they even make up more than half of all mentions.
Figure 9 shows token-level evaluation19 with fastText embeddings, grouped by OOTV type. We first observe that indeed unknown NEs that are due to morphological composition (Comp and LexComp) proved the most challenging for all models. We also find that in strictly Compositional OOTV mentions, morpheme-based models exhibit their most significant performance advantage, supporting the hypothesis that explicit morphology helps to generalize. We finally observe that token-multi models perform better than token-single models for these NEs (in contrast with the trend for non-compositional NEs). This corroborates the hypothesis that even partial modeling of morphology (as in token-multi compared to token-single) is better than none, leading to better generalization.
String-level vs. Character-level Embeddings
To further understand the generalization capacity of different modeling alternatives in MRLs, we probe into the interplay of string-based and char-based embeddings in treating OOTV NEs.
Figure 10 presents 12 plots, each of which presents the level of performance (y-axes) for all models (x-axes). Token-based models are on the left of each x-axis, morpheme-based are on the right. We plot results with and without character embeddings,20 in orange and blue, respectively. The plots are organized in a large grid, with the type of NE on the y-axes (Known, Lex, Comp, LexComp), and the type of pre-training on the x-axes (No pre-training, GloVe, fastText).
At the top-most row, plotting the accuracy for Known NEs, we see a high level of performance for all pre-training methods, with little difference between the types of pre-training, with or without character embeddings. Moving down to the row of Lexical unseen NEs, char-based representations lead to significant advantages when we assume no pre-training; with GloVe pre-training the performance substantially increases, and with fastText the differences in performance with/without char-embeddings almost entirely diminish, indicating that char-based embeddings are somewhat redundant in this case.
The two lower rows in the large grid show the performance for Comp and LexComp unseen NEs, which are ubiquitous in MRLs. For Compositional NEs, pre-training closes only part of the gap between token-based and morpheme-based models. Adding char-based representations indeed helps the token-based models, but crucially does not close the gap with the morpheme-based variants.
Finally, for LexComp NEs at the lowest row, we again see that adding GloVe pre-training and char-based embeddings does not close the gap with morpheme-based models, indicating that not all morphological information is captured by these vectors. For fastText with char-based embeddings the gap between token-multi and morpheme greatly diminishes, but token-multi still performs well above token-single. This suggests that biasing the model to learn about morphology (either via multi-labels or by incorporating morphological boundaries) has advantages for analyzing OOTV entities, beyond the contribution of char-based embeddings alone.
All in all, the biggest advantage of morpheme-based models over token-based models is their ability to generalize from observed tokens to composition-related OOTV (Comp/LexComp). While character-based embeddings do help token-based models generalize, the contribution of modeling morphology is indispensable, above and beyond the contribution of char-based embeddings.
6.4 Setting in the Greater Context
Test Set Results.
Table 4 confirms our best results on the Test set. The trends hold, though results on Test are lower than on Dev. The morph gold scenario still provides an upper bound on performance, but it is not realistic. In the realistic scenarios, morph hybrid generally outperforms all other alternatives. The only divergence is that in token-level evaluation, token-multi performs on a par with morph hybrid on the Test set.
Table 4: Best results (F1) on dev and test, under morpheme-level and token-level evaluation.

| Eval | Model | dev | test |
|---|---|---|---|
| Morph-Level | morph gold | 80.03 ± 0.4 | 79.10 ± 0.6 |
| | morph hybrid | 78.51 ± 0.5 | 77.11 ± 0.7 |
| | morph standard | 72.79 ± 0.5 | 69.52 ± 0.6 |
| | token-multi hybrid | 75.70 ± 0.5 | 74.64 ± 0.3 |
| Token-Level | morph gold | 80.30 ± 0.5 | 79.28 ± 0.6 |
| | morph hybrid | 79.04 ± 0.5 | 77.64 ± 0.7 |
| | morph standard | 74.52 ± 0.7 | 73.53 ± 0.8 |
| | token-multi | 77.59 ± 0.4 | 77.75 ± 0.3 |
| | token-single | 78.15 ± 0.3 | 77.15 ± 0.6 |
Results on MD Tasks.
While the Hybrid pipeline achieves superior performance on NER, it also improves the state of the art on other tasks in the pipeline. Table 5 shows the Seg+POS results of our Hybrid pipeline scenario, compared with the Standard pipeline, which replicates the pipeline of More et al. (2019); we use the metrics defined by More et al. (2019). We show substantial improvements for the Hybrid pipeline over the results of More et al. (2019), also outperforming the test results of Seker and Tsarfaty (2020).
Comparison with Prior Art.
Table 6 presents our results on the Hebrew NER corpus of Ben-Mordecai (2005) compared to their model, which uses a hand-crafted feature-engineered MEMM with regular-expression rule-based enhancements and an entity lexicon. Like Ben-Mordecai (2005), we performed three 75%/25% random train/test splits, and used the same seven NE categories (Per, Loc, Org, Time, Date, Percent, Money). We trained a token-single model on the original space-delimited tokens and a morpheme model on automatically segmented morphemes we obtained using our best segmentation model (Hybrid MD on our trained token-multi model, as in Table 5). Since their annotation includes only token-level boundaries, all of the results we report conform with token-level evaluation.
Table 5: Seg+POS results of the Standard and Hybrid pipelines, compared with prior art.

| | | Seg+POS |
|---|---|---|
| dev | Standard (More et al., 2019) | 92.36 |
| | Ptr-Network (Seker and Tsarfaty, 2020) | 93.90 |
| | Hybrid (this work) | 93.12 |
| test | Standard (More et al., 2019) | 89.08 |
| | Ptr-Network (Seker and Tsarfaty, 2020) | 90.49 |
| | Hybrid (this work) | 90.89 |
Table 6: Token-level results on the Hebrew NER corpus of Ben-Mordecai (2005).

| | Precision | Recall | F1 |
|---|---|---|---|
| Ben-Mordecai (2005) MEMM+HMM+REGEX | 84.54 | 74.31 | 79.10 |
| This work: token-single+FT+CharLSTM | 86.84 ± 0.5 | 82.6 ± 0.9 | 84.71 ± 0.5 |
| This work: morph-hybrid+FT+CharLSTM | 86.93 ± 0.6 | 83.59 ± 0.8 | 85.22 ± 0.5 |
As Table 6 shows, both models significantly outperform the previous state-of-the-art of Ben-Mordecai (2005), setting a new performance bar on this earlier benchmark. Moreover, we again observe an empirical advantage for explicitly modeling morphemes, even with the noisy automatic segmentation used for morpheme-based training.
7 Discussion: Joint Modeling Alternatives and Future Work
The present study provides the motivation and the necessary foundations for comparing morpheme-based and token-based modeling for NER. While our findings clearly demonstrate the advantages of morpheme-based modeling for NER in a morphologically rich language, it is clear that our proposed Hybrid architecture is not the only modeling alternative for linking NER and morphology.
For example, a previous study by Güngör et al. (2018) addresses joint neural modeling of morphological segmentation and NER labeling, proposing a multi-task learning approach for joint MD and NER in Turkish. They employ separate Bi-LSTM networks for the MD and NER tasks, with a shared loss to allow for joint learning. Their results indicate improved NER performance, with no improvement in the MD results. Contrary to our proposal, they view MD and NER as distinct tasks, assuming a single NER label per token, and not providing disambiguated morpheme-level boundaries for the NER task. More generally, they test only token-based NER labeling and do not attend to the question of input/output granularity in their models.
A different approach for joint NER and morphology is jointly predicting the segmentation and labels for each token in the input stream. This is the approach taken, for instance, by the lattice-based Pointer-Network of Seker and Tsarfaty (2020). As shown in Table 5, their results for morphological segmentation and POS tagging are on a par with our reported results and, at least in principle, it should be possible to extend the Seker and Tsarfaty (2020) approach to yield also NER predictions.
However, our preliminary experiments with a lattice-based Pointer-Network for token segmentation and NER labeling show that this is not a straightforward task. Contrary to POS tags, which are constrained by the MA, every NER label can potentially go with any segment, and this leads to a combinatorial explosion of the search space represented by the lattice. As a result, the NER predictions are hard to learn reliably, and the complexity of the resulting model is computationally prohibitive.
A different approach to joint sequence segmentation and labeling is to apply the neural model directly to the character sequence of the input stream. One such approach is the char-based labeling-as-segmentation setup proposed by Shao et al. (2017), who use a character-based Bi-RNN-CRF to output a single label per character, indicating both the word boundary (using BIES sequence labels) and the POS tag. This method is also used in their universal segmentation work (Shao et al., 2018). However, as seen in the results of Shao et al. (2018), char-based labeling for segmenting Semitic languages lags far behind all other languages, precisely because morphological boundaries are not explicit in the character sequences.
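For illustration, such a char-level encoding of the two-morpheme token הבית (ha+bayit, "the house") would look as follows (glossing and tag inventory hypothetical):

```python
# One fused boundary+POS label per character, in the style of
# Shao et al. (2017); the single-character article gets an S tag.
chars  = ["ה", "ב", "י", "ת"]
labels = ["S-DET", "B-NOUN", "I-NOUN", "E-NOUN"]
# The morpheme boundary is explicit in the labels but not in the character
# stream itself; worse, suppressed morphemes (like the ה of example (2))
# have no character to label at all.
```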
Additional proposals are those of Kong et al. (2015) and Kemos et al. (2019). Kong et al. (2015) proposed to solve Chinese segmentation and POS tagging using dynamic programming over neural encodings: a Bi-LSTM encodes the character input, which is then fed to a semi-Markov CRF to obtain probabilities for the different segmentation options. Kemos et al. (2019) propose a similar approach for joint segmentation and tagging, but add convolution layers on top of the Bi-LSTM encodings to obtain segment features hierarchically, and feed these to the semi-Markov CRF.
Preliminary experiments we conducted confirm that char-based joint segmentation and NER labeling for Hebrew, whether using char-based labeling or a seq2seq architecture, still lags behind our reported results. We conjecture that this is due to the complex morpho-phonological and orthographic processes in Semitic languages. Going into char-based modeling nuances and offering a sound joint solution for a language like Hebrew is an important matter that merits its own investigation. Such work is feasible now, given the new corpus; however, it is beyond the scope of the current study.
All in all, the design of sophisticated joint modeling strategies for morpheme-based NER poses fascinating questions—for which our work provides a solid foundation (data, protocols, metrics, strong baselines). More work is needed for investigating joint modeling of NER and morphology, in the directions portrayed in this section, yet it is beyond the scope of this paper, and we leave this investigation for future work.
Finally, while the joint approach is appealing, we argue that the elegance of our Hybrid solution lies precisely in providing a clear and well-defined interface between MD and NER through which the two tasks can interact, while still keeping the distinct models simple, robust, and efficiently trainable. It also has the advantage of allowing us to seamlessly integrate sequence labeling with any lattice-based MA, in a plug-and-play, language-agnostic fashion, towards obtaining further advantages on both of these tasks.
8 Conclusion
This work addresses the modeling challenges of neural NER in MRLs. We deliver a parallel token-vs.-morpheme NER corpus for Modern Hebrew, which allows one to assess NER modeling strategies in morphologically rich-and-ambiguous environments. Our experiments show that while NER benefits from morphological decomposition, downstream results are sensitive to segmentation errors. We thus propose a Hybrid architecture in which NER precedes and prunes the morphological decomposition. This approach greatly outperforms a Standard pipeline in realistic (non-gold) scenarios. Our analysis further shows that morpheme-based models better recognize OOVs that result from morphological composition. All in all, we deliver new state-of-the-art results for Hebrew NER and MD, along with a novel benchmark, to encourage further investigation into the interaction between NER and morphology.
Acknowledgments
We are grateful to the BIU-NLP lab members as well as 6 anonymous reviewers for their insightful remarks. We further thank Daphna Amit and Zef Segal for their meticulous annotation and profound discussions. This research is funded by an ISF Individual Grant (1739/26) and an ERC Starting Grant (677352), for which we are grateful.
A Alignment Heuristics
Aligning Multi-labels to Single Labels. In order to evaluate morpheme-based labels (morph or token-multi) in token-based settings, we introduce a deterministic procedure to extend the morphological labels to token boundaries. Specifically, we use regular expressions to map the multiple sequence labels to a single label by choosing the first non-O entity category (BIES) as the single category. In case the sequence of labels is not valid (e.g., B comes after E, or there is an O between two I labels), we use a relaxed mapping that does not take the order of the labels into consideration: if there is an S, or both B and E, in the sequence, return S. Otherwise, if there is an E, return E; if there is a B, return B; if there is an I, return I (Figure 11).
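A minimal sketch of this mapping (helper name is ours; for brevity, the sketch applies the relaxed, order-free rules throughout, whereas the procedure above first checks sequence validity):

```python
def to_single_label(labels):
    """Collapse a token's per-morpheme BIOSE labels into one token-level
    label, taking the entity category of the first non-O label."""
    non_o = [l for l in labels if l != "O"]
    if not non_o:
        return "O"
    etype = non_o[0].split("-", 1)[1]
    prefixes = {l.split("-", 1)[0] for l in non_o}
    if "S" in prefixes or {"B", "E"} <= prefixes:
        return f"S-{etype}"
    for p in ("E", "B", "I"):  # relaxed priority: E, then B, then I
        if p in prefixes:
            return f"{p}-{etype}"

print(to_single_label(["O", "B-Org", "E-Org"]))  # -> "S-Org"
print(to_single_label(["I-Per", "O"]))           # -> "I-Per"
```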
Aligning Multi-labels to Morphemes. In order to obtain morpheme boundary labels from token-multi, we introduce a deterministic procedure to align the token's predicted multi-label with the list of morphemes predicted for it by the MD. Specifically, we align the multi-labels to morphemes in the order that they are both provided. In case of a mismatch between the number of labels and morphemes predicted for the token, we match label-morpheme pairs from the final one backwards. If the number of morphemes exceeds the number of labels, we pad unpaired morphemes with O labels. If the number of labels exceeds the morphemes, we drop unmatched labels (Figure 12).
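A minimal sketch of this alignment (helper name is ours):

```python
def align_labels(morphemes, labels):
    """Pair each predicted morpheme with a label, matching from the final
    pair backwards: pad extra morphemes with O, drop extra labels."""
    n, m = len(morphemes), len(labels)
    if m < n:
        labels = ["O"] * (n - m) + labels  # pad unpaired leading morphemes
    elif m > n:
        labels = labels[m - n:]            # drop unmatched leading labels
    return list(zip(morphemes, labels))

print(align_labels(["ל", "בני"], ["B-Per"]))   # [('ל','O'), ('בני','B-Per')]
print(align_labels(["לבני"], ["O", "B-Per"]))  # [('לבני','B-Per')]
```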
Notes
Data & code: https://github.com/OnlpLab/NEMO.
Glossing conventions are in accord with the Leipzig Glossing Rules (Comrie et al., 2008).
We use the term morphological segmentation (or segmentation) to refer to splitting raw tokens into morphological segments, each carrying a single Part-Of-Speech tag. That is, we segment away prepositions, determiners, subordination markers and multiple kinds of pronominal clitics, that attach to their hosts via complex morpho-phonological processes. Throughout this work, we use the terms morphological segment, morpheme, or segment interchangeably.
This ambiguity gets magnified by the fact that Semitic languages that use abjads, like Hebrew and Arabic, lack capitalization altogether and suppress all vowels (diacritics).
We can do this for any sequence labeling task in MRLs.
A single NE is always continuous. Token-morpheme discrepancies do not lead to discontinuous NEs.
Entity categories are listed in Table 2. We dropped the NORP category, since it introduced complexity concerning the distinction between adjectives and group names. Law did not appear in our corpus.
Nested labels are not modeled in this paper, but they are published with the corpus, to allow for further research.
The complete annotation guide is publicly available at https://github.com/OnlpLab/NEMO-Corpus.
A, B, and C annotations are published to enable research on learning with disagreements (Plank et al., 2014).
The corpus is available at https://github.com/OnlpLab/NEMO-Corpus.
Using the NCRF++ suite of Yang and Zhang (2018).
Embeddings and Wikipedia corpus are also available at: https://github.com/OnlpLab/NEMO.
A few interesting empirical observations diverging from those of Reimers and Gurevych (2017) and Yang et al. (2018) are worth mentioning. We found that a lower learning rate than the one recommended by Yang et al. (2018) (0.015) led to better results and fewer occurrences of divergence. We further found that raising the number of epochs from 100 to 200 did not result in over-fitting, and significantly improved NER results. For evaluation, we used the weights from the best epoch.
In the morpheme case we might encounter “illegal” label sequences in case of a prediction error. We employ similar linguistically informed heuristics to recover from that (see Appendix A).
In case of a misalignment (in the number of morphemes and labels) we match the label-morpheme pairs from the final one backwards, and pad unpaired morphemes with O labels.
For other languages this may be done using models for canonical segmentation, as in Kann et al. (2016).
We do not re-train the morpheme models with predicted segmentation, which might achieve better performance (e.g., jackknifing). We leave this for future work.
This section focuses on token-level evaluation, which is a permissive evaluation metric, allowing us to compare the models on a more level playing field, where all models (including token-single) have an equal opportunity to perform.
For brevity we only show char LSTM (vs. no char representation), there was no significant difference with CNN.