Neural Modeling for Named Entities and Morphology (NEMO2)

Abstract

Named Entity Recognition (NER) is a fundamental NLP task, commonly formulated as classification over a sequence of tokens. Morphologically rich languages (MRLs) pose a challenge to this basic formulation, as the boundaries of named entities do not necessarily coincide with token boundaries; rather, they respect morphological boundaries. To address NER in MRLs we then need to answer two fundamental questions: what are the basic units to be labeled, and how can these units be detected and classified in realistic settings (i.e., where no gold morphology is available)? We empirically investigate these questions on a novel NER benchmark, with parallel token-level and morpheme-level NER annotations, which we develop for Modern Hebrew, a morphologically rich and ambiguous language. Our results show that explicitly modeling morphological boundaries leads to improved NER performance, and that a novel hybrid architecture, in which NER precedes and prunes morphological decomposition, greatly outperforms the standard pipeline, where morphological decomposition strictly precedes NER, setting a new performance bar for both Hebrew NER and Hebrew morphological decomposition.


Introduction
Named Entity Recognition (NER) is a fundamental task in the area of Information Extraction (IE), in which mentions of Named Entities (NEs) are extracted and classified in naturally-occurring texts. This task is most commonly formulated as a sequence labeling task, where extraction takes the form of assigning each input token a label that marks the boundaries of the NE (e.g., B, I, O), and classification takes the form of assigning labels that indicate the entity type (PER, ORG, LOC, etc.). Despite the common impression brought about by recent neural models on the main English NER benchmarks, CoNLL 2003 (Tjong Kim Sang, 2003) and OntoNotes (Weischedel et al., 2013), the NER task in real-world settings is far from solved. Specifically, NER performance is shown to greatly diminish when moving to other domains (Luan et al., 2018; Song et al., 2018), when addressing the long tail of rare, unseen, and new user-generated entities (Derczynski et al., 2017), and when handling languages with fundamentally different structure than English. In particular, there is no readily available and empirically verified neural modeling strategy for NER in languages with complex word-internal structure, also known as morphologically rich languages.
Morphologically rich languages (MRLs) (Tsarfaty et al., 2010; Seddah et al., 2013; Tsarfaty et al., 2020) are languages in which substantial information concerning the arrangement of words into phrases and the relations between them is expressed at word level, rather than in a fixed word order or a rigid structure. The extended amount of information expressed at word level, and the morpho-phonological processes creating these words, result in high token-internal complexity, which poses serious challenges to the basic formulation of NER as classification of raw, space-delimited tokens. Specifically, while NER in English is formulated as sequence labeling of space-delimited tokens, in MRLs a single token may include multiple meaning-bearing units, henceforth morphemes, only some of which are relevant to the entity mention at hand.
In this paper we formulate two questions concerning neural modeling strategies for NER in MRLs, namely: (i) what should be the granularity of the units to be labeled: space-delimited tokens or finer-grained morphological segments? and (ii) how can we effectively encode, and accurately detect, the morphological segments that are relevant to NER, specifically in realistic settings, where gold morphological boundaries are not available?
To empirically investigate these questions we develop a novel parallel benchmark, containing parallel token-level and morpheme-level NER annotations for texts in Modern Hebrew -a morphologically rich and morphologically ambiguous language, which is known to be notoriously hard to parse (More et al., 2019;Tsarfaty et al., 2019).
Our results show that morpheme-based NER is superior to token-based NER, which encourages a segmentation-first pipeline. At the same time, we demonstrate that token-based NER improves morphological segmentation in realistic scenarios, encouraging a NER-first pipeline. While these two findings may appear contradictory, we resolve the tension with a hybrid architecture, in which token-based NER predictions precede and prune the space of morphological decomposition options, while the actual morpheme-based NER takes place only after the morphological decomposition. We empirically show that the hybrid architecture we propose outperforms all token-based and morpheme-based model variants of Hebrew NER on our benchmark, and that it further outperforms all previously reported results on Hebrew NER and morphological decomposition. Our error analysis further demonstrates that morpheme-based models generalize better, that is, they contribute to recognizing the long tail of entities unseen during training (out-of-vocabulary, OOV), in particular those unseen entities that turn out to be composed of previously seen morphemes.
The contribution of this paper is thus manifold. First, we define key architectural questions for neural NER modeling in MRLs and chart the space of modeling options. Second, we deliver a novel parallel benchmark that allows one to empirically compare and contrast the morpheme vs. token modeling strategies. Third, we show consistent advantages for morpheme-based NER, demonstrating the importance of morphologically-aware modeling. Finally, we present a novel hybrid architecture which demonstrates even further improved performance on both the NER and morphological decomposition tasks. Our results for Hebrew set a new bar on these tasks, outperforming the reported state-of-the-art results on various benchmarks.

The segmentation of tokens and the identification of adequate NE boundaries is, however, far from trivial, due to complex morpho-phonological and orthographic processes in some MRLs (Vania et al., 2018; Klein and Tsarfaty, 2020). This means that the morphemes that compose NEs are not necessarily transparent in the character sequence of the raw tokens. Consider for example phrase (2):

(2) המרוץ לבית הלבן
hamerotz labayit halavan
the-race to-house.DEF the-white
'the race to the White House'

Here, the full form of the NE הבית הלבן / habayit halavan (the White House) is not present in the utterance; only the sub-string בית הלבן / bayit halavan ((the) White House) is present in (2), due to phonetic and orthographic processes suppressing the definite article ה/ha in certain environments. In this and many other cases, not only do NE boundaries fail to coincide with token boundaries, they do not coincide with characters or sub-strings of the token either. This calls for accessing the more basic meaning-bearing units of the token, that is, for decomposing the tokens into morphemes.
Unfortunately, though, the morphological decomposition of surface tokens may be very challenging due to extreme morphological ambiguity. The sequence of morphemes composing a token is not always directly recoverable from its character sequence, and is not known in advance.[4] This means that for every raw space-delimited token there are many conceivable readings which impose different segmentations, yielding different sets of potential NE boundaries. Consider for example the token לבני (lbny) in different contexts:

(3) (a) השרה לבני
hasara livni

In (3a) the token לבני is completely consumed as a labeled NE. In (3b) לבני is only partly consumed by an NE, and in (3c) and (3d) the token is entirely outside of an NE context. In (3c) the token is composed of several morphemes, and in (3d) it consists of a single morpheme. These are only some of the possible decompositions of this surface token; other alternatives may still be available. As shown by Goldberg and Tsarfaty (2008); Green and Manning (2010); Seeker and Çetinoglu (2015); Habash and Rambow (2005); More et al.
(2019), and others, the correct morphological decomposition becomes apparent only in the larger (syntactic or semantic) context. The challenge, in a nutshell, is as follows: in order to detect NE boundaries accurately, we need to segment the raw tokens first; however, in order to segment tokens correctly, we need to know the greater semantic content, including, e.g., the participating entities. How can we break out of this apparent loop?

Finally, MRLs are often characterized by an extremely sparse lexicon, consisting of a long tail of out-of-vocabulary (OOV) entities unseen during training (Czarnowska et al., 2019). Even in cases where all morphemes are present in the training data, morphological composition of seen morphemes may yield tokens and entities which were unseen during training. Take for example the utterance in (4), which the reader may inspect as familiar:

(4) טסנו מסין לתאילנד
tasnu misin lethailand
flew.1PL from-China to-Thailand
'we flew from China to Thailand'

Example (4) is in fact example (1) with a switched flight direction. This subtle change creates two new surface tokens, מסין and לתאילנד, which might not have been seen during training, even if example (1) had been observed. Morphological composition of an entity with prepositions, conjunctions, definite markers, possessive clitics and more causes mentions of seen entities to surface in unfamiliar forms, which often fail to be accurately detected and analyzed.

Given the aforementioned complexities, in order to solve NER for MRLs we ought to answer the following fundamental modeling questions:

Q1. Units: What are the discrete units upon which we need to set NE boundaries in MRLs? Are they tokens? Characters? Morphemes? A representation containing multiple levels of granularity?

Q2. Architecture: When employing morphemes in NER, the classical approach is "segmentation-first". However, segmentation errors are detrimental, and downstream NER cannot recover from them. How is it best to set up the pipeline so that segmentation and NER can interact?

Q3. Generalization: How do the different modeling choices affect NER generalization in MRLs? How can we address the long tail of OOV NEs in MRLs? Which modeling strategy best handles pseudo-OOV entities that result from a previously unseen composition of already seen morphemes?
To answer the aforementioned questions, we chart and formalize the space of modeling options for neural NER in MRLs. We cast NER as a sequence labeling task and formalize it as f : X → Y, where x ∈ X is a sequence x_1, ..., x_n of n discrete strings from some vocabulary (x_i ∈ Σ), and y ∈ Y is a sequence y_1, ..., y_n of the same length, where y_i ∈ Labels, and Labels is a finite set of labels composed of the BIOSE tags (a.k.a. BIOLU, as described in Ratinov and Roth (2009)). Every non-O label is also enriched with an entity type label. Our list of types is presented in Table 2.
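The BIOSE scheme described above can be made concrete with a minimal sketch (not the paper's code) that decodes a BIOSE label sequence into typed entity spans:

```python
# A minimal sketch of the BIOSE (a.k.a. BIOLU) labeling scheme:
# decoding a label sequence into typed entity spans.

def biose_to_spans(labels):
    """Decode BIOSE labels (e.g. 'B-PER', 'I-PER', 'E-PER', 'S-LOC', 'O')
    into (start, end, type) spans, with end exclusive."""
    spans, start, etype = [], None, None
    for i, lab in enumerate(labels):
        if lab == "O":
            start, etype = None, None
            continue
        tag, _, typ = lab.partition("-")
        if tag == "S":                          # single-unit entity
            spans.append((i, i + 1, typ))
            start, etype = None, None
        elif tag == "B":                        # entity begins
            start, etype = i, typ
        elif tag == "E" and start is not None and typ == etype:
            spans.append((start, i + 1, typ))   # entity ends
            start, etype = None, None
        # 'I' simply continues the currently open span
    return spans

print(biose_to_spans(["O", "B-PER", "E-PER", "S-LOC", "O"]))
# → [(1, 3, 'PER'), (3, 4, 'LOC')]
```

The same decoder applies whether the labeled units are tokens or morphemes; only the interpretation of the indices changes.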

Token-Based or Morpheme-Based?
Our first modeling question concerns the discrete units upon which to set the NE boundaries.That is, what is the formal definition of the input vocabulary Σ for the sequence labeling task?
The simplest scenario, adopted in most NER studies, assumes token-based input, where each token admits a single label; hence token-single:

NER_token-single : W → L

Here, W = {w* | w ∈ Σ} is the set of all possible token sequences in the language, and L = {l* | l ∈ Labels} is the set of all possible label sequences over the label set defined above. Each token is assigned a single label, so the input and output sequences are of the same length. The drawback of this scenario is that, since the input for token-single incorporates no morphological boundaries, the exact boundaries of the NEs remain underspecified. This case is exemplified in the top row of Table 1.
There is another conceivable scenario, where the input is again the sequence of space-delimited tokens, and the output consists of complex labels (henceforth multi-labels) reflecting, for each token, the labels of its constituent morphemes; henceforth, a token-multi scenario:

NER_token-multi : W → L*

Here, W = {w* | w ∈ Σ} is the set of token sequences as in token-single. Each token is assigned a multi-label, i.e., a sequence (l* ∈ L) which indicates the labels of the token's morphemes in order, and the output is a sequence of such multi-labels, one per token. This variant incorporates morphological information concerning the number and order of labeled morphemes, but lacks the precise morphological boundaries. This is illustrated in the middle row of Table 1. A downstream application may require (possibly noisy) heuristics to determine the precise NE boundaries of each individual label in the multi-label for an input token.
Another possible scenario is the morpheme-based scenario, assigning a label l ∈ Labels to each morphological segment:

NER_morpheme : M → L

Here, M = {m* | m ∈ Morphemes} is the set of sequences of morphological segments in the language, and L = {l* | l ∈ Labels} is the set of label sequences as defined above. The upshot of this scenario is that NE boundaries are precise. An example is given in the bottom row of Table 1. However, since each token may contain multiple meaningful morphological segments, the length of the morpheme sequence to be labeled differs from the length of the token sequence, and the model assumes prior morphological segmentation, which in realistic scenarios is not necessarily available.
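The three granularities can be contrasted on a toy, transliterated Hebrew-like phrase (the segmentation, labels, and GPE type here are purely illustrative, not taken from the corpus):

```python
# A toy sketch of the three input/output granularities.
tokens = ["tasnu", "mitailand", "lesin"]   # "we-flew from-Thailand to-China"

# token-single: one BIOSE label per space-delimited token;
# NE boundaries inside a token stay underspecified.
token_single = ["O", "S-GPE", "S-GPE"]

# token-multi: one multi-label per token, concatenating the labels of
# its morphemes in order (still no explicit segment boundaries).
token_multi = ["O", "O+S-GPE", "O+S-GPE"]

# morpheme: the input itself is the morpheme sequence, so labels
# align exactly with morphological boundaries.
morphemes = ["tasnu", "mi", "tailand", "le", "sin"]
morpheme_labels = ["O", "O", "S-GPE", "O", "S-GPE"]

# token-based outputs match the token sequence length;
# morpheme-based outputs match the (longer) morpheme sequence length.
assert len(tokens) == len(token_single) == len(token_multi)
assert len(morphemes) == len(morpheme_labels)
```

The length mismatch visible in the last two lines is exactly why the morpheme scenario presupposes a segmentation step.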

Realistic Morphological Decomposition
A major caveat of morpheme-based modeling strategies is that they often assume an idealized scenario with gold morphological decomposition of the space-delimited tokens into morphological segments (cf. Nivre et al. (2007); Pradhan et al. (2012)). In reality, however, gold morphological decomposition is not known in advance; it has to be predicted automatically, and prediction errors may propagate to contaminate the downstream task.
Our second modeling question therefore concerns the interaction between the morphological decomposition and NER tasks: how is it best to set up the pipeline so that the predictions of the two tasks can interact? To answer this, we define morphological decomposition as consisting of two subtasks: morphological analysis (MA) and morphological disambiguation (MD). We view sentence-based MA as:

MA : W → P(M)

Here, W = {w* | w ∈ Σ} is the set of possible token sequences as before, M = {m* | m ∈ Morphemes} is the set of possible morpheme sequences, and P(M) is the set of subsets of M.
The role of MA is then to assign a token sequence w ∈ W all of its possible morphological decomposition options. We represent this set of alternatives in a dense structure that we call a lattice (exemplified in Figure 1). MD is the task of picking the single correct morphological path M ∈ M through the MA lattice of a given sentence:

MD : P(M) → M

Now, assume x ∈ W is a surface sentence in the language, with its morphological decomposition initially unknown and underspecified. In the Standard pipeline, MA strictly precedes MD:

MD_Standard(x) = MD(MA(x))

The main problem here is that MD errors may propagate to contaminate the NER output.
We propose a novel Hybrid alternative, in which we inject a task-specific signal, in this case NER,[5] to constrain the search for M through the lattice:

MD_Hybrid(x) = MD(MA(x)|NER(x))

Here, the restriction MA(x)|NER(x) indicates pruning the lattice structure MA(x) to contain only MD options that are compatible with the token-based NER predictions; only then is MD applied to the pruned lattice.
Both MD_Standard and MD_Hybrid are disambiguation architectures that result in a morpheme sequence M ∈ M; the latter benefits from the NER signal, while the former does not. The sequence M ∈ M can then be used in one of two ways. We can use M as input to a morpheme model that outputs morpheme labels. Or, we can rely on the output of the token-multi model and align each token's multi-label with the segments in M.
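The pruning step at the heart of the Hybrid architecture can be sketched as follows. This is a simplified illustration with hypothetical data structures (a per-token list of candidate segmentations rather than YAP's actual lattice format), showing the compatibility criterion used later in the paper: a decomposition survives only if its morpheme count matches the number of labels in the token-multi prediction.

```python
# A minimal sketch of Hybrid lattice pruning: keep only token
# decompositions whose morpheme count matches the number of labels
# predicted by the token-multi NER model; MD then runs on the
# pruned lattice. (Data structures here are illustrative, not YAP's.)

def prune_lattice(lattice, multi_labels):
    """lattice: per token, a list of candidate segmentations (morpheme lists).
    multi_labels: per token, the predicted multi-label, e.g. 'O+S-PER'."""
    pruned = []
    for candidates, multi in zip(lattice, multi_labels):
        n = len(multi.split("+"))            # labels predicted for this token
        keep = [seg for seg in candidates if len(seg) == n]
        pruned.append(keep or candidates)    # back off if pruning empties it
    return pruned

# Toy ambiguous token with three candidate decompositions:
lattice = [[["lbny"], ["l", "bny"], ["l", "bn", "y"]]]
print(prune_lattice(lattice, ["O+S-PER"]))
# → [[['l', 'bny']]]
```

Since the multi-label predicts two labels for the token, only the two-morpheme decomposition survives, and the downstream MD search space shrinks accordingly.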
In what follows, we want to empirically assess the effect of the different modeling choices (token-single, token-multi, morpheme) and disambiguation architectures (Standard, Hybrid) on the performance of NER in MRLs. To this end, we need a corpus that allows training and evaluating NER at both token-level and morpheme-level granularity.

The Data: A Novel NER Corpus
This work empirically investigates NER modeling strategies in Hebrew, a Semitic language known for its complex and highly ambiguous morphology. Ben-Mordecai (2005), the only previous work on Hebrew NER to date, annotated space-delimited tokens, basing their guidelines on the CoNLL 2003 shared task (Chinchor et al., 1999).
In agglutinative languages such as Turkish, token segmentation is always performed before NER (Tür et al., 2003; Küçük and Can, 2019), reinforcing the need to contrast the token-based scenario, widely adopted for Semitic languages, with the morpheme-based scenarios adopted in other MRLs.
Our first contribution is thus a parallel corpus for Hebrew NER: one version consists of gold-labeled tokens and the other of gold-labeled morphemes, for the same text. For this, we performed gold NE annotation of the Hebrew Treebank (Sima'an et al., 2001), based on the 6,143 morpho-syntactically analyzed sentences of the HAARETZ corpus, to create both token-level and morpheme-level variants, as illustrated in the topmost and lowest rows of Table 1, respectively.
Annotation Scheme: We started off with the guidelines of Ben-Mordecai (2005), from which we deviate in three main ways. First, we label NE boundaries and their types on sequences of morphemes, in addition to the space-delimited token annotations.[6] Secondly, we use the finer-grained entity categories list of ACE (LDC, 2008).[7] Finally, we allow nested entity mentions, as in Finkel and Manning (2009); Benikova et al. (2014).[8]

Annotation Cycle: As Fort et al. (2009) put it, examples and rules will never cover all possible cases, because of the specificity of natural language and the ambiguity of formulation. To address this we employed the cyclic approach of agile annotation offered by Alex et al. (2010). Every cycle consisted of: annotation, evaluation and curation, then clarification and refinements. We used WebAnno (Yimam et al., 2013) as our annotation interface.
The Initial Annotation Cycle was a two-stage pilot with 12 participants, divided into 2 teams of 6. The teams received the same guidelines, with the exception of the specification of entity boundaries: one team was guided to annotate the minimal string that designates the entity, the other to tag the maximal string which can still be considered the entity. Our agreement analysis showed that the minimal guideline generally led to more consistent annotations. Based on this result (as well as low-level refinements) from the pilot, we devised the full version of the guidelines.[9]

Annotation, Evaluation and Curation: Every annotation cycle was performed by two annotators (A, B) and an annotation manager/curator (C). We annotated the full corpus in 7 cycles. We evaluated the annotation in two ways: manual curation and automatic evaluation. After each annotation step, the curator manually reviewed every sentence in which disagreements arose, as well as specific points of difficulty pointed out by the annotators. The inter-annotator agreement metric described below was also used to quantitatively gauge the progress and quality of the annotation.

[6] A single NE is always continuous. Token-morpheme discrepancies do not lead to discontinuous NEs.
[7] Entity categories are listed in Table 2. We dropped the NORP category, since it introduced complexity concerning the distinction between adjectives and group names. LAW did not appear in our corpus.
[8] Nested labels are not modeled in this paper, but they are published with the corpus, to allow for further research.

[9] The complete annotation guide is publicly available at

Clarifications and Refinements: At the end of each cycle we held a clarification talk between A, B and C, in which issues that came up during the cycle were discussed. Following that talk we refined the guidelines and updated the annotators, who went on to the next cycle. At the end we performed a final curation run to make sentences from earlier cycles comply with later refinements.[10]

Inter-Annotator Agreement (IAA): IAA is commonly measured using the κ-statistic. However, Pyysalo et al. (2007) show that it is not suitable for evaluating inter-annotator agreement in NER. Instead, an F1 metric on entity mentions has in recent years been adopted for this purpose (Zhang, 2013). This metric allows for computing pair-wise IAA using the standard F1 score by treating one annotator as gold and the other as the prediction.
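The pair-wise IAA computation is straightforward; a minimal sketch (with hypothetical mention tuples, not the corpus format) of mention-level F1 between two annotators:

```python
# A sketch of pair-wise IAA as mention-level F1: treat annotator A as
# gold and annotator B as the prediction, and score exact matches of
# (start, end, type) mentions.

def mention_f1(gold, pred):
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                    # exact span+type matches
    p = tp / len(pred) if pred else 0.0      # precision
    r = tp / len(gold) if gold else 0.0      # recall
    return 2 * p * r / (p + r) if p + r else 0.0

a = {(1, 3, "PER"), (5, 6, "LOC")}   # annotator A's mentions
b = {(1, 3, "PER"), (5, 7, "LOC")}   # annotator B's mentions (one boundary differs)
print(round(mention_f1(a, b), 2))
# → 0.5
```

Averaging this score over the shared sentences of a cycle gives the agreement number used to gauge annotation quality.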
Annotation Costs: The annotation took on average about 35 seconds per sentence, and thus a total of roughly 60 hours for all sentences in the corpus, per annotator. Six clarification talks were held between the cycles, each lasting from thirty minutes to an hour, giving a total of about 130 work hours of expert annotators.[11]

Experimental Settings
Goal: We set out to empirically evaluate the representation alternatives for the input/output sequences (token-single, token-multi, morpheme) and the effect of the different architectures (Standard, Hybrid) on the performance of NER for Hebrew.

Modeling Variants: All experiments use the corpus we just described and employ a standard BiLSTM-CRF architecture for the neural sequence labeling task (Huang et al., 2015). Our basic architecture[12] is composed of an embedding layer for the input and a 2-layer BiLSTM followed by a CRF inference layer, for which we test three modeling variants.
Figures 2-3 present the variants we employ. Figure 2 shows the token-based variants, token-single and token-multi. The former outputs a single BIOSE label per token, and the latter outputs a multi-label per token: a concatenation of the BIOSE labels of the morphemes composing the token. Figure 3 shows the morpheme-based variant. For character-level representation we experiment with CharLSTM, CharCNN, or NoChar, i.e., no character embedding at all.
We pre-trained all token-based and morpheme-based embeddings on the Hebrew Wikipedia dump of Goldberg (2014). For morpheme-based embeddings, we decompose the input using More et al. (2019) and use the morphological segments as the embedding units.[13] We compare GloVe (Pennington et al., 2014) and fastText (Bojanowski et al., 2017). We hypothesize that since fastText uses sub-string information, it will be more useful for analyzing OOVs.
Hyper-parameters: Following Reimers and Gurevych (2017); Yang et al. (2018), we performed hyper-parameter tuning for each of our model variants. Tuning was done on the dev set, in a number of rounds of random search, independently for every input/output and char-embedding architecture. Table 3 shows our selected hyper-parameters.[14] (Yang and Zhang, 2018).

The character window size was not treated as a hyper-parameter in Reimers and Gurevych (2017) or Yang et al. (2018). However, given the token-internal complexity in MRLs, we conjectured that the window size over characters might have a crucial effect. In our experiments we found that a larger window (7) increased performance. For MRLs, further research into this hyper-parameter may be of interest.
Evaluation: Standard NER studies typically invoke the CoNLL evaluation script, which anchors NEs at token positions (Tjong Kim Sang, 2003). However, it is inadequate for our purposes, because we want to compare entities across token-based vs. morpheme-based settings. To this end, we use a revised evaluation procedure which anchors each entity in its form rather than its index. Specifically, we report F1 scores on strict, exact-match of the surface forms of the entity mentions, i.e., the gold and predicted NE spans must match exactly in form, boundaries, and entity type. In all experiments, we report both token-level and morpheme-level F-scores, for all models.
• Token-Level evaluation: For the sake of backwards compatibility with previous work on Hebrew NER, we first define token-level evaluation. For token-single this is a straightforward F1 calculation against gold spans. For token-multi and morpheme, we need to map each token's predicted label sequence to a single label, and we do so using linguistically-informed rules we devise (as elaborated in Appendix A).[15]

• Morpheme-Level evaluation: Our ultimate goal is to obtain precise boundaries of the NEs. Thus, our main metric evaluates NEs against the gold morphological boundaries. For morpheme and token-single models, this is a straightforward F1 calculation against gold spans; note that for token-single we are expected to pay a price for boundary mismatches. For token-multi, we know the number and order of labels, so we align the labels in each token's multi-label with the morphemes in its morphological decomposition.[16]

For all experiments and metrics, we report the mean and a 0.95 confidence interval over ten runs.

[15] In the morpheme case we might encounter "illegal" label sequences in case of a prediction error. We employ similar linguistically-informed heuristics to recover from these (see Appendix A).
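The form-anchored metric can be sketched as follows; this is a simplified illustration (the mention tuples and type labels are hypothetical), matching mentions on their exact surface form and type rather than on token indices, which is what makes token-based and morpheme-based outputs directly comparable.

```python
# A sketch of form-anchored strict-match F1: mentions are matched on
# their exact surface form and entity type instead of token positions.
from collections import Counter

def strict_form_f1(gold_mentions, pred_mentions):
    """Mentions are (surface_form, entity_type) pairs; duplicates allowed."""
    gold, pred = Counter(gold_mentions), Counter(pred_mentions)
    tp = sum((gold & pred).values())              # multiset intersection
    p = tp / sum(pred.values()) if pred else 0.0  # precision
    r = tp / sum(gold.values()) if gold else 0.0  # recall
    return 2 * p * r / (p + r) if p + r else 0.0

gold = [("habayit halavan", "FAC"), ("livni", "PER")]
pred = [("bayit halavan", "FAC"), ("livni", "PER")]  # one boundary mismatch
print(strict_form_f1(gold, pred))
# → 0.5
```

A boundary mismatch changes the surface form, so the mention simply fails to match; no index alignment between the two tokenizations is needed.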

Input-Output Scenarios
We experiment with two kinds of input settings: token-based, where the input consists of the sequence of space-delimited tokens, and morpheme-based, where the input consists of morphological segments. For the morpheme input, there are three variants:

(i) Morph-gold: the morphological sequence is produced by an expert (idealistic).
(ii) Morph-standard: the morphological sequence is produced by a standard segmentation-first pipeline (realistic).
(iii) Morph-hybrid: the morphological sequence is produced by the hybrid architecture we propose (realistic).
In the token-multi case we can perform morpheme-based evaluation by aligning the individual labels in the multi-label with the morpheme sequence of the respective token. Again we have three options as to which morphemes to use:

(i) Tok-multi-gold: the multi-label is aligned with morphemes produced by an expert (idealistic).
(ii) Tok-multi-standard: the multi-label is aligned with morphemes produced by a standard pipeline (realistic).
(iii) Tok-multi-hybrid: the multi-label is aligned with morphemes produced by the hybrid architecture we propose (realistic).
Pipeline Scenarios: Assume an input sentence x.
In the Standard pipeline we use YAP,[17] the current state-of-the-art morpho-syntactic parser for Hebrew (More et al., 2019), for the predicted segmentation M = MD(MA(x)). In the Hybrid pipeline, we use YAP to first generate the complete morphological lattices MA(x). Then, to obtain MA(x)|NER(x), we omit lattice paths in which the number of morphemes in a token's decomposition does not conform with the number of labels in the multi-label predicted by NER_token-multi(x). Finally, we apply YAP to the constrained lattice to obtain MD(MA(x)|NER(x)). In predicted morphology scenarios (either Standard or Hybrid), we use the same model weights as trained on the gold segments, but feed predicted morphemes as input.[18]

Results

The Units: Tokens vs. Morphemes
Figure 4 shows the token-level evaluation for the different model variants we defined. We see that morpheme models perform significantly better than the token-single and token-multi variants. Interestingly, explicit modeling of morphemes leads to better NER performance even when evaluated against token-level boundaries. As expected, the performance gaps between variants are smaller with fastText than with embeddings that are unaware of characters (GloVe) or with no pre-training at all. We further pursue this in Section 6.3.
Figure 5 shows the morpheme-level evaluation for the same model variants as in Figure 4. The most obvious trend here is the drop in the performance of the token-single model. This is expected, reflecting the inadequacy of token boundaries for identifying accurate NE boundaries. Interestingly, the morpheme and token-multi models keep a level of performance similar to that of the token-level evaluation, only slightly lower. Their performance gap is also maintained, with morpheme performing better than token-multi. An obvious caveat is that these results are obtained with gold morphology. What happens in realistic scenarios?
The Architecture: Pipeline vs. Hybrid

Figure 6 shows the token-level evaluation results in realistic scenarios. We first observe a significant drop for morpheme models when Standard predicted segmentation is introduced instead of gold. This means that MD errors are indeed detrimental for the downstream task, at a non-negligible rate. Second, we observe that much of this performance gap is recovered with the Hybrid pipeline. It is noteworthy that while morph-hybrid lags behind morph-gold, it is still consistently better than the token-based models, token-single and token-multi.
Figure 7 shows the morpheme-level evaluation results for the same scenarios as in Figure 6. All trends from the token-level evaluation persist, including a drop for all models with predicted segmentation relative to gold, with the hybrid variant recovering much of the gap. Again, morph-gold outperforms token-multi, and morph-hybrid shows great advantages over all tok-multi variants. This performance gap between morph (gold or hybrid) and tok-multi indicates that explicit morphological modeling is indeed crucial for accurate NER.

Morphologically-Aware OOV Evaluation
As discussed in Section 2, morphological composition introduces an extremely sparse word-level "long tail" in MRLs. In order to gauge this phenomenon and its effect on NER performance, we categorize unseen, out-of-training-vocabulary (OOTV) mentions into three categories:

• Lexical: Unknown mentions caused by an unknown token which consists of a single morpheme. This is a strictly lexical unknown, with no morphological composition (most English unknowns are in this category).
• Compositional: Unknown mentions caused by an unknown token which consists of multiple known morphemes.These are unknowns introduced strictly by morphological composition, with no lexical unknowns.
• LexComp: Unknown mentions caused by an unknown token consisting of multiple morphemes, of which (at least) one morpheme was not seen during training.In such cases, both unknown morphological composition and lexical unknowns are involved.
We group NEs based on these categories, and evaluate each group separately.We consider mentions that do not fall into any category as Known.
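The taxonomy above amounts to checking a mention's tokens against two training vocabularies, of whole tokens and of individual morphemes. A minimal sketch (function and example vocabularies are hypothetical) for a single token:

```python
# A sketch of the OOTV taxonomy: classify a token against the training
# vocabularies of whole tokens and of individual morphemes.

def ootv_category(token, morphemes, train_tokens, train_morphemes):
    """Categorize a token given its morphological decomposition."""
    if token in train_tokens:
        return "Known"
    unseen_morph = any(m not in train_morphemes for m in morphemes)
    if len(morphemes) == 1:
        return "Lexical"                 # single morpheme: strictly lexical
    # multiple morphemes: composition is involved; LexComp if at least
    # one of the composed morphemes is itself unseen.
    return "LexComp" if unseen_morph else "Compositional"

train_tokens = {"tasnu", "lesin"}
train_morphemes = {"tasnu", "le", "sin", "tailand"}
print(ootv_category("letailand", ["le", "tailand"], train_tokens, train_morphemes))
# → Compositional
```

The example mirrors the flight-direction switch discussed earlier: both morphemes were seen in training, yet their composition yields a surface token that was not.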
Figure 8 shows the distribution of entity mentions in the dev set by entity type and OOTV category. The OOTV categories that involve composition (Comp and LexComp) are spread across all entity types but one, and for some types they even make up more than half of all mentions.
Figure 9 shows the token-level evaluation[19] with fastText embeddings, grouped by OOTV type. We first observe that unknown NEs that are due to morphological composition (Comp and LexComp) indeed prove the most challenging for all models. We also find that on strictly Compositional OOTV mentions, morpheme-based models exhibit their most significant performance advantage, supporting the hypothesis that explicit morphology helps generalization. We finally observe that token-multi models perform better than token-single models on these NEs (in contrast with the trend for non-compositional NEs). This corroborates the hypothesis that even partial modeling of morphology (as in token-multi, compared to token-single) is better than none, leading to better generalization.

String-level vs. Character-level Embeddings
To further understand the generalization capacity of the different modeling alternatives in MRLs, we probe the interplay of string-based and char-based embeddings in treating OOTV NEs.
Figure 10 presents 12 plots, each showing the level of performance (y-axis) for all models (x-axis). Token-based models are on the left of each x-axis, morpheme-based models on the right. We plot results with and without character embeddings,[20] in orange and blue respectively. The plots are organized in a grid, with the type of NE on the rows (Known, Lex, Comp, LexComp) and the type of pre-training on the columns (no pre-training, GloVe, fastText).
In the top-most row, plotting the accuracy for Known NEs, we see a high level of performance for all pre-training methods, with little difference between the types of pre-training, with or without character embeddings. Moving further down to the row of lexical unseen NEs, char-based representations lead to significant advantages when we assume no pre-training; with GloVe pre-training, performance substantially increases, and with fastText the differences in performance with/without char embeddings almost entirely diminish, indicating that the char-based embeddings are somewhat redundant in this case.
The two lower rows in the large grid show the performance for Comp and LexComp unseen NEs, which are ubiquitous in MRLs. For compositional NEs, pre-training closes only part of the gap between token-based and morpheme-based models. Adding char-based representations indeed helps the token-based models, but crucially does not close the gap with the morpheme-based variants.
Finally, for LexComp NEs in the lowest row, we again see that adding GloVe pre-training and char-based embeddings does not close the gap with the morpheme-based models. All in all, the biggest advantage of morpheme-based models over token-based models is their ability to generalize from observed tokens to composition-related OOTV NEs (Comp/LexComp). While character-based embeddings do help token-based models generalize, the contribution of modeling morphology is indispensable, above and beyond the contribution of char-based embeddings.

Setting in the Greater Context
Test Set Results. Table 4 confirms our best results on the Test set. The trends hold, though results on Test are lower than on Dev. The morph gold scenario still provides an upper bound on performance, but it is not realistic. In the realistic scenarios, morph hybrid generally outperforms all other alternatives. The only divergence is that in token-level evaluation, token-multi performs on a par with morph hybrid on the Test set.
Results on MD Tasks. While the Hybrid pipeline achieves superior performance on NER, it also improves the state of the art on other tasks in the pipeline. Table 5 shows the Seg+POS results of our Hybrid pipeline scenario, compared with the Standard pipeline, which replicates the pipeline of More et al. (2019). We use the metrics defined by More et al. (2019). The Hybrid pipeline shows substantial improvements over the results of More et al. (2019), and also outperforms the Test results of Seker and Tsarfaty (2020).
Comparison with Prior Art. Like Ben-Mordecai (2005), we performed three 75%-25% random train/test splits, and used the same seven NE categories (PER, LOC, ORG, TIME, DATE, PERCENT, MONEY). We trained a token-single model on the original space-delimited tokens and a morpheme model on automatically segmented morphemes, obtained using our best segmentation model (Hybrid MD on our trained token-multi model, as in Table 5).
Since their annotation includes only token-level boundaries, all of the results we report conform with token-level evaluation. Table 6 presents the results of these experiments. Both models significantly outperform the previous state of the art of Ben-Mordecai (2005), setting a new performance bar on this earlier benchmark. Moreover, we again observe an empirical advantage when explicitly modeling morphemes, even with the automatic, noisy segmentation used for morpheme-based training.
7 Discussion: Joint Modeling Alternatives and Future Work
The present study provides the motivation and the necessary foundations for comparing morpheme-based and token-based modeling for NER. While our findings clearly demonstrate the advantages of morpheme-based modeling for NER in a morphologically rich language, it is clear that our proposed Hybrid architecture is not the only modeling alternative for linking NER and morphology.
For example, a previous study by Güngör et al. (2018) addresses joint neural modeling of morphological segmentation and NER labeling, proposing a multi-task learning (MTL) approach for joint MD and NER in Turkish. They employ separate Bi-LSTM networks for the MD and NER tasks, with a shared loss to allow for joint learning. Their results indicate improved NER performance, with no improvement in the MD results. Contrary to our proposal, they view MD and NER as distinct tasks, assuming a single NER label per token and not providing disambiguated morpheme-level boundaries for the NER task. More generally, they test only token-based NER labeling and do not attend to the question of input/output granularity in their models.
A different approach for joint NER and morphology is jointly predicting the segmentation and labels for each token in the input stream. This is the approach taken, for instance, by the lattice-based Pointer-Network of Seker and Tsarfaty (2020). As shown in Table 5, their results for morphological segmentation and POS tagging are on a par with our reported results, and, at least in principle, it should be possible to extend the Seker and Tsarfaty (2020) approach to also yield NER predictions.
However, our preliminary experiments with a lattice-based Pointer-Network for token segmentation and NER labeling show that this is not a straightforward task. Contrary to POS tags, which are constrained by the MA, every NER label can potentially go with any segment, which leads to a combinatorial explosion of the search space represented by the lattice. As a result, the NER predictions are brittle to learn, and the complexity of the resulting model is computationally prohibitive.
A different approach to joint sequence segmentation and labeling is to apply the neural model directly to the character sequence of the input stream. One such approach is the char-based labeling-as-segmentation setup proposed by Shao et al. (2017). Shao et al. use a character-based Bi-RNN-CRF to output a single label per character, which indicates both the word boundary (using BIES sequence labels) and the POS tag. This method is also used in their universal segmentation work (Shao et al., 2018). However, as seen in the results of Shao et al. (2018), char-based labeling for segmenting Semitic languages lags far behind all other languages, precisely because morphological boundaries are not explicit in the character sequences.
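To make the labeling-as-segmentation idea concrete, here is a minimal sketch of the label encoding it relies on: each character receives one composite label combining a BIES boundary tag with the segment's category (POS tags in Shao et al.'s case). The function name and exact label format are our own illustration, not code from the cited work:

```python
def to_char_labels(segments):
    """Encode (segment, tag) pairs as one composite BIES+tag label per character."""
    labels = []
    for seg, tag in segments:
        if len(seg) == 1:
            labels.append(f"S-{tag}")                      # single-character segment
        else:
            labels.append(f"B-{tag}")                      # segment start
            labels.extend(f"I-{tag}" for _ in seg[1:-1])   # segment inside
            labels.append(f"E-{tag}")                      # segment end
    return labels
```

For instance, `to_char_labels([("the", "DET"), ("cat", "NOUN")])` yields `["B-DET", "I-DET", "E-DET", "B-NOUN", "I-NOUN", "E-NOUN"]`; decoding such a label sequence recovers both segment boundaries and tags, which is why joint segmentation and tagging can be cast as plain character-level sequence labeling.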
Additional proposals are those of Kong et al. (2015) and Kemos et al. (2019). First, Kong et al. (2015) proposed to solve, e.g., Chinese segmentation and POS tagging using dynamic programming with neural encoding: a Bi-LSTM encodes the character input, which is then fed into a semi-Markov CRF to obtain probabilities for the different segmentation options. Kemos et al. (2019) propose an approach similar to Kong et al. (2015) for joint segmentation and tagging, but add convolution layers on top of the Bi-LSTM encodings to obtain segment features hierarchically, and then feed them into the semi-Markov CRF.
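The semi-Markov decoding step underlying both proposals can be sketched as a small dynamic program: given a scoring function over candidate segments (in the cited works, derived from the neural encodings; here, any user-supplied function), choose the segmentation maximizing the total segment score. This is a simplified, unnormalized sketch of ours; the actual models use a semi-Markov CRF with proper probabilities and learned transition scores:

```python
def best_segmentation(chars, score, max_len=4):
    """Semi-Markov (Viterbi-style) DP: return the segmentation of `chars`
    maximizing the sum of `score(segment)` over its segments, with segments
    capped at `max_len` characters."""
    n = len(chars)
    best = [float("-inf")] * (n + 1)   # best[j]: best total score of chars[:j]
    best[0] = 0.0
    back = [0] * (n + 1)               # back[j]: start index of the last segment
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            s = best[i] + score(chars[i:j])
            if s > best[j]:
                best[j], back[j] = s, i
    # Recover the segments by backtracking from position n.
    segs, j = [], n
    while j > 0:
        segs.append(chars[back[j]:j])
        j = back[j]
    return segs[::-1]
```

With a toy scorer that rewards the segments "ab" and "c", `best_segmentation("abc", score)` recovers `["ab", "c"]`; in the cited models the analogous scores come from Bi-LSTM (or additionally convolutional) segment representations.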
Preliminary experiments we conducted confirm that char-based joint segmentation and NER labeling for Hebrew, whether using char-based labeling or a seq2seq architecture, still lags behind our reported results. We conjecture that this is due to the complex morpho-phonological and orthographic processes in Semitic languages. Going into char-based modeling nuances and offering a sound joint solution for a language like Hebrew is an important matter that merits its own investigation. Such work is feasible now given the new corpus; however, it is out of the scope of the current study.
All in all, the design of sophisticated joint modeling strategies for morpheme-based NER poses fascinating questions, for which our work provides a solid foundation (data, protocols, metrics, strong baselines). More work is needed to investigate joint modeling of NER and morphology in the directions portrayed in this section; this is beyond the scope of this paper, and we leave it for future work.
Finally, while the joint approach is appealing, we argue that the elegance of our Hybrid solution lies precisely in providing a clear and well-defined interface between MD and NER through which the two tasks can interact, while still keeping the distinct models simple, robust, and efficiently trainable. It also has the advantage of allowing us to seamlessly integrate sequence labeling with any lattice-based MA, in a plug-and-play, language-agnostic fashion, towards obtaining further advantages on both of these tasks.

Conclusion
This work addresses the modeling challenges of neural NER in MRLs. We deliver a parallel token-vs.-morpheme NER corpus for Modern Hebrew, which allows one to assess NER modeling strategies in morphologically rich-and-ambiguous environments. Our experiments show that while NER benefits from morphological decomposition, downstream results are sensitive to segmentation errors. We thus propose a Hybrid architecture in which NER precedes and prunes the morphological decomposition. This approach greatly outperforms the Standard pipeline in realistic (non-gold) scenarios. Our analysis further shows that morpheme-based models better recognize OOVs that result from morphological composition. All in all, we deliver new state-of-the-art results for Hebrew NER and MD, along with a novel benchmark, to encourage further investigation into the interaction between NER and morphology.

Figure 1: Lattice for a partial list of analyses of the Hebrew tokens ‫הלב‬ ‫לבית‬ corresponding to Table 1. Bold nodes are token boundaries. Light nodes are segment boundaries. Every path through the lattice is a single morphological analysis. The bold path is a single NE.

Figure 2: The token-single and token-multi models. The input and output correspond to rows 1-2 in Tab. 1. Triangles indicate string embeddings. Circles indicate char-based encoding.
Figure 3 shows the morpheme-based variant for the same input phrase. It has the same basic architecture, but now the input consists of morphological segments instead of tokens. The model outputs a single BIOSE label for each morphological segment in the input. In all modeling variants, the input may be encoded in two ways: (a) string-level embeddings (token-based or morpheme-based), optionally initialized with pre-trained embeddings; (b) char-level embeddings, trained simultaneously with the main task (cf. Ma and Hovy (2016); Chiu and Nichols (2015); Lample et al. (2016)). For char-based encoding (of either tokens or morphemes)

Figure 3: The morpheme model. The input and output correspond to row 3 in Tab. 1. Triangles indicate string embeddings. Circles indicate char-based encoding.

Figure 6: Token-level evaluation in realistic scenarios on Dev, comparing Gold, Standard and Hybrid morphology. CharCNN for morph, CharLSTM for tok. Results for Gold, token-single and token-multi are taken from Fig. 4.

Figure 7: Morph-level evaluation in realistic scenarios on Dev, comparing Gold, Standard and Hybrid morphology. CharCNN for morph, CharLSTM for tok. Results for Gold, token-single and token-multi are taken from Fig. 5.

Figure 8: Entity mention counts and ratio by category and OOTV category, for the Dev set.
1 Data & code: https://github.com/OnlpLab/NEMO

2 Research Questions: NER for MRLs
In MRLs, words are internally complex, and word boundaries do not generally coincide with the boundaries of more basic meaning-bearing units.
is its own token. In the Hebrew phrase, though, neither NE constitutes a single token. In either case, the NE occupies only one of two morphemes in the token, the other being a case-assigning preposition. This simple example demonstrates an extremely frequent phenomenon in MRLs such as Hebrew, Arabic or Turkish: the adequate boundaries for NEs do not coincide with token boundaries, and tokens must be segmented in order to obtain accurate NE boundaries.3
The Char CNN window size is particularly interesting as it was

Table 3: Summary of hyper-parameter tuning. The * indicates divergence from the NCRF++ proposed setup and empirical findings.

Table 4: Test vs. Dev: results with fastText for all models. morph-gold presents an ideal upper bound.

Table 6: Results on the Hebrew NER corpus of Ben-Mordecai (2005), compared to their model, which uses a hand-crafted, feature-engineered MEMM with regular-expression rule-based enhancements and an entity lexicon.