Context-aware Adversarial Training for Name Regularity Bias in Named Entity Recognition

In this work, we examine the ability of NER models to use contextual information when predicting the type of an ambiguous entity. We introduce NRB, a new testbed carefully designed to diagnose Name Regularity Bias of NER models. Our results indicate that all state-of-the-art models we tested show such a bias, with BERT fine-tuned models significantly outperforming feature-based (LSTM-CRF) ones on NRB, despite having comparable (sometimes lower) performance on standard benchmarks. To mitigate this bias, we propose a novel model-agnostic training method that adds learnable adversarial noise to some entity mentions, thus forcing models to focus more strongly on the contextual signal, leading to significant gains on NRB. Combining it with two other training strategies, data augmentation and parameter freezing, leads to further gains.


Introduction
Recent advances in language model pretraining (Peters et al., 2018;Devlin et al., 2019;Liu et al., 2019) have greatly improved the performance of many Natural Language Understanding (NLU) tasks. Yet, several studies (McCoy et al., 2019;Clark et al., 2019;Utama et al., 2020b) revealed that state-of-the-art NLU models often make use of surface patterns in the data that do not generalize well. Named-Entity Recognition (NER), a downstream task that consists in identifying textual mentions and classifying them into a predefined set of types, is no exception.

Figure 1: Examples extracted from Wikipedia (article title given first) that illustrate name regularity bias in NER. For each entity of interest, the gold type and the model prediction are given in brackets. Models employed in this study disregard contextual information and rely instead on some signal from the named-entity itself.
Obama, Fukui: Obama [gold: LOC, predicted: PER] is located in far southwestern Fukui Prefecture.
Patricia A. Madrid: Madrid [gold: PER, predicted: LOC] won her first campaign in 1978 ..
Asda Jayanama: Asda [gold: PER, predicted: ORG] joined his brother, Surapong ...

Name regularity bias (Agarwal et al., 2020b; Zeng et al., 2020) in NER occurs when a model relies on a signal coming from the entity name, and disregards evidence within the local context. Figure 1 shows examples where state-of-the-art models (Peters et al., 2018; Akbik et al., 2018; Devlin et al., 2019) fail to exploit contextual information. For instance, the entity Obama in the first sentence of the figure is wrongly recognized as a person, while the context clearly signals that it is a location (city).
To better highlight this issue, we propose NRB, a testbed designed to accurately diagnose name regularity bias of NER models by harvesting natural sentences from Wikipedia that contain challenging entities, such as those in Figure 1. This differs from previous works that evaluate models on artificial data obtained by either randomizing entities (Lin et al., 2020) or substituting them with ones from a pre-defined list (Agarwal et al., 2020a). NRB is compatible with any annotation scheme, and is intended to be used as an auxiliary validation set.
We conduct experiments with the feature-based LSTM-CRF architecture (Peters et al., 2018; Akbik et al., 2018) and the BERT (Devlin et al., 2019) fine-tuning approach trained on standard benchmarks. The best LSTM-based model we tested is able to correctly predict 38% of the entities in NRB. BERT-based models perform much better (+37%), even though they (slightly) underperform on in-domain development and test sets. This mismatch in performance between NRB and standard benchmarks indicates that context awareness of models is not rewarded by existing benchmarks, thus justifying NRB as an additional validation set.
We further propose a novel architecture-agnostic adversarial training procedure (Miyato et al., 2016) in which learnable noise vectors are added to named-entity words, weakening their signal, thus encouraging the model to pay more attention to contextual information. Applying it to both feature-based LSTM-CRF and fine-tuned BERT models leads to consistent gains on NRB (+13 points) while maintaining the same level of performance on standard benchmarks.
The remainder of the paper is organized as follows. We discuss related works in Section 2. We describe how we built NRB in Section 3, and its use in diagnosing named-entity bias of state-of-the-art models in Section 4. In Section 5, we present a novel adversarial training method that we compare and combine with two simpler ones. We further analyze these training methods in Section 6, and conclude in Section 7.

Related Work
Robustness and out-of-distribution generalization have long been a concern in deep learning applications such as computer vision (Szegedy et al., 2013; Recht et al., 2019), speech processing (Seltzer et al., 2013; Borgholt et al., 2020), and NLU (Søgaard, 2013; Hendrycks and Gimpel, 2017; Ghaddar and Langlais, 2017; Yaghoobzadeh et al., 2019; Hendrycks et al., 2020). One key challenge behind this issue in NLU is the tendency of models to quickly leverage surface form features and annotation artifacts (Gururangan et al., 2018), which are often referred to as dataset biases (Dasgupta et al., 2018; Shah et al., 2020). We discuss related works along two axes: diagnosis and mitigation.

Diagnosing Bias
A growing number of studies (Zellers et al., 2018; Poliak et al., 2018; Geva et al., 2019; Utama et al., 2020b; Sanh et al., 2020) are showing that NLU models rely heavily on spurious correlations between output labels and surface features (e.g. keywords, lexical overlap), impacting their generalization performance. Therefore, considerable attention has been paid to designing diagnostic benchmarks where models relying on bias would perform poorly. For instance, HANS (McCoy et al., 2019), FEVER Symmetric (Schuster et al., 2019), and PAWS are benchmarks that contain counterexamples to well-known biases in the training data of textual entailment (Williams et al., 2017), fact verification (Thorne et al., 2018), and paraphrase identification (Wang et al., 2018) respectively.
Naturally, many entity names have a strong correlation with a single type (e.g. <Gonzales, PER> or <Madrid, LOC>). Recent works have noted that over-relying on entity name information negatively impacts NLU tasks. Balasubramanian et al. (2020) found that substituting named-entities in standard test sets of natural language inference, coreference resolution, and grammar error correction has a negative impact on those tasks. In political claims detection (Padó et al., 2019), Dayanik and Padó (2020) show that claims made by frequently occurring politicians in the training data are better recognized than those made by less frequent ones.
Recently, Zeng et al. (2020) and Agarwal et al. (2020b) conducted two separate analyses on the decision making mechanism of NER models. Both works found that context tokens do contribute to system performance, but that entity names play a major role in driving high performance. Agarwal et al. (2020a) reported a performance drop in NER models when entities in standard test sets are substituted with other ones pulled from pre-defined lists. Concurrently, Lin et al. (2020) conducted an empirical analysis on the robustness of NER models in the open domain scenario. They show that models are biased by strong entity name regularity, and by train/test overlap in standard benchmarks. They observe a drop in performance of 34% when entity mentions are randomly replaced by other mentions.
The aforementioned studies certainly demonstrate name regularity bias. Still, in many cases the entity mention is the only key to infer its type, as in "James won the league". Thus, randomly swapping entity names, as proposed by Lin et al. (2020), typically introduces false positive examples, which obscures observations. Furthermore, creating artificial word sequences introduces a mismatch between the pre-training and the fine-tuning phases of large-scale language models. NER is also challenging because of compounding factors such as entity boundary detection (Zheng et al., 2019), rare words and emerging entities (Strauss et al., 2016), document-level context (Durrett and Klein, 2014), capitalization mismatch (Mayhew et al., 2019), unbalanced datasets (Nguyen et al., 2020), and domain shift (Alvarado et al., 2015; Augenstein et al., 2017). It is unclear to us how randomizing mentions in a corpus, as proposed by Lin et al. (2020), interferes with these factors.
NRB gathers genuine entities that appear in natural sentences extracted from Wikipedia. Examples are selected so that entity boundaries are easy to identify, and their types can be inferred from the local context, thus avoiding compounding many factors responsible for lack of robustness.

Mitigating Bias
The prevailing approach to address dataset biases consists in adjusting the training loss for biased examples. A number of recent studies (Clark et al., 2019;Belinkov et al., 2019;He et al., 2019;Mahabadi et al., 2020;Utama et al., 2020a) proposed to train a shallow model that exploits manually designed biased features. A main model is then trained in an ensemble with this pre-trained model, in order to discourage the main model from adopting the naive strategy of the shallow one.
Adversarial training (Miyato et al., 2016) is a regularization method which has been shown to improve not only robustness (Ebrahimi et al., 2018; Bekoulis et al., 2018), but also generalization (Cheng et al., 2019) in NLU. It builds on the idea of adding adversarial examples (Goodfellow et al., 2014; Fawzi et al., 2016) to the training set, that is, small perturbations of the data that can change the prediction of a classifier. For NLP tasks, these perturbations are applied at the token embedding level and are norm-bounded. Typically, adversarial training algorithms can be defined as a minmax optimization problem wherein the adversarial examples are generated to maximize the loss, while the model is trained to minimize it. Belinkov et al. (2019) used adversarial training to mitigate the hypothesis-only bias in textual entailment models. Clark et al. (2020) adversarially trained a low and a high capacity model in an ensemble in order to ensure that the latter model is focusing on patterns that should generalize better. Dayanik and Padó (2020) used an extra adversarial loss in order to encourage a political claims detection model to learn more from samples with infrequent politician names. Le Bras et al. (2020) proposed an adversarial technique to filter out biased examples from training material. Models trained on the filtered datasets show improved out-of-distribution performance on various computer vision and NLU tasks.
Data augmentation is another strategy for enhancing robustness. It was successfully used by Min et al. (2020) and Moosavi et al. (2020) to improve textual entailment performance on the HANS benchmark. The former approach proposes to extend original training sentences with their corresponding predicate-argument triplets generated by a semantic role labelling tagger, while the latter generates new examples by applying syntactic transformations to the original training instances. Zeng et al. (2020) created new examples by randomly replacing an entity by another one of the same type that occurs in the training data. New examples are considered valid if the type of the replaced entity is correctly predicted by a NER model trained on the original dataset. Similarly, Dai and Adel (2020) explored different entity substitution techniques for data augmentation tailored to NER. Both studies conclude that data augmentation techniques based on entity substitution improve overall performance on low resource biomedical NER.
Studies discussed above have the potential to mitigate name regularity bias of NER models. Still, we are not aware of any dedicated work that shows it is so. In this work, we propose ways of mitigating name regularity bias for NER, including an elaborate adversarial method that forces the model to capture more signal from the context. Our methods require neither an extra training stage nor a manual characterization of biased features. They are therefore conceptually simpler, and can potentially be combined with any of the discussed techniques. Furthermore, our proposed methods are effective under both low and high resource settings.

The NRB Benchmark
NRB is a diagnostic testbed exclusively dedicated to name regularity bias in NER. To this end, it gathers named-entities that satisfy 4 criteria:
1. Must be real-world entities within natural sentences → We select sentences from Wikipedia articles.
2. Must be compatible with any annotation scheme → We restrict our focus to the 3 most common types found in NER benchmarks: person, location, and organization.
3. Boundary detection (segmentation) should not be a bottleneck → We only select single-word entities that start with a capital letter.
4. Supporting evidence of the type must be restricted to local context only (a window of 2 to 4 tokens) → We developed a primitive context-only tagger to filter out entities with no close-context signal.
The strategy used to gather examples in NRB is illustrated in Figure 2. We first select Wikipedia articles that are listed in a disambiguation page. Disambiguation pages group different topics that could be referred to by the same query term. The query term Bromwich in Figure 2 has its own disambiguation page that contains a link to the city of West Bromwich, West Bromwich Albion Football Club, and Kenny Bromwich the rugby league player.
We associate each article in a disambiguation page to the entity type found in its corresponding Freebase page (Bollacker et al., 2008), considering only articles whose Freebase type can be mapped to a person, a location, or an organization. We assume that occurrences of the query term within the article are of this type. This assumption was found accurate in previous works on Wikipedia distant supervision for NER Langlais, 2016, 2018). The sentence in our example is extracted from the Kenny Bromwich article, whose Freebase type can be mapped to a person. Therefore, we assume Bromwich in this sentence to be a person.
To decide whether a sentence containing a query term is worth being included in NRB, we rely on two NER taggers. One is a popular NER system which provides a confidence score to each prediction, and which acts as a weak supervisor. The other is a context-only tagger we designed specifically (see Section 3.1) to detect entities with a strong signal from their local context. A sentence is selected if the query term is incorrectly labeled with high confidence (score > 0.85) by the former tagger, while the latter one labels it correctly with high confidence (a gap of at least 0.25 in probability between the first and second predicted types). This is the case of the sentence in Figure 2, where Bromwich is incorrectly labeled as an organisation by the weak supervision tagger, but correctly labeled as a person by the context-only tagger.
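The selection rule above can be sketched as a small predicate. This is only an illustration under stated assumptions: the function names (`context_gap`, `keep_for_nrb`) and the probability values in the usage example are hypothetical, and only the two thresholds (0.85 and 0.25) come from the text.

```python
from typing import Dict, Tuple


def context_gap(probs: Dict[str, float]) -> Tuple[str, float]:
    """Return the top-ranked type and its probability margin over the runner-up."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    (best, p1), (_, p2) = ranked[0], ranked[1]
    return best, p1 - p2


def keep_for_nrb(gold: str, weak_pred: str, weak_score: float,
                 context_probs: Dict[str, float],
                 min_score: float = 0.85, min_gap: float = 0.25) -> bool:
    """Keep a sentence if the weak tagger is confidently wrong while the
    context-only tagger is confidently right (thresholds from the text)."""
    best, gap = context_gap(context_probs)
    weak_confidently_wrong = weak_pred != gold and weak_score > min_score
    context_confidently_right = best == gold and gap >= min_gap
    return weak_confidently_wrong and context_confidently_right


# Illustrative values only: a person mislabeled as ORG by the weak tagger.
selected = keep_for_nrb(
    gold="PER", weak_pred="ORG", weak_score=0.93,
    context_probs={"PER": 0.60, "LOC": 0.15, "ORG": 0.25},
)
```

Both conditions must hold at once, which is what makes the retained examples challenging: the name alone misleads a standard tagger, while the local context is sufficient to recover the type.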

Implementation
We used the Stanford CoreNLP (Manning et al., 2014) tagger as our weak supervision tagger and developed a simple yet efficient method to build a context-only tagger. For this, we first applied the Stanford tagger to the entire Wikipedia dump and replaced all identified entity mentions by their tag. Then, we trained a 5-gram language model on the resulting corpus using kenLM (Heafield, 2011). Figure 3 illustrates how this model is deployed as an entity tagger: the mention is replaced by an empty slot and the language model is queried for each type. We rank the tags using the perplexity score given by the model to the resulting sentences, then we normalize those scores to get a probability distribution over types.
Figure 3 (example): Obama is located in far southwestern Fukui Prefecture. → <?> is located in far southwestern Fukui Prefecture.
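A minimal sketch of how such a context-only tagger could be queried is given below. Several points are assumptions rather than details from the text: the language model is abstracted as a hypothetical `perplexity` callable (standing in for the 5-gram model), the tag tokens are assumed to be the plain strings PER, LOC and ORG, and the normalization (inverse perplexity divided by its sum) is only one plausible choice, since the exact normalization is not specified.

```python
from typing import Callable, Dict, List

TYPES = ("PER", "LOC", "ORG")


def type_distribution(tokens: List[str], mention_index: int,
                      perplexity: Callable[[str], float]) -> Dict[str, float]:
    """Fill the mention slot with each type tag, score the sentence with the
    language model, and normalize the scores into a distribution over types."""
    scores = {}
    for tag in TYPES:
        filled = tokens[:mention_index] + [tag] + tokens[mention_index + 1:]
        # Lower perplexity means the tag fits the context better, so invert it.
        scores[tag] = 1.0 / perplexity(" ".join(filled))
    total = sum(scores.values())
    return {tag: score / total for tag, score in scores.items()}


def toy_perplexity(sentence: str) -> float:
    """Hypothetical stand-in for a language model: it simply pretends that
    'LOC is located ...' is the most fluent continuation."""
    return 10.0 if sentence.startswith("LOC is located") else 40.0


tokens = "Obama is located in far southwestern Fukui Prefecture .".split()
dist = type_distribution(tokens, 0, toy_perplexity)
best_type = max(dist, key=dist.get)
```

The gap between the two largest values of `dist` is the quantity that the selection thresholds of this section operate on.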
We downloaded the Wikipedia dump of June 2020, which contains 30k disambiguation pages. These pages contain links to 263k articles, of which only 107k (40%) have a type in Freebase that can be mapped to the 3 types of interest. The Stanford tagger identified 440k entities that match the query term of the disambiguation pages. The thresholds discussed previously were chosen to select around 5000 of the most challenging examples in terms of name regularity bias. This figure aligns with the number of entities present in the test set of the well-studied CONLL benchmark (Tjong Kim Sang and De Meulder, 2003).
We assessed the annotation quality by asking a human to filter out noisy examples. A sentence was removed if it contained an annotation error, or if the type of the query term could not be inferred from the local context. Only 1.3% of the examples were removed, which confirms the accuracy of our automatic procedure. NRB is composed of 5275 examples, and each sentence contains a single annotation (see Figure 1 for examples).

Control Set (WTS)
In addition to NRB, we collected a set of domain control sentences, called WTS for WITNESS, that contain the very same query terms selected in NRB, but which were correctly labeled by both the Stanford (score > 0.85) and the context-only taggers. We selected examples with a small gap (< 0.1) between the first and second ranked type assigned to the query term by the latter tagger. Thus, examples in WTS should be easy to tag. For example, because Obama the Japanese city (see Figure 3) is selected among the query terms in NRB, we added an instance of Obama the president.
Performing poorly on such examples (that is, a system that fails to tag Obama the president as a person) indicates a domain shift between NRB (Wikipedia) and whatever dataset a model is trained on (we call it the in-domain corpus). WTS is composed of 5192 sentences that have also been manually checked.

Data
To be comparable with state-of-the-art models, we consider two standard benchmarks for NER: CONLL-2003 (Tjong Kim Sang and De Meulder, 2003) and ONTONOTES 5.0 (Pradhan et al., 2012), which include 4 and 18 types of named-entities respectively. ONTONOTES is 4 times larger than CONLL, and both benchmarks mainly cover the news domain. We run experiments on the official train/dev/test splits, and report mention-level F1 scores, following previous works. Since there is only one entity per sentence to annotate in NRB, a system is evaluated on its ability to correctly identify the boundaries of this entity and its type. When we train on ONTONOTES (18 types) and evaluate on NRB (3 types), we perform type mapping using the scheme of Augenstein et al.

Systems
Following (Devlin et al., 2019), we term all approaches that learn the encoder from scratch as feature-based, as opposed to the ones that finetune a pre-trained model for the downstream task.
We conduct experiments using 3 feature-based and 2 fine-tuning approaches for NER:
• BERT-LSTM Similar to the previous model, but replacing ELMo by a representation gathered from the last four layers of BERT.
• BERT-base The fine-tuning approach proposed by Devlin et al. (2019) using the BERT-base model.
• BERT-large The fine-tuning approach using the BERT-large model.
We used Flair-LSTM off-the-shelf, and reimplemented other approaches using the default settings proposed in the respective papers. For our reimplementations, we used early stopping based on performance on the development set, and report average performance over 5 runs. For BERT-based solutions, we adopt spanBERT (Joshi et al., 2020) as a backbone model since it was found by Li et al. (2020) to perform better on NER.

Results
Table 1 shows the mention-level F1 score of the systems considered. Flair-LSTM and BERT-large are the best performing models on in-domain test sets, the maximum gap with other models being 1.1 and 2.7 on CONLL and ONTONOTES respectively. These figures are in line with previous works. What is more interesting is the performance on NRB. Feature-based models do poorly; Flair-LSTM in particular underperforms the other models (F1 scores of 27.6 and 33.7 when trained on CONLL and ONTONOTES respectively). Fine-tuned BERT models clearly perform better (around 75), but remain far from in-domain results (92.9 and 89.9 on CONLL and ONTONOTES respectively). Domain shift does not explain those results, since the performances on WTS are rather high (92 or higher). Furthermore, we found that the boundary detection (segmentation) performance on NRB is above 99.2% across all settings. Since errors made on NRB are due neither to segmentation nor to domain shift, they must be attributed to the name regularity bias of models.
It is worth noting that BERT-LSTM outperforms ELMo-LSTM on NRB, despite underperforming on in-domain test sets. This may be because BERT was pre-trained on Wikipedia (the same domain as NRB), while ELMo embeddings were trained on the One Billion Word corpus (Chelba et al., 2014). Also, we observe that switching from BERT-base to BERT-large, or training on 4 times more data (CONLL versus ONTONOTES), does not help on NRB. This suggests that name regularity bias is neither a data nor a model capacity issue.

Feature-based vs. Fine-tuning
In this section, we analyze reasons for the drastic superiority of fine-tuned models on NRB. First, the large gap between BERT-LSTM and BERT-base on NRB suggests that this is not related to the representations being used at the input layer.
Second, we tested several configurations of ELMo-LSTM where we scale up the number of LSTM layers and hidden units. We observed a degradation of performance on dev, test and NRB sets, mostly due to over-parameterized models. We also trained 9, 6 and 4 layers BERT-base models, and still noticed a large advantage of BERT models on NRB. This suggests that the higher capacity of BERT alone cannot explain all the gains.
Third, since by design the evidence for the entity type in NRB resides within the local context, it is unlikely that gains on this set come from the ability of Transformers (Vaswani et al., 2017) to better handle long dependencies than LSTM (Hochreiter and Schmidhuber, 1997). To further validate this statement, we fine-tuned BERT models with randomly initialized weights, except the embedding layer. This time, the performances on NRB fall into the same range as those of feature-based models, and we observe a drastic decrease (12-15%) on standard benchmarks. These observations are in keeping with results from Hendrycks et al. (2020) on the out-of-distribution robustness of fine-tuning pre-trained transformers, and also confirm observations made by Agarwal et al. (2020b).
From these analyses, we conclude that the Masked Language Model (MLM) objective (Devlin et al., 2019) that the BERT models were pre-trained with is a key factor driving the superior performance of the fine-tuned models on NRB. In most cases, the target word is masked or randomly selected, therefore the model must rely on the context to predict the correct target, which is what a model should do to correctly predict the type of entities in NRB. We think that, in fine-tuning, training for a few epochs with a small learning rate helps the model to preserve the contextual behaviour induced by the MLM objective.
Nevertheless, fine-tuned models recording at best an F1 score of 75.6 on NRB do show some name regularity bias, and fail to capture useful local contextual information.

Mitigating Bias
In this section, we investigate training procedures that are designed to enhance the contextual awareness of a model, leading to a better performance on NRB without impacting in-domain performance. These training procedures are not supposed to use any external data. In fact, NRB is only used as a diagnostic corpus, once the model is trained. We propose 3 training procedures that can be combined: two of them are architecture-agnostic, and one is specific to fine-tuning BERT.

Entity Masking
Inspired by the masking strategy applied during the pre-training phase of BERT, we propose a data augmentation approach that introduces a special [MASK] token in some of the training examples. Specifically, we search for entities in the training material that are preceded or followed by 3 non-entity words. This criterion applies to 35% and 39% of entities in the training data of CONLL and ONTONOTES respectively. For each such entity, we create a new training example (new sentence) by replacing the entity by [MASK], thus forcing the model to infer the type of masked tokens from the context. We call this procedure mask.
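The mask procedure can be sketched as follows, under a few assumptions that are not fixed by the text: labels follow an IOB-style scheme with B-/I- prefixes, every token of a qualifying mention is replaced by its own [MASK] (so that labels stay aligned), and a mention next to a sentence boundary with fewer than 3 neighbouring words on a side does not qualify on that side. All function names are hypothetical.

```python
from typing import List, Tuple

MASK = "[MASK]"


def mention_spans(labels: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) spans, end exclusive, of mentions in an IOB sequence."""
    spans, start = [], None
    for i, label in enumerate(labels + ["O"]):  # sentinel closes a trailing mention
        continues = label.startswith("I-") and start is not None
        if start is not None and not continues:
            spans.append((start, i))
            start = None
        if label.startswith("B-") or (label.startswith("I-") and start is None):
            start = i
    return spans


def has_context(labels: List[str], start: int, end: int, window: int = 3) -> bool:
    """True if the mention is preceded or followed by `window` non-entity words."""
    before = labels[max(0, start - window):start]
    after = labels[end:end + window]

    def all_outside(side: List[str]) -> bool:
        return len(side) == window and all(label == "O" for label in side)

    return all_outside(before) or all_outside(after)


def masked_examples(tokens: List[str], labels: List[str], window: int = 3):
    """Create one new (tokens, labels) example per qualifying mention."""
    examples = []
    for start, end in mention_spans(labels):
        if has_context(labels, start, end, window):
            masked = tokens[:start] + [MASK] * (end - start) + tokens[end:]
            examples.append((masked, list(labels)))
    return examples


tokens = "Obama is located in far southwestern Fukui Prefecture .".split()
labels = ["B-LOC", "O", "O", "O", "O", "O", "B-LOC", "I-LOC", "O"]
augmented = masked_examples(tokens, labels)
```

The new examples are added to the original training data, so the model still sees the unmasked mentions as well.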

Parameter Freezing
Another simple strategy, specific to fine-tuning BERT, consists of freezing part of the network. More precisely, we freeze the bottom half of BERT, including the embedding layer. The intuition is to preserve part of the predicting-by-context mechanism that BERT has acquired during the pre-training phase. This training procedure is expected to reinforce the contextual ability of the model, thus adding to our analysis on the critical role of the MLM objective in pre-training BERT. We name this method freeze.
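As an illustration, the freeze rule can be expressed as a predicate over parameter names. The sketch assumes the parameter naming convention commonly seen in BERT implementations (names containing `embeddings.` and `encoder.layer.<index>.`); both this convention and the helper name `should_freeze` are assumptions made for the example, not details given in the text.

```python
import re

# Assumed naming convention: "...encoder.layer.<index>...." for Transformer layers.
_LAYER_PATTERN = re.compile(r"encoder\.layer\.(\d+)\.")


def should_freeze(param_name: str, num_layers: int) -> bool:
    """Freeze the embedding layer and the bottom half of the encoder layers."""
    if "embeddings." in param_name:
        return True
    match = _LAYER_PATTERN.search(param_name)
    return match is not None and int(match.group(1)) < num_layers // 2


# With 24 layers (BERT-large), layers 0-11 are frozen and layers 12-23 are trained.
frozen = [
    name for name in (
        "bert.embeddings.word_embeddings.weight",
        "bert.encoder.layer.11.attention.self.query.weight",
        "bert.encoder.layer.12.attention.self.query.weight",
        "classifier.weight",
    )
    if should_freeze(name, num_layers=24)
]
```

With a PyTorch model, one would typically iterate over `model.named_parameters()` and set `requires_grad = False` on the parameters for which the predicate returns true, leaving the top half and the output layer trainable.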

Adversarial Noise
We propose an adversarial learning algorithm that makes entity type patterns in the input representation less reliable for the model, thus forcing it to rely more heavily on the context. To do so, we add a learnable adversarial noise vector (only) to the input representation of entities. We refer to this method as adv.
Let $T = \{t_1, t_2, \ldots, t_K\}$ be a predefined set of types, such as PER, LOC, and ORG in our case. Let $x = x_1, x_2, \ldots, x_n$ be the input sequence of length $n$, $y = y_1, y_2, \ldots, y_n$ be the gold label sequence following the IOB tagging scheme, and $y' = y'_1, y'_2, \ldots, y'_n$ be a sequence obtained by adding noise to $y$ at the mention level, that is, by randomly replacing the type of mentions in $y$ with some noisy type sampled from $T$.
Let $Y_{ij}(t) = y_i, \ldots, y_j$ be a mention of type $t \in T$, spanning the sequence of indices $i$ to $j$ in $y$. We derive a noisy mention $Y'_{ij}$ in $y'$ from $Y_{ij}(t)$ as follows:
$$Y'_{ij} = \begin{cases} Y_{ij}(t'), \quad t' \sim \mathrm{Cat}\left(\gamma \mid \xi = \tfrac{1}{K-1}\right),\ \gamma \in T \setminus \{t\} & \text{if } p \sim U(0, 1) \leq \lambda \\ Y_{ij}(t) & \text{otherwise} \end{cases}$$
where $\lambda$ is a threshold parameter, $U(0, 1)$ refers to the uniform distribution in the range [0, 1], $\mathrm{Cat}(\gamma \mid \xi = \tfrac{1}{K-1})$ is the categorical distribution whose outcomes are equally likely with probability $\xi$, and the set $T \setminus \{t\} = \{t' : t' \in T \wedge t' \neq t\}$ stands for the set $T$ excluding type $t$.

Figure 4: Illustration of our adversarial method applied on the entity New York. First, we generate a noisy type (PER), and then add a learnable noise embedding (LOC→PER) to the input representation of that entity. This will make entity patterns (hashed rectangles) unreliable for the model, hence forcing it to collect evidence (dotted arrow) from the context. The noise embedding matrix and the noise label projection layer weights (dotted rectangle) are trained independently from the model parameters.
The above procedure only applies to the entities which are preceded or followed by 3 context words. For instance, in Figure 4, we produce a noisy type for New York (PER), but not for John ($p > \lambda$). Also, note that we generate a different sequence $y'$ from $y$ at each training epoch.
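The sampling of noisy types can be sketched as below. This is a simplified illustration: the function names are hypothetical, labels are assumed to carry B-/I- prefixes, and the restriction to entities surrounded by 3 context words is omitted for brevity. A mention receives a noisy type when the uniform draw does not exceed λ, which matches the John example above (no noise when p > λ).

```python
import random
from typing import List

TYPES = ("PER", "LOC", "ORG")


def sample_noisy_type(true_type: str, lam: float, rng: random.Random) -> str:
    """Return a different type when the uniform draw p <= lam, else the true type."""
    if rng.random() <= lam:
        candidates = [t for t in TYPES if t != true_type]
        return rng.choice(candidates)  # uniform, i.e. probability 1 / (K - 1) each
    return true_type


def noisy_labels(labels: List[str], lam: float, rng: random.Random) -> List[str]:
    """Resample the type of each mention; all tokens of a mention share the same type."""
    out, current = [], None
    for label in labels:
        if label.startswith("B-"):
            current = sample_noisy_type(label[2:], lam, rng)
            out.append("B-" + current)
        elif label.startswith("I-") and current is not None:
            out.append("I-" + current)
        else:
            current = None
            out.append(label)
    return out


gold = ["B-LOC", "I-LOC", "O", "B-PER"]
always_noisy = noisy_labels(gold, lam=1.0, rng=random.Random(0))
never_noisy = noisy_labels(gold, lam=0.0, rng=random.Random(0))
```

Calling `noisy_labels` again at every epoch yields a different noisy sequence for the same sentence, as described above.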
Next, we define a learnable noisy embedding matrix $E' \in \mathbb{R}^{m \times d}$, where $m = |T| \times (|T| - 1)$ is the number of valid type switching possibilities, and $d$ is the dimension of the input representations of $x$. For each token with a noisy label, we add the corresponding noisy embedding to its input representation. For other tokens, we simply add a zero vector of size $d$. As depicted in Figure 4, the noisy type of the entity New York is PER, therefore we add the noise embedding at index LOC → PER to its input representation.
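The indexing of the noise embedding matrix can be illustrated with plain Python lists. In this sketch the matrix is a fixed list of rows rather than a trained parameter, and the ordering of the type-switch pairs as well as the names `SWITCH_INDEX` and `add_noise` are assumptions made for the example.

```python
from itertools import permutations
from typing import Dict, List, Optional, Tuple

TYPES = ("PER", "LOC", "ORG")

# One row per ordered pair (true type -> noisy type), so m = K * (K - 1) rows.
SWITCH_INDEX: Dict[Tuple[str, str], int] = {
    pair: i for i, pair in enumerate(permutations(TYPES, 2))
}


def add_noise(inputs: List[List[float]],
              true_types: List[Optional[str]],
              noisy_types: List[Optional[str]],
              noise_matrix: List[List[float]]) -> List[List[float]]:
    """Add the noise row of the (true -> noisy) switch to each noisy token."""
    out = []
    for vec, true_t, noisy_t in zip(inputs, true_types, noisy_types):
        if true_t is None or true_t == noisy_t:
            out.append(list(vec))  # equivalent to adding a zero vector
        else:
            row = noise_matrix[SWITCH_INDEX[(true_t, noisy_t)]]
            out.append([a + b for a, b in zip(vec, row)])
    return out


d = 2
noise_matrix = [[float(i)] * d for i in range(len(SWITCH_INDEX))]  # toy values
inputs = [[1.0, 1.0], [2.0, 2.0]]  # an entity token, then a context token
noised = add_noise(inputs, ["LOC", None], ["PER", None], noise_matrix)
```

In an actual model the rows would be learned jointly with the extra projection layer, while the lookup logic would stay the same.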
Then, the input representation of the sequence is fed to an encoder followed by an output layer, such as LSTM-CRF in (Peters et al., 2018), or BERT-Softmax in (Devlin et al., 2019). First, we extend the aforementioned models by generating an extra logit $f'$ using a projection layer parametrized by $W'$ and followed by a softmax function. As shown in Figure 4, for each token the model produces two logits relative to the true and noisy tags. Then, we train the entire model to minimize two losses: $L_{true}(\theta)$ and $L_{noisy}(\theta')$, where $\theta$ is the original set of parameters and $\theta' = \{E', W'\}$ is the extra set we added (dotted boxes in Figure 4). $L_{true}(\theta)$ is the regular loss on the true tags, while $L_{noisy}(\theta')$ is the loss on the noisy tags, computed with the cross-entropy (CE) loss function. Both losses are minimized using gradient descent.
It is worth mentioning that $\lambda$ is the only hyperparameter of our adv method. It controls how often noisy embeddings are added during training. Higher values of $\lambda$ increase the amount of uncertainty around salient patterns in the input representation of entities, hence preventing the model from overfitting those patterns, and therefore pushing it to rely more on context information. We tried values of $\lambda$ between 0.3 and 0.9, and found $\lambda = 0.8$ to be the best one based on CONLL and ONTONOTES development sets.

Results
We trained models on CONLL and ONTONOTES, and evaluated them on their respective TEST set. Recall that NRB and WTS are only used as auxiliary diagnostic sets. Table 2 shows the impact of our training methods when fine-tuning the BERT-large model (the one that performs best on NRB).
First, we observe that each training method significantly improves the performance on NRB. Adding adversarial noise is notably the best performing method on NRB, with an additional gain of 10.5 and 10.4 F1 points over the respective baselines. On the other hand, we observe minor variations on in-domain test sets, as well as on WTS. The paired sample t-test (Cohen, 1996) confirms that these variations are not statistically significant (p > 0.05). After all, the number of decisions that differ between the baseline and the best model on a given in-domain set is less than 20.
Table 2: Impact of training methods on BERT-large models fine-tuned on CONLL or ONTONOTES.
Second, we observe that combining methods always leads to improvements on NRB, the best configuration being the combination of all 3 methods. It is interesting to note that combining training methods leads to a performance on NRB which does not depend much on the training set used: CONLL (89.7) and ONTONOTES (88.8). This suggests that name regularity bias is a modelling issue, and not the effect of factors such as training data size, domain, or type granularity.
In order to validate that our training methods are not specific to the fine-tuning approach, we replicated the same experiments with the ELMo-LSTM. Table 3 shows the performances of the mask and adv procedures (the freeze method does not apply here). The results are in line with those observed with BERT-large: significant gains on NRB of 14 and 12 points for CONLL and ONTONOTES models respectively, and no statistically significant changes on in-domain test sets. Again, combining training methods leads to systematic gains on NRB (13 points on average). Unlike when fine-tuning BERT, we observe a slight drop in performance of 1.2% on WTS when both methods are used.
The performance of ELMo-LSTM on NRB does not rival that obtained by fine-tuning the BERT-large model, which confirms that BERT is a key factor to enhance robustness, even if in-domain performance is not necessarily rewarded (McCoy et al., 2019; Hendrycks et al., 2020).

Analysis
So far, we have shown that state-of-the-art models do suffer from name regularity bias, and we proposed model-agnostic training methods which are able to mitigate this bias to some extent. In Section 6.1, we provide further evidence that our training methods force the BERT-large model to better concentrate on contextual cues. In Section 6.2, we replicate the evaluation protocol of Lin et al. (2020) in order to rule out the possibility that our training methods are only valid on NRB. Last, we perform extensive experiments on name regularity bias under low resource (Section 6.3) and multilingual (Section 6.4) settings.

Attention Heads
We leverage the attention map of BERT to better understand how our method enhances context encoding. To this end, we calculate the average number of attention heads that point to the entity mentions being predicted at each layer. We conduct this experiment on NRB with the BERT-large model (24 layers with 16 attention heads at each layer) fine-tuned on CONLL.
At each layer, we average the number of heads whose highest attention weight (argmax) points to the entity name. Figure 5 shows the average number of attention heads that point to an entity mention in the BERT-large model fine-tuned without our methods, with the adversarial noise method (adv), and with all three methods. We observe an increasing number of heads pointing to entity names as we get closer to the output layer: at the bottom layers (left part of the figure) only a few heads point to entity names, in contrast to the last 2 layers (right part) where almost all heads do so. This observation is in line with Jawahar et al. (2019), who show that bottom and intermediate BERT layers mainly encode lexical and syntactic information, while top layers represent task-related information. Our training methods lead to fewer heads at the top layers pointing to entity mentions, suggesting that the model focuses more on contextual information.
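The head-counting statistic described above can be sketched as follows. This is a minimal illustration rather than the code used in the experiments: it assumes that the attention weights of one sentence are already available as an array of shape (layers, heads, seq_len, seq_len), and that a head "points to" the entity when its attention, averaged over all query positions, peaks on an entity token. The averaging over queries, the input layout, and the function names are assumptions made for this example.

```python
import numpy as np


def heads_pointing_to_entity(attn, entity_positions):
    """Count, per layer, the heads whose strongest attention lands on the entity.

    attn: array of shape (num_layers, num_heads, seq_len, seq_len),
          where rows are query positions and columns are key positions.
    entity_positions: iterable of token indices covered by the entity mention.
    Returns an integer array of shape (num_layers,).
    """
    attn = np.asarray(attn, dtype=float)
    # Assumption: summarize each head by the attention each key position
    # receives on average over all query positions.
    received = attn.mean(axis=2)             # (layers, heads, seq_len)
    top_key = received.argmax(axis=-1)       # (layers, heads)
    hits = np.isin(top_key, list(entity_positions))
    return hits.sum(axis=1)


def average_over_sentences(examples):
    """Average the per-layer head counts over (attn, entity_positions) pairs."""
    counts = [heads_pointing_to_entity(a, e) for a, e in examples]
    return np.mean(counts, axis=0)
```

For BERT-large, the returned vector would have 24 entries, each between 0 and 16; plotting it per training configuration gives a curve of the kind discussed for Figure 5.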

Random Permutations
Following the protocol of Lin et al. (2020), we modified the dev and test sets of standard benchmarks by randomly permuting entity mentions dataset-wise, keeping the types untouched. For instance, the span of a specific mention of a person can be replaced by the span of a location wherever it appears in the dataset. As discussed in Section 2, these randomized tests are highly challenging: the context is the only available clue to solve the task, and many false positive examples are introduced that way.
Table 4: F1 scores of BERT-large models fine-tuned on CONLL and evaluated on randomly permuted versions of the dev and test sets: π(dev) and π(test).
Table 4 shows the results of the BERT-large model fine-tuned on CONLL and evaluated on the permuted in-domain dev and test sets. F1 scores are much lower here, confirming that this is a hard testbed, but they do provide evidence of the name regularity bias of BERT. Our training methods improve the model F1 score by 17% and 13% on the permuted dev and test sets respectively, an increase very much in line with what we observed on NRB.
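To make the permutation protocol concrete, here is a minimal sketch under stated assumptions: sentences are (tokens, tags) pairs with BIO tags, every distinct mention surface form is mapped to another one through a single dataset-wide random permutation, and the original entity type is kept at each position. The helper names and the data layout are hypothetical, and the exact procedure of Lin et al. (2020) may differ in its details.

```python
import random


def extract_spans(tags):
    """Return (start, end, type) spans from a BIO tag list; end is exclusive."""
    spans, start = [], None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        inside = (tag.startswith("I-") and start is not None
                  and tag[2:] == tags[start][2:])
        if start is not None and not inside:
            spans.append((start, i, tags[start][2:]))
            start = None
        if tag.startswith("B-"):
            start = i
    return spans


def permute_mentions(sentences, seed=0):
    """Swap mention surface forms dataset-wise while keeping gold types."""
    mentions = sorted({tuple(toks[s:e]) for toks, tags in sentences
                       for s, e, _ in extract_spans(tags)})
    shuffled = mentions[:]
    random.Random(seed).shuffle(shuffled)
    mapping = dict(zip(mentions, shuffled))  # same surface, same replacement

    permuted = []
    for toks, tags in sentences:
        new_toks, new_tags, prev = [], [], 0
        for s, e, typ in extract_spans(tags):
            new_toks += toks[prev:s]
            new_tags += tags[prev:s]
            replacement = list(mapping[tuple(toks[s:e])])
            new_toks += replacement
            # Re-emit BIO tags because the replacement may differ in length.
            new_tags += ["B-" + typ] + ["I-" + typ] * (len(replacement) - 1)
            prev = e
        new_toks += toks[prev:]
        new_tags += tags[prev:]
        permuted.append((new_toks, new_tags))
    return permuted
```

Because the mapping ignores entity types, a person slot can receive a location name, which is what leaves the context as the only usable clue.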

Low Resource Setting
Similarly to Zhou et al. (2019) and Ding et al. (2020), we simulate a low resource setting by randomly sampling tiny subsets of the training data. Since our focus is to measure the contextual learning ability of models, we first selected sentences of the CONLL training data that contain at least one entity followed or preceded by 3 non-entity words. Then, we randomly sampled k ∈ {100, 500, 1000, 2000} sentences with which we fine-tuned BERT-large. Figure 6 shows the performance of the resulting models on NRB. As expected, F1 scores of models fine-tuned with few examples are rather low on NRB as well as on the in-domain test set. Not shown in Figure 6, fine-tuning on 100 and 2000 sentences leads to a performance of 14% and 45% respectively on the CONLL test set. Nevertheless, we observe that our training methods, and adv in particular, improve performance on NRB even under extremely low resource settings. On the CONLL test and WTS sets, scores vary in a range of ±0.5 and ±0.7 respectively when our methods are added to BERT-large.
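The sentence selection and sampling steps can be sketched as below. This is an illustrative reading of the criterion, assuming BIO-style tags where "O" marks non-entity words and treating a maximal run of entity tokens as one block; the function names, the `window` parameter, and the seeding are assumptions introduced for the example.

```python
import random


def has_contextual_entity(tags, window=3):
    """True if some entity is preceded or followed by `window` non-entity words."""
    n, i = len(tags), 0
    while i < n:
        if tags[i] == "O":
            i += 1
            continue
        j = i
        while j < n and tags[j] != "O":  # extend over the entity block
            j += 1
        before = tags[max(0, i - window):i]
        after = tags[j:j + window]
        if len(before) == window and all(t == "O" for t in before):
            return True
        if len(after) == window and all(t == "O" for t in after):
            return True
        i = j
    return False


def sample_low_resource(sentences, k, seed=0):
    """Sample up to k (tokens, tags) sentences that satisfy the context criterion."""
    pool = [s for s in sentences if has_contextual_entity(s[1])]
    return random.Random(seed).sample(pool, min(k, len(pool)))
```

Calling `sample_low_resource(train, k)` for each k in {100, 500, 1000, 2000} would yield the training subsets, with `train` standing for the CONLL training sentences in this hypothetical layout.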

Experimental Protocol
For in-domain data, we use the German, Spanish, and Dutch CONLL-2002 (Tjong Kim Sang, 2002) NER datasets. Those benchmarks, also from the news domain, come with a train/dev/test split, and the training material is comparable in size to the English CONLL dataset. In addition, we experiment with four non-CONLL benchmarks: Finnish (Luoma et al., 2020), Danish (Hvingelby et al., 2020), Croatian (Ljubešić et al., 2018), and Afrikaans (Eiselen, 2016) data. These corpora cover more diverse text genres, yet mainly follow the CONLL annotation scheme. The Finnish and Afrikaans datasets are comparable in size to English CONLL, the Danish one is 60% smaller, while the Croatian one is twice as large. We use the provided train/dev/test splits for Danish and Finnish, while we randomly split (80/10/10) the Croatian and Afrikaans datasets.
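For the datasets without an official split, a random 80/10/10 partition can be sketched as follows. The function name, the seed, and the sentence-level granularity are assumptions for illustration; the source does not specify how the split was drawn.

```python
import random


def split_80_10_10(sentences, seed=0):
    """Shuffle sentences and cut them into train/dev/test parts (80/10/10)."""
    items = list(sentences)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10   # integer arithmetic avoids float rounding
    n_dev = len(items) // 10
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]   # remainder, so no sentence is dropped
    return train, dev, test
```

Fixing the seed keeps the partition reproducible across the repeated runs used for averaging.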
Since NRB and WTS are in English, we designed a simple yet generic method for projecting them to another language. First, both test sets are translated to the target language using an online translation service. In order to ensure a high quality corpus, we eliminate a sentence if the BLEU score (Papineni et al., 2002) Table 5 reports the percentage of discarded sentences for each language. Although we remove a large proportion of sentences for Finnish (fi), Croatian (hr), and German (de), we found our translation approach simpler and more systematic than generating an NRB corpus from scratch for each language. The latter approach depends on the robustness of the weak tagger, the number of Wikipedia articles and disambiguation pages per language, as well as the existence of type information. This is left as future work.
For feature-based approaches, we use the same architecture as for ELMo-LSTM (Peters et al., 2018), except that we replace English word embeddings by language-specific ones: FastText (Bojanowski et al., 2017) for static representations, and the aforementioned BERT-base models for contextualized ones. Table 6 reports the performance on the test, NRB, and WTS sets for both feature-based and fine-tuning approaches, with and without our training methods. We used the hyper-parameters of the English CONLL experiments with no further tuning. We selected the best performing models based on development set scores, and report average results over 5 runs.

Results
Mainly due to implementation details and hyper-parameter settings, our fine-tuned BERT-base models perform better on the CONLL test sets for German (83.8 vs. 80.4) and Dutch (91.8 vs. 90.0), and slightly worse on Spanish (88.0 vs. 88.4), compared to the results reported in their respective BERT papers.
Consistent with the results obtained on English for feature-based (Table 1) and fine-tuned (Table 3) models, the latter approach performs better on NRB, although by a smaller margin compared to English (+37%). More precisely, we observe a gain of +28% and +26% on German and Croatian respectively, and a gain ranging between 11% and 15% for other languages.
Nevertheless, our training methods lead to systematic and often drastic improvements on NRB, coupled with a statistically non-significant overall decrease on in-domain test sets. They do, however, incur a slight but significant drop of around 2 F1 points on WTS for feature-based models.
Altogether, these results demonstrate that name regularity bias is not specific to a particular language, even if its degree of severity varies from one language to another, and that the proposed training methods notably mitigate this bias.

Conclusion
In this work, we focused on the name regularity bias of NER models, a problem first discussed by Lin et al. (2020). We propose NRB, a benchmark specifically designed to diagnose such a bias. As opposed to existing strategies devised to measure it, NRB is composed of real sentences with easy-to-identify mentions.
We show that current state-of-the-art models perform from poorly (feature-based) to decently (fine-tuned BERT) on NRB. In order to mitigate this bias, we propose a novel adversarial training method based on adding learnable noise vectors to entity words. These learnable vectors encourage the model to better incorporate contextual information. We demonstrate that this approach greatly improves the contextual ability of existing models, and that it can be combined with the other training methods we proposed. Significant gains are observed in both low-resource and multilingual settings. To foster research on NER robustness, we encourage others to report results on NRB and WTS. 13
13 English and multilingual NRB and WTS are avail-
This study opens up new avenues of investigation. Conducting a large-scale multilingual experiment that characterizes the name regularity bias across more morphologically diverse language families is one of them, possibly leveraging massively multilingual resources such as WikiAnn (Pan et al., 2017), Polyglot-NER (Al-Rfou et al., 2015), or Universal Dependencies (Nivre et al., 2016). We can also develop a more challenging NRB by selecting sentences with multi-word entities.
Also, non-sequential labelling approaches for NER, such as those of Li et al. (2020), have reported impressive results on both flat and nested NER. We plan to measure their bias on NRB and study the benefits of applying our training methods to those approaches. Finally, we want to investigate whether our adversarial training method can be successfully applied to other NLP tasks.