Partially Supervised Named Entity Recognition via the Expected Entity Ratio Loss

We study learning named entity recognizers in the presence of missing entity annotations. We approach this setting as tagging with latent variables and propose a novel loss, the Expected Entity Ratio, to learn models in the presence of systematically missing tags. We show that our approach is both theoretically sound and empirically useful. Experimentally, we find that it meets or exceeds performance of strong and state-of-the-art baselines across a variety of languages, annotation scenarios, and amounts of labeled data. In particular, we find that it significantly outperforms the previous state-of-the-art methods from Mayhew et al. (2019) and Li et al. (2021) by +12.7 and +2.3 F1 score in a challenging setting with only 1,000 biased annotations, averaged across 7 datasets. We also show that, when combined with our approach, a novel sparse annotation scheme outperforms exhaustive annotation for modest annotation budgets.


Introduction
Named entity recognition (NER) is a critical subtask of many domain-specific natural language understanding tasks in NLP, such as information extraction, entity linking, semantic parsing, and question answering. For large, exhaustively annotated benchmark datasets, this problem has been largely solved by fine-tuning high-capacity sentence encoders pretrained on massive-scale language modeling tasks (Peters et al., 2018; Devlin et al., 2019). However, fully annotated datasets themselves are expensive to obtain at scale, creating a barrier to rapid development of models in low-resource situations.¹

¹ We have published our implementation and experimental results at https://github.com/teffland/ner-expected-entity-ratio.

Figure 1: An example low-recall sentence with two entities (one is missing) and its NER tags. The Gold row shows the true tags, the Raw row shows a false negative induced by the standard "tokens without entity annotations are non-entities" assumption, and the Latent row reflects our view of unannotated tags as latent variables.
Partial annotations, instead, may be much cheaper to obtain. For example, when building a dataset for a new entity extraction task, a domain expert may be able to annotate entity spans with high precision at a lower recall by scanning through documents inexhaustively, creating a higher diversity of contexts and surface forms by limiting the amount of time spent on individual documents. In another scenario studied by Mayhew et al. (2019), non-speaker annotators for low-resource languages may only be able to recognize some of the more common entities in the target language, but will miss many less common ones. In both of these situations, we wish to leverage partially annotated training data with high precision but low recall for entity spans. Because of the low recall, unannotated tokens are ambiguous and it is not reasonable to assume they are non-entities (the O tag). We give an example of this in Figure 1.
We address the problem of training NER taggers with partially labeled, low-recall data by treating unannotated tags as latent variables for a discriminative tagging model. We propose to combine marginal tag likelihood training (Tsuboi et al., 2008) with a novel discriminative criterion, the Expected Entity Ratio (EER), to control the relative proportion of entity tags in the sentence. The proposed loss is (1) flexibly able to incorporate prior knowledge about expected entity rates under uncertainty; (2) theoretically recovers the true tagging distribution under mild conditions; and (3) easy to implement, fast to compute, and amenable to standard gradient-based optimization. We evaluate our method across 7 corpora in 6 languages along two diverse low-recall annotation scenarios, one of which we introduce. We show that our method performs as well or better than the previous state-of-the-art methods from Mayhew et al. (2019) and the recent work of Li et al. (2021) across the studied languages, scenarios, and amounts of labeled entities. Further, we show that our novel partial annotation scheme, when combined with our method, outperforms exhaustive annotation for modest annotation budgets.

Related Works
A common paradigm for low-recall NER is automatically creating silver-labeled data using outside resources. Bellare and McCallum (2007) approach the problem by distantly supervising spans using a database with marginal tag training. Carlson et al. (2009) similarly use a gazetteer and adapt the structured perceptron (Collins, 2002) to handle partially labeled sequences, while Nothman et al. (2008) use Wikipedia and label-propagation heuristics. Peng et al. (2019) also use distant supervision to silver-label entities, but use PU-learning with specified class priors to estimate individual classifiers with ad-hoc decoding. Yang et al. (2018) and Nooralahzadeh et al. (2019) optimize the marginal likelihood (Tsuboi et al., 2008) of the distantly annotated tags but require gazetteers and some fully labeled data to handle proper prediction of the O tag. Greenberg et al. (2018) use a marginal likelihood objective to pool overlapping NER tasks and datasets, but must exploit cross-dataset constraints. Snorkel (Ratner et al., 2020) uses many sources of weak supervision, but relies on high recall and overlap to work. In contrast to these works, we do not use outside resources.
Our problem setting has connections to PU-learning, which is classically an approach to classification (Liu et al., 2002, 2003; Elkan and Noto, 2008; Grave, 2014), but here we work with tagging structures. Our approach is also related to constraint-satisfaction methods for shaping the model distribution such as CoDL (Chang et al., 2007), used by Mayhew et al. (2019), and is also related to Posterior Regularization (Ganchev et al., 2010), with the main differences being that we do not use the KL-divergence and use gradient-based updates to a nonlinear model instead of closed-form updates to a log-linear model.
The problem setup from Jie et al. (2019) and Mayhew et al. (2019) is the same as ours, but Jie et al. (2019) use a cross-validated self-training approach and Mayhew et al. (2019) use an iterative constraint-driven self-training approach to down-weigh possible false-negative O tags, which they show to outperform Jie et al. (2019). Mayhew et al. (2019) is the current state of the art on the CoNLL 2003 NER datasets (Tjong Kim Sang and De Meulder, 2003) and we compare to their work in the experiments. Recently, Li et al. (2021) have published a span-based method that uses negative sampling of non-entity spans, but they do not provide any supporting theoretical guarantees. We also compare to them in the experiments.

Methods
In this section, we describe the proposed approach. We begin with a description of the problem and notation in § 3.1, followed by the NER tagging model in § 3.2. We then describe the supervised marginal tag loss and our proposed auxiliary loss, used for learning on positive-only annotations, in § 3.3 and § 3.4, respectively. Finally, in § 3.5 we describe the full objective and give theory showing that our approach recovers the true tagging distribution in the large-sample limit.

Problem Setup and Notation
We formulate NER as a tagging problem, as is extremely common (McCallum and Li, 2003; Lample et al., 2016; Devlin et al., 2019; Mayhew et al., 2019, inter alia). In fully supervised tagging for NER, we are given an input sentence x_{1:n} = x_1 … x_n, x_i ∈ X, of length n tokens, paired with a sequence y_{1:n}, y_i ∈ Y, of tags that encode the typed entity spans in the sentence. Following previous work, we use the BILUO scheme (Ratinov and Roth, 2009). Under this formulation, an NER dataset of fully annotated sentences is a set of pairs of token and tag sequences: D = {(x^{(k)}_{1:n_k}, y^{(k)}_{1:n_k})}_{k=1}^{N}.

Partial Annotations
Normally, fully annotated tag sequences are derived from exhaustive annotation schemes, where annotators mark all positive entity spans in the text and then the filler O tag can be perfectly inferred at all unannotated tokens. Training a model on such fully annotated data is easy enough with traditional maximum likelihood estimation (McCallum and Li, 2003; Lample et al., 2016). In many cases, however, it is desirable to be able to learn on incomplete, partially annotated training data that has high precision for entity spans, but low recall (§ 4.2 discusses two such scenarios). Because of the low recall, unannotated tokens are ambiguous and it is not reasonable to assume they are non-entities (the O tag). Even in this low-recall situation, prior works (Jie et al., 2019; Mayhew et al., 2019) assume that unannotated tokens are given this non-entity tag. Their approaches then try to estimate which of these tags are "incorrect" through self-training-like schemes, iteratively down-weighing the contribution of these noisy tags to the loss with a meta training loop.
In contrast to prior work, we make no direct assumptions about unannotated tokens and treat all such positions as latent tags. In this view, a partially annotated sentence is a token sequence x_{1:n} paired with a set of observed (tag, position) pairs. Given a sentence x_{1:n}, we define y_O = {(y, i)} as the set of observed tags y at positions i. For example, in Figure 1 we would have y_O = {(U-ORG, 7)}. Under this formulation, we will be given a partially observed dataset: D = {(x^{(k)}_{1:n_k}, y_O^{(k)})}_{k=1}^{N}. We use data of this form for the rest of the work.

Tagging Model
We use a simple, relatively off-the-shelf tagging model for p(y_{1:n} | x_{1:n}; θ). Our model, BERT-CRF, first encodes the token sequence using a contextual Transformer-based (Vaswani et al., 2017) encoder, initialized from a pretrained language-model objective (Devlin et al., 2019). Given the output representations h_{1:n} from the last layer of the encoder, we then score each tag individually with a linear layer, as in Devlin et al. (2019). Finally, we model the distribution p(y_{1:n} | x_{1:n}) with a linear-chain CRF (Lafferty et al., 2001), using the individual tag scores and learned transition parameters T as potentials. Mathematically, our tagging model is given by:

p(y_{1:n} | x_{1:n}; θ) = exp( Σ_{i=1}^{n} φ_{i, y_{i−1}, y_i} ) / Z(x_{1:n}),   φ_{i, y, y′} = v_{y′}ᵀ h_i + T_{y, y′}

where φ ∈ R^{n×|Y|×|Y|} is the tensor of individual potentials, Z(x_{1:n}) is the partition function, and θ = {θ_BERT, T} ∪ {v_y}_{y∈Y} is the full set of model parameters.
A few important things to note: (1) while we call the encoder "BERT", in practice we utilize various BERT-like pretrained transformer language models from the HuggingFace Transformers (Inc., 2019) library; (2) we apply grammaticality constraints to the transition parameters T that cause the model to put zero mass on invalid transitions; and (3) we do not use special start and end states, as pretrained transformers already bookend the sentence with SOS and EOS tokens that can be assumed to always be O tags. This, combined with the transition constraints, guarantees that the tagger outputs valid sequences.
We choose this model architecture because it closely reflects recent standard practice in applied NER (Devlin et al., 2019;Inc., 2019), where a pretrained transformer is fine-tuned to the tagging dataset. However, we improve this practice by using a CRF layer on top instead of predicting all tags independently. We stress that the additional CRF layer has multiple benefits -the transition parameters and global normalization improve model capacity and, importantly, prevent invalid predictions. In preliminary experiments, we found that invalid predictions were common in some of the few-annotation scenarios we study here.
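To make the linear-chain CRF layer concrete, here is a minimal NumPy sketch of the forward algorithm (for the partition function Z) and of scoring a single tag sequence. This is an illustration, not the authors' implementation; the function names and the convention of conditioning position 0 on a fixed O start bookend (index 0) are our own assumptions.

```python
import numpy as np

def log_partition(phi):
    """Forward algorithm for a linear-chain CRF.

    phi: (n, Y, Y) array of log-potentials, where phi[i, y_prev, y]
    scores the transition into tag y at position i. Position 0 is
    assumed to condition on a fixed O start state (index 0 here).
    Returns log Z(x), the log normalizer over all tag sequences.
    """
    n, Y, _ = phi.shape
    alpha = phi[0, 0, :]  # start from the O bookend state
    for i in range(1, n):
        # log-sum-exp over the previous tag for each current tag
        scores = alpha[:, None] + phi[i]          # (Y_prev, Y)
        m = scores.max(axis=0)
        alpha = m + np.log(np.exp(scores - m).sum(axis=0))
    m = alpha.max()
    return m + np.log(np.exp(alpha - m).sum())

def sequence_log_prob(phi, tags):
    """log p(y_{1:n} | x) for one tag sequence under the CRF."""
    score = phi[0, 0, tags[0]]
    for i in range(1, len(tags)):
        score += phi[i, tags[i - 1], tags[i]]
    return score - log_partition(phi)
```

As a sanity check, exponentiating `sequence_log_prob` over all |Y|^n taggings of a short sentence sums to 1.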

Supervised Marginal Tag Loss
We train our tagger on partially annotated data by maximizing the marginal likelihood (Tsuboi et al., 2008) of the observed tags under the model:

L_p(θ) = − Σ_{(x_{1:n}, y_O) ∈ D} log p(y_O | x_{1:n}; θ)   (1)

with

log p(y_O | x_{1:n}) = log Σ_{y_{1:n} |= y_O} p(y_{1:n} | x_{1:n})   (2)

where y_{1:n} |= y_O means all taggings satisfying the observations y_O. For tree-shaped CRFs, this loss is tractable with dynamic programming. While it is possible to optimize only this loss for the given partially annotated data, doing so alone has deleterious effects in our scenario: the resulting model will not learn to meaningfully predict the O tag, by far the most common tag (Jie et al., 2019), and will thus fail to have acceptable performance, with high recall at nearly zero precision. We need another term in the loss to encourage the model to predict O tags, which we introduce next.
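Eqn. 2 can be computed with a constrained variant of the forward algorithm: observed positions are clamped to their annotated tags, and all other positions sum over the full tag set. The sketch below is illustrative only; the `observed` dictionary encoding of y_O and the O start-state convention (index 0) are our own assumptions, not the paper's code.

```python
import numpy as np

NEG_INF = -1e30  # effectively log(0) for masked-out tags

def constrained_log_partition(phi, observed):
    """Forward pass summing only over taggings consistent with y_O.

    phi: (n, Y, Y) log-potentials (position 0 conditions on an O
    start state at index 0). observed: dict {position: tag}.
    """
    n, Y, _ = phi.shape

    def mask(i, vec):
        if i in observed:  # clamp observed positions to their tag
            out = np.full(Y, NEG_INF)
            out[observed[i]] = vec[observed[i]]
            return out
        return vec

    alpha = mask(0, phi[0, 0, :])
    for i in range(1, n):
        scores = alpha[:, None] + phi[i]
        m = scores.max(axis=0)
        alpha = mask(i, m + np.log(np.exp(scores - m).sum(axis=0)))
    m = alpha.max()
    return m + np.log(np.exp(alpha - m).sum())

def marginal_log_likelihood(phi, observed):
    """log p(y_O | x): constrained log Z minus unconstrained log Z."""
    return constrained_log_partition(phi, observed) - constrained_log_partition(phi, {})
```

With no observations the marginal log-likelihood is 0 (probability 1), and clamping a single position recovers that tag's marginal.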

Expected Entity Ratio Loss
As has been observed in prior work (Augenstein et al., 2017; Peng et al., 2019; Mayhew et al., 2019), named entity tags (versus O tags) occur at relatively stable rates over the entire distribution of sentences for different named entity datasets with the same task specification. For any specific dataset, we call this proportion the "expected entity ratio" (EER), which is simply the marginal probability p(y ≠ O) of some tag y being part of an entity span. Given an estimate of this EER, ρ = p(y ≠ O), for the dataset in question, we propose to impose a second loss that directly encourages the tag marginals under the model to match the given EER, up to a margin of uncertainty γ. This loss is given by:

L_u(θ; ρ, γ) = max(0, |ρ̂(θ) − ρ| − γ)   (3)

where

ρ̂(θ) = E_{x_{1:n} ∼ p_X}[ Σ_{i=1}^{n} p(y_i ≠ O | x_{1:n}; θ) ] / E_{x_{1:n} ∼ p_X}[ n ]   (4)

is the model's expected rate of entity tags.
For linear-chain CRFs, the inner expected count can be computed exactly, because it factors over the model potentials and reduces to a simple sum over the tag marginals under the model,² and is differentiable. The outer expectation is not feasible for large datasets on modern hardware, so we approximate it with Monte-Carlo estimates from mini-batches and optimize using stochastic gradient descent (Robbins and Monro, 1951). We also note that the loss in Eqn. 3 takes the same form as the ε-insensitive hinge loss for support vector regression machines (Vapnik, 1995; Drucker et al., 1996), though our use-case is quite different. Additionally, this loss function is differentiable everywhere except at the ρ ± γ points.
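Given per-token entity marginals p(y_i ≠ O | x) from forward-backward, a mini-batch estimate of the EER loss is only a few lines. The sketch below assumes the loss takes the ε-insensitive hinge form max(0, |ρ̂(θ) − ρ| − γ), consistent with the description in the text; the function name and data layout are our own.

```python
import numpy as np

def eer_loss(entity_marginals, rho=0.15, gamma=0.05):
    """Expected Entity Ratio hinge loss for one mini-batch.

    entity_marginals: list of 1-D arrays; entry j holds, for each
    token i of sentence j, the model marginal p(y_i != O | x) (for a
    CRF these come from forward-backward). The loss penalizes the
    batch-level expected entity rate rho_hat only when it falls
    outside [rho - gamma, rho + gamma].
    """
    expected_entities = sum(m.sum() for m in entity_marginals)
    total_tokens = sum(len(m) for m in entity_marginals)
    rho_hat = expected_entities / total_tokens
    return max(0.0, abs(rho_hat - rho) - gamma)
```

A batch whose expected entity rate lies inside the margin incurs (essentially) zero loss; one far outside it is penalized linearly.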

Combined Objective and Consistency
The final loss, presented in Eqn. 5, combines Eqns. 1 and 3 with a balancing coefficient λ_u:

L(θ; λ_u, ρ, γ) = L_p(θ) + λ_u L_u(θ; ρ, γ)   (5)

This loss has an intuitive explanation. The supervised loss L_p optimizes the entity recall of the model. The addition of the EER loss L_u further controls the precision of the model. Together, they form a principled objective whose optimum recovers the true distribution under mild conditions.
We now present a theorem that gives insight into why the loss in Eqn. 5 is justified. First, we introduce the following set of assumptions:³

Assumption 1. Assume there are finite vocabularies of words X and tags Y, and that Y contains a special tag O. We have some model p(y_{1:n} | x_{1:n}; θ) with parameter space Θ. Assume some distribution p_{X,Y}(x_{1:n}, y_{1:n}) over sequence pairs x_{1:n} ∈ X⁺, y_{1:n} ∈ Y⁺, and define S = {x_{1:n} ∈ X⁺ : p_X(x_{1:n}) > 0}. Assume in addition the following: (a) p_{Y|X} is deterministic: that is, for any x_{1:n} ∈ S, there exists some y_{1:n} ∈ Y⁺ such that p_{Y|X}(y_{1:n} | x_{1:n}) = 1.
(c2) Positive entity support: for all x_{1:n} ∈ S, only y and O are possible under p̄, and the tag y has probability strictly greater than zero.
Given these assumptions, define L_∞(θ; λ_u, ρ, γ) to be the expected loss under the distribution p̄ (the population analogue of Eqn. 5). We can then state the following theorem.

Theorem 2. Assume that all conditions in Assumption 1 hold. Define ρ = ρ* where ρ* is the known marginal entity tag distribution, γ = 0, and λ_u > 0. Then for any θ ∈ arg min L_∞(θ; λ_u, ρ, γ), the following holds:

∀(x_{1:n}, y_{1:n}) ∈ S × Y⁺,  p(y_{1:n} | x_{1:n}; θ) = p_{Y|X}(y_{1:n} | x_{1:n})

The proof of the theorem is in the Appendix. Intuitively, this result is important because it shows that in the limit of infinite data, parameter estimates optimizing the loss function will recover the correct underlying distribution p_{Y|X}. More formally, this theorem is the first critical step in proving consistency of an estimation method based on optimization of the loss function. In particular (see for example Section 4 of Ma and Collins (2018)), it should be relatively straightforward to derive a result of the form

d(p̂ᵐ_{Y|X}, p_{Y|X}) → 0 as m → ∞

under some appropriate definition of distance between distributions d, where p̂ᵐ_{Y|X} is the distribution under parameters θᵐ derived from a random sample Dᵐ of size m. However, for reasons of space we leave this to future work.⁴

Benchmark Experiments
We evaluate our approach on 7 datasets in 6 languages for two diverse annotation scenarios (14 datasets in total) and compare to strong and state-of-the-art baselines.

Corpora
Our original datasets come from two benchmark NER corpora in 6 languages. We use the English (eng-c), Spanish (esp), German (deu), and Dutch (ned) languages from the CoNLL 2003 shared tasks (Tjong Kim Sang and De Meulder, 2003). We also use the NER annotations for English (eng-o), Mandarin Chinese (chi), and Arabic (ara) from the Ontonotes5 corpus (Hovy et al., 2006).
By studying across this wide array of corpora, we test the approaches in a variety of language settings, as well as dataset and task sizes. The CoNLL corpus specifies 4 entity classes while the Ontonotes corpus has 18 different classes and they span 7.4K to 82K training sentences. We use standard train/dev/test document splits. For each corpus, we generate two partially annotated datasets according to the scenarios from § 4.2.

Simulated Annotation Scenarios
We simulate two partial annotation scenarios that model diverse real-world situations. The first is the "Non-Native Speaker" (NNS) scenario from Mayhew et al. (2019) and the second, "Exploratory Expert" (EE), is a novel scenario inspired by industry. We choose these two samplers to make our results more applicable to practitioners. The simpler alternative of dropping entity annotations uniformly at random (as in Jie et al. (2019) and Li et al. (2021)) is not realistic, leaving an overly diverse set of surface mentions with none of the biases incurred by real-world partial labeling. While there are other partial annotation scenarios compatible with our method that we could have considered here as well, such as using Wikipedia or gazetteers for silver-labeled supervision, we chose to work with simulated scenarios that allow us to study a large array of datasets without introducing the confounding effects of choices for outside resources.

Scenario 1: Non-Native Speaker (NNS)
Our first low-recall scenario is the one proposed by Mayhew et al. (2019), wherein they study NER datasets that simulate non-native speaker annotators. To simulate data for this scenario, Mayhew et al. (2019) downsample annotations grouped by mention until a recall of 50%. For example, if "New York" is sampled, then all annotations with "New York" as their mention in the text are dropped. After the recall is dropped to 50%, the precision is lowered to 90% by adding short, randomly typed false-positive spans. The reasoning for this slightly more complicated scheme is that it better reflects the biases incurred via non-native speaker annotation. When non-native speakers exhaustively annotate for NER, they often systematically miss unrecognized entities and occasionally incorrectly annotate false-positive spans.⁵ The original sampling code used in Mayhew et al. (2019) is not available and we have introduced datasets that were not in their study, so we reimplemented their sampler and used our version across all of our corpora for consistency. We do, however, run their model code on our datasets, so our results with respect to their approach still hold.
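A rough sketch of such a mention-grouped sampler is below. This is our own illustrative reconstruction under stated assumptions (the data representation, the `<FALSE-POS>` placeholder for inserted noise spans, and the exact stopping rule are hypothetical); the reimplemented sampler's details may differ.

```python
import random

def nns_sample(annotations, target_recall=0.5, target_precision=0.9, seed=0):
    """Simulate non-native-speaker (NNS) annotation.

    annotations: list of (mention_text, span) gold entity annotations.
    Entire surface-form groups are dropped until recall falls to
    target_recall, then false-positive spans (placeholders here) are
    added until precision falls to roughly target_precision.
    """
    rng = random.Random(seed)
    mentions = sorted({m for m, _ in annotations})
    rng.shuffle(mentions)
    kept = list(annotations)
    for m in mentions:
        if len(kept) / len(annotations) <= target_recall:
            break
        kept = [(mm, s) for mm, s in kept if mm != m]  # drop the whole group
    # precision = |kept| / (|kept| + |false positives|)
    n_fp = int(len(kept) * (1 - target_precision) / target_precision)
    kept += [("<FALSE-POS>", None)] * n_fp
    return kept
```

Note that dropping by surface form (rather than per-annotation) is what produces the systematic, mention-level misses described above.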

Scenario 2: Exploratory Expert (EE)
In addition to Mayhew et al. (2019)'s non-native speaker scenario, we introduce a significantly different scenario that reflects another common real-world low-recall NER situation. Though it has not been studied before in the literature, it is inspired by accounts of partially annotated datasets encountered in industry.
In the "Exploratory Expert" (EE) scenario, we suppose a new NER task to be annotated by a domain expert with limited time. Here, in the initial "exploratory" phase of annotation, the expert may wish to cover more ground by inexhaustively scanning through documents in the corpus, annotating the first few entities they see in a document before moving on, stopping once they have added M total entity spans. The advantage of this approach is that, by being inexhaustive, the resulting set of mentions and contexts will have more diversity than by using exhaustive annotation. Compared to exhaustive annotation, the disadvantage is annotators may miss entities and the annotations are biased toward the top of documents.
We simulate this scenario by first removing all annotations from the dataset, then adding back entity spans with the following process. First, we select a document at random without replacement, then scan this document left to right, adding back entity spans with probability 0.8, until 10 entities have been added, then moving on to the next random document. The process halts when M = 1,000 total entity spans have been added back to the dataset. We note that this assumes that the expert annotators are skimming, sometimes missing entities (20% of the time), but also assumes that the expert does not make flagrant mistakes and so does not insert random false-positive spans.

⁵ It is worth noting that the NNS scenario is also quite close to a silver-labeled scenario using a seed dictionary with 50% recall, only it has some additional false-positive noise.
An important aspect of this scenario in our experiments is the scale of the number of kept annotations. In previous works (Jie et al., 2019;Mayhew et al., 2019;Li et al., 2021), the number of kept annotations is not dropped below 50% of the complete dataset. By keeping only 1K entities, this scenario is significantly more impoverished than those previously studied (1K entities leaves less than 10% of annotations for all datasets, ranging from 0.8% to 8.5%, depending on the corpus).
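The EE sampling process described above can be simulated as follows. This is a sketch under the stated parameters (keep probability 0.8, 10 spans per document, budget M); the document representation is our own assumption.

```python
import random

def ee_sample(documents, budget=1000, p_keep=0.8, per_doc=10, seed=0):
    """Simulate the Exploratory Expert (EE) annotator.

    documents: list of lists of gold entity spans (one list per doc,
    in reading order). Documents are visited in random order without
    replacement; within a document we scan left to right, keeping
    each span with probability p_keep, moving on after per_doc spans
    are kept, and halting at budget total spans.
    """
    rng = random.Random(seed)
    order = list(range(len(documents)))
    rng.shuffle(order)
    kept = []  # (doc_index, span) pairs
    for d in order:
        added_here = 0
        for span in documents[d]:
            if len(kept) >= budget:
                return kept
            if added_here >= per_doc:
                break
            if rng.random() < p_keep:
                kept.append((d, span))
                added_here += 1
    return kept
```

The per-document cap is what concentrates annotations at the top of a few documents, the bias exploited by the preprocessing variants in § 4.4.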

Approaches
We compare several modeling approaches on the benchmark corpora, detailed below.

Gold
For comparison, we report our tagging model trained with supervised sequence likelihood on the original gold datasets. This provides an upper bound on tagging performance and puts any performance degradation from partially supervised datasets into perspective. We do not expect any of the other methods to outperform this.

Raw
In the Raw-BERT baseline, we make the naive assumption that all unobserved tags in the low-recall datasets are the O tag, reflecting the second row of Figure 1, and train with the supervised likelihood. This is a weak baseline that we expect to have low recall.

Cost-aware Decoding (Raw+CD)
This stronger baseline, suggested by a reviewer, explores a simple modification to the Raw baseline at test time: we increase the cost of predicting an O tag during inference in an attempt to artificially increase the recall. That is, we introduce an additional hyperparameter b_O ≥ 0 that is subtracted from the O tag potentials, biasing the model away from predicting O tags:

φ̃_{i, y, O} = φ_{i, y, O} − b_O, for all i, y

Intuitively, this approach will work well if the tag potentials consistently rank false-negative entity tokens higher than true O tokens. To select b_O, we perform a model-based hyperparameter search (Head et al., 2020) using a Gaussian process with 30 evaluations of the validation set F1 score for each dataset's trained Raw-BERT model.
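Cost-aware decoding amounts to shifting the O-tag potentials before Viterbi decoding. A minimal sketch, assuming the same (n, Y, Y) log-potential layout and O-at-index-0 convention as before (our assumptions, not the paper's code):

```python
import numpy as np

O = 0  # index of the O tag in the tag vocabulary (assumed)

def cost_aware_decode(phi, b_o):
    """Viterbi decoding with the O tag penalized by b_o >= 0.

    phi: (n, Y, Y) log-potentials; b_o is subtracted from every
    potential entering the O tag, biasing the decoder away from
    predicting O.
    """
    phi = phi.copy()
    phi[:, :, O] -= b_o
    n, Y, _ = phi.shape
    back = np.zeros((n, Y), dtype=int)
    delta = phi[0, O, :]  # O start bookend, as in the tagger
    for i in range(1, n):
        scores = delta[:, None] + phi[i]   # (Y_prev, Y)
        back[i] = scores.argmax(axis=0)
        delta = scores.max(axis=0)
    tags = [int(delta.argmax())]
    for i in range(n - 1, 0, -1):
        tags.append(int(back[i, tags[-1]]))
    return tags[::-1]
```

With b_o = 0 this is ordinary Viterbi; a sufficiently large b_o flips borderline O predictions to entity tags.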

Constrained Binary Learning (CBL)
The CBL baseline is a state-of-the-art approach to partially supervised NER from Mayhew et al. (2019). The main idea of the approach is to estimate which O tags are false negatives, and remove them from training.
Constrained Binary Learning (CBL) approaches this through a constrained, self-training-like meta-algorithm, based on Constraint-Driven Learning (Chang et al., 2007). The algorithm starts off with a binarized version of the problem (O tag vs. not) and initializes instance weights of 1 for all O tags. It then estimates their final weights by iteratively training a model, predicting tags for the training data, then down-weighing some tags based on the confidence of these predictions according to a linear-programming constraint on the total number of allowed O tags. At each iteration, the number of allowed O tags is decreased slightly, and this loop is repeated until the final target entity ratio (our ρ) is satisfied by the weights. A final tagger is then trained on the original tag set using a weighted modification of the supervised tagging likelihood.
For this method, we used the code exactly as was provided, with the following exception. For all non-English languages, we were not able to obtain the original embeddings used in their experiments, and so we have used language-specific pretrained embeddings from the FastText library (Grave et al., 2018). The base tagging model from Mayhew et al. (2019) utilizes the BiLSTM-CRF approach from Ma and Hovy (2016). The CBL meta-algorithm, however, is agnostic to the underlying scoring architecture of the CRF, and so we test the CBL algorithm both with their BiLSTM scoring architecture and with our BERT-based scoring architecture, which we call CBL-LSTM and CBL-BERT, respectively. By testing the CBL meta-algorithm with our tagging model, we control for the different modeling choices and get a clear view of how their CBL approach compares to ours.

Span-based Negative Sampling (SNS)
The SNS-BERT baseline is a recent state-of-the-art approach to partially supervised NER from Li et al. (2021). It uses the same BERT-based encoding architecture, but has a different modeling layer on top. Instead of tagging each token, they use a span-based scheme, treating each possible pair of tokens as a potential entity and classifying all of the spans independently, using an ad-hoc confidence-based decoding step to eliminate overlapping spans. To deal with the resulting class imbalance (O spans are overwhelmingly common) and low-recall entity annotations, they propose to sample spans from the set of unlabeled spans as negatives. While it is possible that they incorrectly sample false-negative entities, they argue that this has very low probability. For this method, we used the code as provided, but controlled for the same pretrained encoder weights as our other models.

Expected Entity Ratio (EER)
The EER-BERT model implements our proposed approach, using the proposed tagger ( § 3.2) and loss function described in Eqn. 5.

Preprocessing
All datasets came in documents, pre-tokenized into words, with gold sentence boundaries. Recent work (Akbik et al., 2019; Luoma and Pyysalo, 2020) has demonstrated that larger inter-sentential document context is useful for maximizing performance, so we work with full documents instead of individual sentences.⁶ For approaches that used a pretrained transformer, some documents did not fit into the 512-token maximum length. In these cases, we split documents into maximal contiguous chunks at sentence boundaries. Also, for pretrained transformer approaches, we expand the tag sequences to match the subword tokenizations.
Because the low-recall data in the EE scenario concentrates annotations at the top of only a few documents, it is possible to identify and omit large unannotated portions of text from the training data. We hypothesize that this will significantly improve model outcomes for the baselines because it significantly cuts down on the number of false negative annotations. Therefore, we explore three preprocessing variants for all EE models: (1) all uses the full dataset as given; (2) short drops all documents with no annotations; and (3) shortest drops all sentences after the last annotation in a document (subsuming short). Model names are suffixed with their preprocessing variants. We note that these approaches do not apply to the NNS scenario, as it has many more annotations spread more evenly throughout the data.
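The three preprocessing variants can be expressed as simple filters over documents. The sketch below uses a hypothetical representation (each document is a list of (tokens, annotations) sentences); the variant names match the text.

```python
def preprocess_ee(documents, variant="shortest"):
    """EE preprocessing variants: all / short / shortest.

    documents: list of docs; each doc is a list of sentences, where a
    sentence is (tokens, annotations) and annotations is a possibly
    empty list of observed spans.
    """
    if variant == "all":       # use the full dataset as given
        return documents
    if variant == "short":     # drop documents with no annotations
        return [d for d in documents if any(anns for _, anns in d)]
    if variant == "shortest":  # truncate after the last annotated sentence
        out = []
        for d in documents:
            last = max((i for i, (_, anns) in enumerate(d) if anns), default=-1)
            if last >= 0:
                out.append(d[: last + 1])
        return out
    raise ValueError(variant)
```

Note that `shortest` subsumes `short`: a document with no annotations has no last annotated sentence and is dropped entirely.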

Hyperparameters
All hyperparameters were given reasonable defaults, using recommendations from previous work.
For pretrained transformer models, we used the Huggingface (Inc., 2019) implementations of roberta-base on English datasets and bert-base-multilingual-cased (Devlin et al., 2019) for the other languages. The vector representations used by these models are 768-dimensional, and we used matching dimensions for other vector sizes throughout the model. We used a learning rate of 2 × 10⁻⁵ with a slanted triangular schedule peaking at 10% of the iterations (Devlin et al., 2019). For batch size, we use the maximum batch size that allows us to train in memory on a Tesla V100 GPU (14 for CoNLL data, 2 for Ontonotes5 data). We found that training for more epochs than originally recommended (Devlin et al., 2019) was necessary for convergence, and used 20 epochs for the all variants and 50 epochs for the significantly smaller short and shortest variants.⁷ The only hyperparameter we adjusted (from a preliminary experiment measuring dev set performance) was setting λ_u = 10. We originally tried a weight of λ_u = 1, but then found that the scale of the L_p loss massively overpowered L_u, so we increased it to λ_u = 10, which yielded good performance. We did not try other values after that.
In important contrast to benchmark experiments from prior work (Jie et al., 2019; Mayhew et al., 2019), we do not assume we know the gold entity tag ratio for each dataset when setting ρ. Instead, to make the evaluation more realistic, we use a reasonable guess of ρ = 0.15 with a margin of uncertainty γ = 0.05 for all approaches and datasets. We choose this range because it covers most of the gold ratios observed in the datasets.⁸

⁷ For the CBL-LSTM approach, we use the hyperparameters from Mayhew et al. (2019): these are more epochs (45) and a higher learning rate of 10⁻³.

⁸ In early experiments we found that the CBL code from

Results
The results of our evaluation are presented in Table 1. The first row shows the result of training our tagger with the original gold data. These results are competitive with previously published results from similar pretrained transformers (Devlin et al., 2019) that do not use ensembles or NER-specific pretraining (Luoma and Pyysalo, 2020; Baevski et al., 2019; Yamada et al., 2020). Interestingly, we also found that our tagging CRF outperformed the span-based independent distribution of Li et al. (2021) on all gold datasets.

NNS Performance. The second set of rows shows test F1 scores of models from § 4.3 for the NNS sampled datasets. We first note that the CBL-LSTM approach from Mayhew et al. (2019) significantly underperformed for all non-English languages (and its scores are much lower than the results from their paper with similar data). We used their code as is, only changing the pretrained word vectors, and so suspect that this is due to lower-quality word vectors obtained from FastText instead of their custom-fit vectors. This is confirmed by the results of using their CBL meta-algorithm with our proposed tagging architecture, which is competitive with EER-BERT in this setting. Otherwise, we found that all strong baselines and our method performed quite similarly. This suggests that performance in the NNS regime, with relatively high recall (50%) and little label noise per positively labeled mention, is not bottlenecked by approaches to resolving missing mentions. Further improvements in this regime will likely come from other sources, such as better pretraining or supplemental corpora. Because of this, we recommend that future evaluations for partially supervised NER focus on more impoverished annotation counts, such as the EE scenario we study next.
EE Performance. In the third group of rows, we show test F1 scores for each model using the more challenging EE scenario with only 1,000 kept annotations. In this setting, using the dataset as is for supervised training (Raw-BERT-all) fails to converge, but smarter preprocessing largely alleviates this problem, with Raw-BERT-shortest obtaining an average F1 of 65.0. Adding cost-aware decoding (Raw+CD-BERT-shortest) further improves upon the standard baseline (F1 66.9). In Table 1, † indicates that for EE the test F1 score is statistically significantly better than SNS-BERT-shortest (p < 0.01) (details in footnote 9); other pairs between SNS-BERT-shortest and EER-BERT-short/shortest were not significant.

Even with only 1,000 biased and incomplete annotations - less than 10% of the original annotations for all datasets - we find that our approach (EER-BERT-short) still achieves an F1 score of 71.7 on average. This outperforms the best strong baselines, Raw+CD, CBL, and SNS, by 4.8, 12.7, and 2.3 F1 score, respectively. The closest baseline, SNS-BERT-shortest from Li et al. (2021), is competitive with EER-BERT-short on four of the datasets, but performs significantly worse on the other three as well as overall,9 leading us to conclude that our method has a performance edge in this regime. Further, EER-BERT-short performs only 4.1 average F1 worse on EE data than EER-BERT-all on NNS data. We also note that EER-BERT-shortest significantly outperformed SNS-BERT-shortest on two datasets, but the overall comparison failed to reject the null hypothesis.

Footnote 9: We assessed significance between model pairs using a percentile bootstrap of F1 score differences, resampling test set documents with replacement 100K times (Efron and Tibshirani, 1994) and measuring the paired F1 score differences of EER-BERT-short/shortest and SNS-BERT-shortest. Significance was assessed by whether the two-sided 99% confidence interval contained 0.0. To assess overall significance, we concatenated the test datasets before bootstrapping.
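The percentile-bootstrap significance test described in footnote 9 can be sketched as follows. This is a minimal illustration rather than our experiment code: `f1_a` and `f1_b` are assumed stand-ins for corpus-level F1 scoring functions of the two systems, and the resample count is reduced from 100K for brevity.

```python
import random

def paired_bootstrap_f1_diff(docs, f1_a, f1_b, n_boot=1000, alpha=0.01, seed=0):
    """Percentile bootstrap over document-level resamples of the difference
    in corpus F1 between two systems (Efron and Tibshirani, 1994).

    docs: per-document items needed to score both systems.
    f1_a, f1_b: functions mapping a list of such items to a corpus F1 score.
    Returns (lo, hi), the two-sided (1 - alpha) CI of F1(a) - F1(b).
    Significance is declared when the interval excludes 0.0.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample test documents with replacement, keeping pairing intact.
        sample = [docs[rng.randrange(len(docs))] for _ in docs]
        diffs.append(f1_a(sample) - f1_b(sample))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi
```

Because documents are resampled jointly for both systems, the interval reflects the paired difference rather than two independent variances.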
Another important finding is that EER-BERT is much more robust to preprocessing choices than the baselines. The baselines all view missing entities as O tags/spans (at least to start) and these relatively common false negatives severely throw off convergence. By removing most of the unannotated text with preprocessing, we effectively create a much smaller corpus that has nearly 80% recall (for shortest). In contrast, EER-BERT's view of the data makes no assertions about the class of individual unobserved tokens and so is less sensitive to the relative proportion of false negative annotations. This is useful in practice, as our approach should better handle partial annotation scenarios with wider varieties of false negative proportions that may not be so easily addressed with simple preprocessing.
Speed. A pragmatic appeal of our approach compared to CBL (Mayhew et al., 2019) is training time. On NNS data, EER-BERT-all is on average 7.6 times faster than CBL-BERT-all, and on EE data EER-BERT-short is 2.2 times faster than CBL-BERT-shortest, even though it uses more data. This is because EER does not require a costly outer self-training loop.10

Conclusion. These results illustrate that our approach outperforms the previous strong and state-of-the-art baselines in the challenging low-recall EE setting with only 1K annotations, while also being more robust to the relative proportions of false negatives in the training corpus.11

Analysis of EER hyperparams
Recall that the definition of our EER loss in Eqn. 3 defines an acceptable region ρ̂_θ ∈ [ρ − γ, ρ + γ] for learned models, and that in this experiment we used ρ = 0.15 and γ = 0.05 for all datasets, regardless of the true entity ratios ρ*. Two interesting questions then are: (1) how sensitive is the procedure to the choices of ρ and γ?; and (2) how closely do the final learned models reflect the true entity ratios of the data? We address these next.

Robustness to choices of ρ and γ
To study robustness, we varied the choices of ρ and γ for EER-BERT-short on the CoNLL English EE dataset with three randomly sampled datasets. Table 4 shows test F1 scores across seeds for various settings of ρ ± γ. We first show three point estimates with γ = 0.0: the first at ρ = ρ* = 0.23, then shifted left and right around ρ* to ρ = 0.15 and ρ = 0.30, respectively. We then widen the range with γ = 0.05 and show the benchmark result ρ = 0.15, followed by shifts of ρ ± 0.1.

Footnote 10: We unfortunately cannot comment on the relative speed of SNS because runtimes cannot be inferred from the SNS code output, though we do not expect a fundamental speed advantage of one over the other, as neither uses self-training.

Footnote 11: We also note that the EE scenario averages for all models are significantly affected by the poor performance on the Arabic Ontonotes5 (ara) dataset. After further inspection of the training curves, we found that all models exhibited very slow convergence on this dataset and/or failed to converge in the allotted number of epochs.

[Table 4: Varying EER hyperparameters (ρ ± γ) vs. test F1 scores; table body not recoverable here.]

From the table we can glean two interesting points. The first is that in settings where the high end of the range of acceptable EERs is greater than ρ* (i.e., when ρ + γ = 0.30), there is a substantial drop in performance (mean = 82.3). The second is that the complementary group of settings, where ρ + γ ≤ ρ*, are all high-performing with little variance (mean = 87.8, std = 0.4). Together, these suggest that the true sensitivity of the proposed EER approach is to the high end of the interval, and that it is best to estimate that value conservatively, whereas the low end of the range is unimportant. This result agrees well with the intuitions provided in § 3.5: since L_p encourages models with high recall without regard for precision (ρ̂_θ → 1), it is best to set ρ + γ such that L_u introduces a tension in the combined loss by encouraging ρ̂_θ ≤ ρ*. This is not the whole story, however, as we discuss next.
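To make the role of the interval concrete, the sketch below shows one plausible form of an interval-hinged penalty on the corpus-level expected entity ratio: zero inside [ρ − γ, ρ + γ], growing linearly outside it. The exact functional form here is an illustrative assumption, not the precise L_u of Eqn. 3.

```python
def eer_penalty(entity_marginals, rho, gamma):
    """Hinge-style penalty on the expected entity ratio.

    entity_marginals: per-token marginal probabilities p_theta(token is part
        of an entity), pooled over unannotated tokens in the corpus.
    rho, gamma: center and half-width of the acceptable ratio interval.
    Returns 0 when the expected ratio lies in [rho - gamma, rho + gamma],
    otherwise the distance to the nearest interval endpoint.
    """
    expected_ratio = sum(entity_marginals) / len(entity_marginals)
    lo, hi = rho - gamma, rho + gamma
    if expected_ratio < lo:
        return lo - expected_ratio
    if expected_ratio > hi:
        return expected_ratio - hi
    return 0.0
```

Under this view, a high-recall model pushed toward ρ̂_θ → 1 by L_p only feels back-pressure from the penalty once ρ̂_θ exceeds ρ + γ, which matches the observed sensitivity to the high end of the interval.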

Convergence towards ρ *
The results from the previous experiment suggest that L_u simply serves to drive ρ̂_θ → ρ + γ. Since we used ρ + γ = 0.2 for all datasets in the benchmark, we would then expect to find ρ̂_θ ≈ 0.2 for all models.
We tested this hypothesis by calculating the entity ratio ρ̂_θ of the final trained EER-BERT-short models on the EE datasets (leaving out ara, since it failed to converge) and computing the average difference of each ρ̂_θ from the corresponding true ρ*, resulting in a mean absolute error of only 0.018. This is much closer on average than if the models had simply converged to 0.2 (the mean absolute error would then be 0.048), indicating that our approach tends to converge more closely to the true entity ratio ρ* than to the estimate given by ρ + γ. In particular, we found that all final models had ρ̂_θ < 0.2 except CoNLL English, where ρ̂_θ = 0.23, quite close to the gold ρ* even though it was outside the target range. This result is encouraging in that it suggests the EER loss, in balance with the supervised marginal tag loss, does more to recover ρ* than simply drive ρ̂_θ → ρ + γ.
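The comparison above amounts to a mean-absolute-error computation over datasets. In the sketch below, the per-dataset ratios are invented placeholders for illustration only (the paper reports just the aggregate numbers, 0.018 vs. 0.048):

```python
def mean_abs_error(rho_hats, rho_stars):
    """Average |rho_hat - rho*| across datasets."""
    assert len(rho_hats) == len(rho_stars)
    return sum(abs(a - b) for a, b in zip(rho_hats, rho_stars)) / len(rho_hats)

# Hypothetical learned and true entity ratios (placeholders, not real values).
rho_hats = [0.14, 0.17, 0.23, 0.19]
rho_stars = [0.16, 0.18, 0.23, 0.21]
mae_learned = mean_abs_error(rho_hats, rho_stars)
# Baseline: what if every model had simply converged to rho + gamma = 0.2?
mae_fixed = mean_abs_error([0.2] * len(rho_stars), rho_stars)
```

If the learned ratios track ρ* rather than ρ + γ, `mae_learned` comes out smaller than `mae_fixed`, which is the pattern we observe.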

EE vs. Exhaustive Experiments
In situations where we only have partially annotated data, without the option of exhaustive annotation, the utility of being able to train on the data as provided is self-evident. However, given the potential upsides of partial annotation relative to exhaustive annotation - it is mentally less taxing and yields increased contextual diversity for a fixed annotation budget - it is natural to ask whether it is actually better to use a sparse annotation scheme.

Annotation Speed User Study
We begin with a user study of annotation speed, comparing EE to the standard exhaustive annotation scheme. Following methodology from Li et al. (2020), we recorded 8 annotation sessions from 4 NLP researchers familiar with NER. Using the Ontonotes5 English corpus, we asked each annotator to annotate for two 20-minute sessions using the BRAT (Stenetorp et al., 2012) annotation tool, one exhaustively and the other following the EE scheme. We split documents into two randomized groups and systematically varied which group was annotated with each scheme, and in what order, to control for document and ordering effects. Then, for each annotator, we measured the number of annotated entities per minute under both schemes and report the ratio of EE annotations per minute to exhaustive annotations per minute (i.e., the relative speed of EE to exhaustive). We found that, although speed varied greatly between annotators (ranging from roughly 4 annotations/min to 9 annotations/min across sessions), EE annotation and exhaustive annotation were essentially the same speed, with EE being 3% faster on average. Thus we may fairly compare exhaustive and EE schemes using model performance at the same number of annotations, which we do next.12
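The study's headline number reduces to a per-annotator speed ratio, averaged over annotators. A minimal sketch, with invented session counts (the real per-annotator counts are not reported here):

```python
def relative_speed(ee_counts, ex_counts, minutes=20):
    """Average per-annotator ratio of EE to exhaustive annotation speed.

    ee_counts, ex_counts: entities annotated by each annotator in their
    EE and exhaustive sessions of `minutes` length, respectively.
    A return value above 1.0 means EE annotation was faster on average.
    """
    ratios = [(ee / minutes) / (ex / minutes)
              for ee, ex in zip(ee_counts, ex_counts)]
    return sum(ratios) / len(ratios)
```

Because each annotator completes one session per scheme, the ratio controls for large between-annotator speed differences.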

Performance Learning Curves
In this experiment, we compare the best traditional supervised training from the benchmark (Raw-BERT-shortest) with our proposed approach (EER-BERT-short) on EE-annotated and exhaustively annotated documents from CoNLL'03 English (eng-c) at several annotation budgets, M ∈ {100 (0.4%), 500 (2.1%), 1K (4.3%), 5K (21.3%), 10K (42.6%)}. For each annotation budget, we sampled three datasets with different random seeds for both annotation schemes and trained both modeling approaches. This allows us to study how all four combinations of annotation style and training methods perform at varying magnitudes of annotation counts. In addition to low-recall annotations, we compared our EER approach to supervised training on the gold data.
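Constructing a budget-M partially annotated dataset can be sketched as below. Note that this sketch uses uniform random subsampling of entity annotations for simplicity, whereas the EE scheme we study keeps a biased subset; all names here are illustrative.

```python
import random

def subsample_annotations(docs, budget, seed=0):
    """Keep only `budget` entity annotations across a corpus.

    docs: list of documents, each a list of (start, end, label) entity spans.
    Returns a corpus of the same shape with only the kept spans; all other
    tokens are left unannotated and treated as latent during training.
    """
    rng = random.Random(seed)
    # Flatten spans with their document index so sampling is corpus-wide.
    flat = [(i, span) for i, spans in enumerate(docs) for span in spans]
    kept = rng.sample(flat, min(budget, len(flat)))
    out = [[] for _ in docs]
    for i, span in kept:
        out[i].append(span)
    return out
```

Repeating this with different seeds per budget yields the three sampled datasets per setting used in the learning curves.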
In Figure 2, we show learning curves for the average test performance of all four annotation/training variants. From the plot, we can infer several points. First, on EE-annotated data, using our EER loss substantially outperforms traditional likelihood training at all amounts of partial annotation, but the opposite is true on exhaustively annotated data. This indicates that the training method should be tailored to the annotation scheme.
The comparison between EE data with EER training and exhaustive data with likelihood training is more nuanced. At only 100 annotations, exhaustive annotation worked best on average in our sample, but all methods exhibit high variance due to the large variation in which entities were annotated. Interestingly, at modest sizes of only 500 and 1K annotations, EE-annotated data with our proposed EER-short approach outperformed exhaustive annotation with traditional supervised training, with gains of +1.8 and +1.5 average F1 for 500 and 1K annotations, respectively. These results, however, reverse as the annotation counts grow: at 5K annotations, the two approaches perform the same (90.8), and at even larger annotation counts, exhaustive annotation with traditional training outperforms our approach by +0.5 at 10K annotations and +0.8 on the gold dataset. This indicates that EE annotation, paired with our EER loss, is competitive with, and potentially advantageous over, exhaustive annotation with traditional training at modest annotation counts, but that exhaustive annotation with traditional training is better at large annotation counts. This suggests that a hybrid annotation approach, where we sparsely annotate data at first but eventually switch to exhaustive annotation as the process progresses, is a promising direction for future work. We note that our EER loss can easily incorporate observed O tags from exhaustively annotated documents in y_O, and so would work in this setup without modification.

Conclusions
We study learning NER taggers in the presence of partially labeled data and propose a simple, fast, and theoretically principled approach, the Expected Entity Ratio loss, to deal with low-recall annotations. We show empirically that it outperforms the previous state of the art across a variety of languages, annotation scenarios, and amounts of labeled data. Additionally, we give evidence that sparse annotations, when paired with our approach, are a viable alternative to exhaustive annotation for modest annotation budgets. Though we study two simulated annotation scenarios to provide controlled experiments, our proposed EER approach is compatible with a variety of other incomplete annotation scenarios, such as incidental annotations (e.g., from web links on Wikipedia), initialization with seed annotations from incomplete distant supervision or gazetteers, or embedding as a learning procedure in an active/iterative learning framework, which we intend to explore in future work.