Where's My Head? Definition, Dataset and Models for Numeric Fused-Heads Identification and Resolution

We provide the first computational treatment of fused-head constructions (FH), focusing on numeric fused heads (NFH). FH constructions are noun phrases (NPs) in which the head noun is missing and is said to be 'fused' with its dependent modifier. This missing information is implicit and is important for sentence understanding. The missing references are easily filled in by humans but pose a challenge for computational models. We formulate the handling of FH as a two-stage process: identification of the FH construction and resolution of the missing head. We explore the NFH phenomenon in large corpora of English text and create (1) a dataset and a highly accurate method for NFH identification; (2) a 10k-example (1M tokens) crowd-sourced dataset for NFH resolution; and (3) a neural baseline for the NFH resolution task. We release our code and dataset in the hope of fostering further research into this challenging problem.


Introduction
Many elements in language are not stated explicitly but need to be inferred from the text. This is especially true in spoken language, but it also holds for written text. Identifying the missing information and filling the gap is a crucial part of language understanding. Consider the sentences below:
(1) I'm 42, Cercie.
(2) It's worth about two million.
(3) I've got two months left, three at the most.
(4) I make an amazing Chicken Cordon Bleu. She said she'd never had one.
In (1), it is clear that the sentence refers to the age of the speaker, but this is not stated explicitly in the sentence. Similarly, in (2) the speaker discusses the worth of an object in some currency. In (3), the number refers back to an object already mentioned before: months. Finally, in (4) the 'one' refers back to the Chicken Cordon Bleu in the previous sentence.
All of the above examples are of numeric fused heads (NFH), a linguistic construction which is a subclass of the more general fused heads (FH) construction, limited to numbers. FH are noun phrases (NPs) in which the head noun is missing and is said to be "fused" with its dependent modifier (Huddleston et al., 2002). In the examples above, the numbers '42', 'two million', 'three' and 'one' function as FHs, whereas their actual heads (YEARS OLD, DOLLAR, months, Chicken Cordon Bleu) are missing and need to be inferred.
While we focus on NFH, FH in general can also occur with other categories, such as determiners and adjectives, as in the following sentences:
(5) Only the rich will benefit.
(6) I need some screws but can't find any.
Here the adjective 'rich' refers to rich PEOPLE and the determiner 'any' refers to screws. In this work we focus on the numeric fused head.
Such sentences often arise in dialog situations, as well as in other genres. Numeric expressions play an important role in various tasks, including textual entailment (Lev et al., 2004; Dagan et al., 2013), solving arithmetic problems, numeric reasoning (Trask et al., 2018) and language modeling (Spithourakis and Riedel, 2018).
While the inferences required by the NFH construction may seem trivial for a human hearer, they are for the most part not explicitly addressed by current natural language processing systems. Indeed, tasks such as information extraction, machine translation, question answering and others could greatly benefit from recovering such implicit knowledge prior to (or in conjunction with) running the model. 1 We find NFHs particularly interesting to model: they are common (Section 2), easy for humans to understand and resolve (Section 5), important for language understanding, not handled by current systems (Section 7) and hard for current methods to resolve (Section 6).
The main contributions of this work are: • We provide an account of NFH constructions and their distribution in a large corpus of English dialogues, where they account for 41.2% of the numbers. We similarly quantify the prevalence of NFH in other textual genres, showing that they account for between 22.2% and 37.5% of the mentioned numbers.
• We formulate the FH identification (identifying cases that need to be resolved) and resolution (inferring the missing head) tasks.
• We create an annotated corpus for NFH identification and show that the task can be automatically solved with high accuracy.
• We create a 900,000-token annotated corpus for NFH resolution, comprising ~10K NFH examples, and present a strong baseline model for tackling the resolution task.

1 To give an example from information extraction, consider a system based on syntactic patterns that needs to handle the sentence "Carnival is expanding its ships business, with 12 to start operating next July.". In the context of MT, Google Translate currently translates the English sentence "I'm in the center lane, going about 60, and I have no choice" into French as "Je suis dans la voie du centre, environ 60 ans, et je n'ai pas le choix", changing the implicit speed to an explicit time period.

Numeric Fused-Heads
Throughout the paper, we refer to the visible number in the FH as the anchor and to the missing head as the head.
In FH constructions the implicit heads are missing and are said to be fused with the anchors, which are either determiners or modifiers. In the case of NFH, the modifier role is realized as a number (see examples in Table 1). The anchors then function both as the determiner/modifier and as the head: the parent and the other modifiers of the original head are syntactically attached to the anchor. For example, in Figure 1 the phrase 'the remaining 100 million' contains an NFH construction with the anchor '100 million', which is attached to the sentence through the dotted black dependency edges. The missing head, murders, appears in red together with its missing dependency edges. 2

Figure 1: Example of an NFH for the sentence "We solved one murder, now we just have the remaining 100 million (murders)". The 'murders' token is missing, and fused with the '100 million' numeric span.

Distribution NFH constructions are very common in dialog situations (indeed, we show in Section 4 that they account for over 40% of the numbers in a large English corpus of movie dialogs), but they are also common in written text such as product reviews or journalistic text. Using the NFH identification model described in Section 4.2, we examined the distribution of NFH in different corpora and domains: monologues (TED talks (Cettolo et al., 2012)), Wikipedia (WikiText-2 and WikiText-103 (Merity et al., 2016)), journalistic text (PTB (Marcus et al., 1993)) and product reviews (Amazon reviews 3). We found that 35.5%, 33.2%, 32.9%, 22.2% and 37.5% of the numbers, respectively, are NFH.

FH Types
We distinguish between two kinds of FH, which we call Reference and Implicit. In Reference FH, the missing head is referenced explicitly somewhere else in the discourse, either in the same sentence or in surrounding sentences. In Implicit FH, the missing head does not appear in the text and needs to be inferred by the reader or hearer based on the context or world knowledge.

FH vs. Other Phenomena
FH constructions are closely related to ellipsis constructions and are also reminiscent of coreference resolution and other anaphora tasks.
FH vs. Ellipsis With respect to ellipsis, some of the NFH cases we consider can be analyzed as nominal ellipsis (cf. i, ii in Table 1, and (3) in the intro). Other cases of head-less numbers do not traditionally admit an ellipsis analysis. We do not distinguish between the cases and consider all head-less number cases as NFH.
FH vs. Coreference With respect to coreference, some Reference FH cases may seem similar to coreference cases. However, we stress that these are two different phenomena: in coreference, the mention and its antecedent both refer to the same entity, while the NFH anchor and its head reference, like in ellipsis, may share a symbol but do not refer to the same entity. Existing coreference resolution datasets do consider some FH cases, but not in a systematic way. They are also restricted to cases where the antecedent appears in the discourse, i.e. they do not cover any of the NFH Implicit cases.

3 https://www.kaggle.com/bittlingmayer/amazonreviews
FH vs. Anaphora Anaphora is another similar phenomenon. As opposed to coreference, anaphora (and cataphora, which involves a forward rather than a backward reference) includes mentions of the same type but different entities. However, anaphora does not cover our Implicit NFH cases, which are not anaphoric but refer to some external context or world knowledge. We note that anaphora/cataphora is a very broad concept that encompasses many different sub-cases of specific anaphoric relations. There is some overlap between some of these cases and FH constructions.
Pronominal one The word one is a very common NFH anchor (61% of the occurrences in our corpus) and can be used either as a number (viii) or as a pronoun (xiii). The pronoun usage can be replaced with someone. For consistency, we consider the pronominal usages to be NFH, with the implicit head PEOPLE. 4 The one-anaphora phenomenon was previously studied on its own (Gardiner, 2003; Ng et al., 2005). Ng et al. (2005) divided uses of one into six categories: Numeric (xv), Partitive (v), Anaphoric (xii), Generic (vii), Idiomatic (xiii) and Unclassified. We consider all of these, except the Numeric category, as NFH constructions.

Inclusive Definition of NFH
While our work is motivated by the linguistic definition of FH, we take a pragmatic approach in which we do not determine the scope of the NFH task based on fine-grained linguistic distinctions. Rather, we take an inclusive approach that is motivated by considering the end user of an NFH resolution system, who we imagine is interested in resolving all numbers that are missing a nominal head. Therefore, we consider all cases that "look like an NFH" as NFH, even if the actual linguistic analysis would label them as gapping, ellipsis, anaphoric pronominal-one or other phenomena. We believe this makes the task more consistent and easier to understand for end users, annotators, and model developers.

Computational Modeling and Underlying Corpus
We treat the computational handling of FH as two related tasks: identification and resolution. We create annotated NFH corpora for both.
Underlying Corpus As the FH phenomenon is prevalent in dialog situations, we base our corpus on dialog excerpts from movie and TV-series scripts (the IMDB corpus). The corpus contains 117,823 different episodes and movies. Every such item may contain several scenes, with an average of 6.9 scenes per item. Every scene may contain several speaker turns, each of which may span several sentences. The average number of turns per scene is 3.0. The majority of the scenes have at least two participants. Some of the utterances refer to the global movie context. 5

NFH Identification
In the identification stage, we seek NFH anchors within headless NPs which contain a number. More concretely, given a sentence, we seek a list of spans corresponding to all of the anchors within it. An NFH anchor is restricted to a single number, but not a single token. For example, thirty six is a two-token number which can serve as an NFH anchor. We assume all anchors are contiguous spans. The identification task can be reduced to a binary decision, categorizing each numeric span in the sentence as FH/not-FH.
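The candidate-extraction step described above can be sketched as grouping maximal runs of number tokens into contiguous spans. The sketch below assumes POS-tagged input (using the Penn Treebank `CD` tag for numbers); the function name and representation are illustrative, not the paper's actual implementation.

```python
def numeric_spans(tagged):
    """Group contiguous number tokens (POS tag CD) into candidate
    anchor spans, returned as (start, end) indices, end exclusive."""
    spans, start = [], None
    for i, (_, tag) in enumerate(tagged):
        if tag == "CD":
            if start is None:
                start = i
        elif start is not None:
            spans.append((start, i))
            start = None
    if start is not None:  # span runs to the end of the sentence
        spans.append((start, len(tagged)))
    return spans

# 'thirty six' is a two-token number forming a single anchor span
sent = [("I", "PRP"), ("want", "VBP"), ("thirty", "CD"),
        ("six", "CD"), ("of", "IN"), ("them", "PRP")]
print(numeric_spans(sent))  # [(2, 4)]
```

Each extracted span would then be passed to the binary FH/not-FH classifier.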

NFH Resolution
The resolution task resolves an NFH anchor to its missing head. Concretely, given a text fragment w 1 , ..., w n (a context) and an NFH anchor a = (i, j) within it, we seek the head(s) of the anchor. For Implicit FH, the head can be any arbitrary expression. While our annotated corpus supports this (Section 5), in practice our modeling (Section 6) as well as the annotation procedure favors selecting one out of 5 prominent categories or the OTHER category.
For Reference FH, the head is selected from the text fragment. In principle a head can span multiple tokens (e.g. 'unexpected thought' in (xii)). This is also supported by our annotation procedure. In practice, we take the syntactic head of the multi-token answer to be the single-token missing element, and defer the boundary resolution to future work.
In case multiple heads are possible for the same anchor (e.g. (viii,xiv) in Table 1), all should be recovered. Hence, the resolution task is a function from a (text, anchor) pair to a list of heads, where each head is either a single token in the text or an arbitrary expression.

Numeric Fused-Head Identification
The FH task is composed of two sub-tasks. In this section, we describe the first: identifying NFH anchors in a sentence. We begin with a rule-based method, based on the FH definition. We then proceed to a learning-based model, which achieves better results.

Test-set
We create a test set for assessing the identification methods by randomly collecting 500 dialog fragments with numbers, and labeling each number as NFH or not NFH. We observe that more than 41% of the test-set numbers are FHs, strengthening the motivation for dealing with the NFH phenomenon.

Rule-based Identification
FHs are defined as NPs in which the head is fused with a dependent element, resulting in an NP without a noun. 6 With access to an oracle constituency tree, NFHs can be easily identified by looking for such NPs. In practice, we resort to using automatically produced parse-trees.
We parse the text using the Stanford constituency parser (Chen and Manning, 2014) and look for noun phrases 7 which contain a number but not a noun. This already produces reasonably accurate results, but we found that we can improve further by introducing 10 additional text-based patterns, customized based on a development set. These rules look for common cases that are often not captured by the parser. For example, a conjunction pattern involving a number followed by 'or', such as "eight or nine clubs" 8 , where 'eight' is an NFH which refers to 'clubs'.
Parsing errors result in false-positives. For example in "You've had [one too many cosmos].", the Stanford parser analyzes 'one' as an NP, despite the head ('cosmos') appearing two tokens later. We cover many such cases by consulting with an additional parser. We use the SPACY dependency parser (Honnibal and Johnson, 2015) and filter out cases where the candidate anchor has a noun as its syntactic head or is connected to its parent via a nummod label. We also filter cases where the number is followed or preceded by a currency symbol.
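The dependency-based filters in the paragraph above can be sketched as a predicate over parsed tokens. The dict fields below (`text`, `pos`, `dep`, `head`) are an illustrative stand-in for a real parser's output, not the SPACY API, and the toy parses are hypothetical.

```python
CURRENCY = {"$", "€", "£"}

def keep_candidate(tokens, idx):
    """Return False for number tokens that the filters rule out as NFH
    anchors. Each token is a dict with 'text', 'pos', 'dep' and 'head'
    (the index of its syntactic parent)."""
    tok = tokens[idx]
    head = tokens[tok["head"]]
    if head["pos"] == "NOUN":      # the head noun is explicit in the text
        return False
    if tok["dep"] == "nummod":     # the number modifies an overt noun
        return False
    # an adjacent currency symbol marks a money amount, not an NFH
    neighbors = tokens[max(idx - 1, 0):idx] + tokens[idx + 1:idx + 2]
    if any(t["text"] in CURRENCY for t in neighbors):
        return False
    return True

# "You've had one too many cosmos": 'one' attaches to the noun 'cosmos'
toks = [{"text": "had", "pos": "VERB", "dep": "ROOT", "head": 0},
        {"text": "one", "pos": "NUM", "dep": "nummod", "head": 2},
        {"text": "cosmos", "pos": "NOUN", "dep": "dobj", "head": 0}]
print(keep_candidate(toks, 1))   # False

# "I've got two left": no overt noun head, so 'two' survives the filters
toks2 = [{"text": "got", "pos": "VERB", "dep": "ROOT", "head": 0},
         {"text": "two", "pos": "NUM", "dep": "dobj", "head": 0},
         {"text": "left", "pos": "VERB", "dep": "acl", "head": 1}]
print(keep_candidate(toks2, 1))  # True
```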
Evaluation We evaluate the rule-based identification on our test set, resulting in 97.4% precision and 93.6% recall. The identification errors are almost exclusively a result of parsing mistakes in the underlying parsers. An example of a false-negative error is the sentence: "The lost six belong in Thorn Valley.", where the dependency parser mistakenly labeled 'belong' as a noun, resulting in a negative classification. An example of a false-positive error is the sentence: "our God is the one true God", where the dependency parser labeled the head of 'one' as 'is'.

Learning-based Identification
We improve the NFH identification using machine learning. We create a large but noisy dataset by considering all the numbers in the corpus and treating the NFH identified by the rule-based approach as positive (79,678 examples) and all other numbers as negative (104,329 examples). We randomly split the dataset into train and development sets in a 90%/10% split. Table 2 reports the dataset size statistics. We train a linear SVM classifier 9 with 4 features: (1) concatenation of the anchor-span tokens; (2) lower-cased tokens in a 3-token window surrounding the anchor span; (3) POS tags of tokens in a 3-token window surrounding the anchor span; and (4) the POS tag of the syntactic head of the anchor. The features for the classifier require running a POS tagger and a dependency parser. These can be omitted with a small performance loss (see Table 3 for an ablation study on the dev set).
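The four feature templates can be sketched as a function from a candidate span to a feature dict; the feature-name scheme is illustrative. In a full setup these dicts would be vectorized (e.g. with a DictVectorizer) and fed to a linear SVM, which is omitted here.

```python
def nfh_features(tokens, pos_tags, span, head_pos):
    """Feature templates for a candidate anchor span (i, j), j exclusive.
    tokens/pos_tags cover the sentence; head_pos is the POS tag of the
    anchor's syntactic head (omittable with a small performance loss)."""
    i, j = span
    # (1) concatenation of the anchor-span tokens
    feats = {"anchor=" + "_".join(tokens[i:j]).lower(): 1}
    # (2)+(3) words and POS tags in a 3-token window around the span
    for k in range(max(i - 3, 0), min(j + 3, len(tokens))):
        if i <= k < j:
            continue  # skip the anchor itself
        rel = k - i if k < i else k - j + 1  # signed offset from the span
        feats[f"w[{rel}]={tokens[k].lower()}"] = 1
        feats[f"p[{rel}]={pos_tags[k]}"] = 1
    # (4) POS tag of the anchor's syntactic head
    feats["head_pos=" + head_pos] = 1
    return feats

f = nfh_features(["I", "want", "thirty", "six", "of", "them"],
                 ["PRP", "VBP", "CD", "CD", "IN", "PRP"],
                 (2, 4), "VBP")
```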
On the manually labeled test set, the full model achieves 97.5% precision and 95.6% recall, surpassing the rule-based approach.

NFH Statistics
We use the rule-based positive examples of the dataset and report some statistics regarding the NFH phenomenon. The most common anchor in the NFH dataset, by a very large margin, is the token 'one' 10 with 48,788 occurrences (61.0% of the data), while the second most common is the token 'two' with 6,263 occurrences (8.4%). There is a long tail of token occurrences, with 1,803 unique anchor tokens (2.2% of the NFH dataset). Most of the anchors consist of a single token (97.4%), 1.3% contain 2 tokens, and the longest anchor consists of 8 tokens ('Fifteen million sixty one thousand and seventy six.'). The numbers tend to be written as words (86.7%) rather than as digits.

NFH Resolution Dataset
Having the ability to identify NFH cases with high accuracy, we turn to the more challenging task of NFH resolution. The first step is creating a gold annotated dataset.

Corpus Candidates
Using the identification methods, which achieve satisfactory results, we identify a total of 79,884 NFH cases in the IMDB corpus. We find that a large number of the cases follow a small set of patterns and are easy to resolve deterministically: four deterministic patterns account for 28% of the NFH cases. The remaining cases are harder. We randomly chose a subset of 10,000 of the harder cases for manual annotation via crowd-sourcing. We only annotate cases where the rule-based and learning-based identification methods agree.

Deterministic Cases
The four deterministic patterns along with their coverage are detailed in Table 4. The first two are straightforward string matches for the patterns no one and you two, which we find to almost exclusively resolve to PEOPLE. The other two are dependency-based patterns for partitive (four [children] of the children) and copular (John is the one [John]) constructions. We collected a total of 22,425 such cases. While we believe these cases need to be handled by any NFH resolution system, we do not think systems should be evaluated on them. Therefore, we provide these cases as a separate dataset.
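The two string-match patterns can be sketched directly, as below; the partitive and copular patterns additionally require following dependency arcs, so they are only stubbed with a comment. The function name and interface are illustrative.

```python
def deterministic_head(tokens, span):
    """Resolve the easy NFH patterns; return None when the case is not
    covered and must go to the crowd-annotated (harder) dataset."""
    i, j = span
    # 'no one' / 'you two': match the anchor with its preceding token
    phrase = " ".join(tokens[i - 1:j]).lower() if i > 0 else ""
    if phrase in {"no one", "you two"}:
        return "PEOPLE"
    # partitive ('four [children] of the children') and copular
    # ('John is the one [John]') patterns need a dependency parse
    # and are omitted from this sketch
    return None

print(deterministic_head(["no", "one", "was", "there"], (1, 2)))  # PEOPLE
print(deterministic_head(["I", "want", "two"], (2, 3)))           # None
```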

Annotation via Crowdsourcing
The FH phenomenon is relatively common and can be understood easily by non-experts, making the task suitable for crowd-sourcing.
The Annotation Task For every NFH anchor, the annotator should decide whether it is a Reference FH or an Implicit FH. For Reference, they should mark the relevant textual span. For Implicit, they should specify the implicit head from a closed list. In cases where the missing head belongs to the implicit list but also appears as a span in the sentence (reference), the annotators are instructed to treat it as a reference. To encourage consistency, we ran an initial annotation in which we identified common implicit cases: YEAR (a calendar year, example (ix) in Table 1), AGE (example (x)), CURRENCY (example (xi); while the source of the text suggests US dollars, we do not commit to a specific currency), PERSON/PEOPLE (example (vi)) and TIME (a daily hour, example (iii)). The annotators are then instructed to either choose from these five categories; to choose OTHER and provide free-form text; or to choose UNKNOWN in case the intended head cannot be reliably deduced from the given text. 11 For the Reference cases, the annotators can mark any contiguous span in the text. We then simplify their annotations and consider only the syntactic head of the marked span. 12 This could be done automatically in most cases, and was done manually in the few remaining cases. The annotator must choose a single span. In case the answer comprises several spans, as in examples (viii, xiv), we rely on it surfacing as a disagreement between the annotators, which we then pass on for further resolution by expert annotators.

Table 4: Examples of NFH whose heads can be resolved deterministically. The first two patterns are the easiest to resolve: they just have to match as-is, and their head is the PEOPLE class. The last two patterns depend on a dependency parser and can be resolved by following arcs in the parse tree.
In every task (HIT in AMT jargon), a sentence with the FH anchor was presented (the target sentence). Each target sentence was presented with at most two dialog turns before it and one dialog turn after it. This was the sole context shown, to avoid exhausting the AMT workers (turkers) with long texts; in the vast majority of the examined examples, the answer appeared within that scope. Every HIT contained a single NFH example. In cases of more than one NFH per sentence, the sentence was split into different HITs. The annotators were presented with the question: "What does the number [ANCHOR] refer to?", where [ANCHOR] was replaced with the actual number span, and were asked to choose from 8 possible answers: REFERENCE, YEAR, AGE, CURRENCY, PERSON/PEOPLE, TIME, OTHER and UNKNOWN (see Figure 2 for a HIT example). Choosing the REFERENCE category requires marking a span in the text corresponding to the referred element (the missing head). The turkers were instructed to prefer this category over the others when possible. Therefore, in example (xiv) the Reference answers were favored over the PEOPLE answer. Choosing the OTHER category required entering free-form text.
Post-annotation, we unify the Other and Unknown cases into a single OTHER category.
Each example was labeled by three annotators. On the categorical decision (the 1-of-7 choice, without considering the spans selected for the REFERENCE text, and combining the OTHER and UNKNOWN categories), 73.1% of the cases had perfect agreement (3/3), 25.6% had majority agreement (2/3), and 1.3% had complete disagreement. The Fleiss kappa agreement (Fleiss, 1971) is k = 0.73, a substantial agreement score, suggesting that the annotators tend to agree on the answer in most cases. Figure 3 shows the confusion matrix for the 1-of-7 task, excluding the cases of complete disagreement. The more difficult cases involve the REFERENCE class, which is often confused with PEOPLE and OTHER.

Final Labeling Decisions
Post-annotation, we ignore the free text entry for OTHER and unify OTHER and UNKNOWN into a single category. However, our data collection process (and the corpus we distribute) contain this information, allowing for more complex task definitions in future work.
The disagreement cases surface genuinely hard cases, such as the ones below:
(7) Mexicans have fifteen, Jews have thirteen, rich girls have sweet sixteen...
(8) All her communications are to Minnesota numbers. There's not one from California.
(9) And I got to see Irish. I think he might be the one that got away, or the one that got put-a-way.
The majority of the partial-agreement cases (1,576) are REFERENCE vs. OTHER/UNKNOWN, which are indeed quite challenging (e.g. Example 9, where two out of three turkers selected the REFERENCE answer and marked Irish as the head, and the third turker selected the PERSON/PEOPLE label, which is also true but less meaningful from our perspective).
The final labeling decision was carried out in two phases. First, categorical labeling was done using the majority label; the 115 examples with complete disagreement (e.g. Example 7, which was tagged as YEAR, REFERENCE ('birthday', which appeared in the context) and OTHER (free text: 'special birthday')) were annotated manually by experts.
The second stage dealt with the REFERENCE labels (5,718 cases). We associate each annotated span with the lemma of its syntactic head, and consider answers as equivalent if they share the same lemma string. This results in 5,101 full-agreement cases on the lemma level. The remaining 617 disagreement cases (e.g. Example (8)) were passed to further annotation by the expert annotators. During the manual annotation we also allowed multiple heads for a single anchor (e.g. for (viii, xiv) in Table 1).
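The lemma-level agreement check can be sketched as follows. The toy lemma table stands in for a real lemmatizer, and the function names are illustrative; span-to-head extraction is assumed to have already happened.

```python
# toy lemmatizer standing in for a real one (e.g. spaCy's)
TOY_LEMMAS = {"murders": "murder", "months": "month"}

def lemma(word):
    return TOY_LEMMAS.get(word.lower(), word.lower())

def reference_agreement(head_words):
    """Return the agreed head lemma if all annotators' marked heads
    share a lemma, else None (the case goes to expert annotators)."""
    lemmas = {lemma(w) for w in head_words}
    return lemmas.pop() if len(lemmas) == 1 else None

print(reference_agreement(["murders", "murder", "murders"]))  # murder
print(reference_agreement(["murders", "months", "murder"]))   # None
```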
An interesting case in Reference FH is constructions in which the referenced head is not unique. Consider example (viii) in Table 1: the word 'one' refers to either men or busses. Another example of such a case is example (xiv) in Table 1, where the word 'two' refers both to fussy old maid and to flashy young man. Notice that the two cases have different interpretations: the referenced heads in (viii) have an or relation between them, whereas the relation in (xiv) is and.

NFH Statistics
General We collected a total of 9,412 annotated NFH. The most common class is REFERENCE (45.0% of the dataset). The second most common class is OTHER (23.5%), which is the union of the original OTHER class, in which turkers had to write the missing head, and the UNKNOWN class, in which no clear answer could be identified in the text. The majority of this joined class comes from the UNKNOWN label (68.3%). The remaining 5 closed-class categories account for the other 31.5% of the cases. A full breakdown is given in Figure 4. The anchor tokens in the dataset mainly consist of the token 'one' (49.0% of the dataset), with the tokens 'two' and 'three' being the second and third most common. 377 (3.9%) of the anchors are singletons, appearing only once.

Reference Cases
The dataset consists of a total of 4,237 REFERENCE cases. The vast majority of them (3,938 cases) were labeled with a single referred element, 238 with two reference-heads and 16 with three or more.
In most cases, the reference span can be found near the anchor span: in 2,019 of the cases the reference is in the same sentence as the anchor, and in 1,747 it appears in a previous or following sentence. Furthermore, in most cases (82.7%) the reference span appears before the anchor, and only in 5.1% of the cases does it appear after it; an example of such a case is example (xiv) in Table 1. In the remaining cases, references appear both before and after the anchor.

NFH Resolution Dataset
The final NFH Resolution dataset consists of 900,777 tokens containing 9,412 instances of gold-labeled resolved NFH. The resolution was done by 3 Mechanical Turk annotators per task, with a high agreement score (k = 0.73). 14 The REFERENCE cases are annotated with at least one referring item. The OTHER class unifies several other categories (None and some other scarce Implicit classes), but we retain the original turkers' answers to allow future work to apply more fine-grained solutions for these cases.

Where's my head? Resolution Model
We consider the following resolution task: given a numeric anchor and its surrounding context, we need to assign it a single head. The head can be either a token from the text (for Reference FH) or one of six categories (the 5 most common categories and OTHER) for Implicit FH. 15 This combines two different kinds of tasks. The REFERENCE case requires selecting the most adequate token from the text, suggesting a formulation similar to coreference resolution (Ng, 2010) and implicit argument identification (Gerber and Chai, 2012; Moor et al., 2013). The Implicit case requires selection from a closed list, a formulation similar to word-tagging-in-context tasks, where the word (in our case, span) to be tagged is the anchor. A further complication is the need to weigh the different decisions (Implicit vs. Reference) against each other. Our solution is closely modeled after the state-of-the-art coreference resolution system of Lee et al. (2017). 16 However, the coreference-centric architecture had to be adapted to the particularities of the NFH task. Specifically, (a) NFH resolution does not involve cluster assignments, and (b) it requires handling the Implicit cases in addition to the Reference ones.
The proposed model combines both decisions, a combination which resembles the copy mechanisms in neural MT (Gu et al., 2016) and the Pointer Sentinel Mixture Model in neural LM (Merity et al., 2016). As we only consider referring mentions as single tokens, we discarded the original model's features which handled the multi-span representation (e.g. the attention mechanism). Furthermore, as the resolution task already receives a numeric anchor, it is redundant to calculate a mention score. In preliminary experiments we did try to add an antecedent score, with no resulting improvement. Our major adaptations to the Lee et al. (2017) model, described below, are the removal of the redundant components and the addition of an embedding matrix for representing the Implicit classes.

14 The Reference cases were treated as a single class for computing the agreement score.
15 This is a somewhat simplified version of the full task defined in Section 3. In particular, we do not require specifying the head in the case of OTHER, and we require a single head rather than a list of heads. Nonetheless, we find this variant to be both useful and challenging in practice. For the few multiple-head cases, we consider each of the items in the gold list to be correct, and defer a fuller treatment to future work.
16 Newer systems such as Zhang et al. (2018) show improvements on the coreference task, but using components which focus on the clustering aspect of coreference, which is irrelevant to the NFH task.

Architecture
Given an anchor, our model assigns a score to each possible anchor-head pair and picks the one with the highest score. The head can be either a token from the text (for the Reference case) or one-of-six category labels (for the Implicit case). We represent the anchor, each of the text tokens and each category label as vectors.
Each of the implicit classes c 1 , ..., c 6 is represented as an embedding vector c i , which is randomly initialized and trained with the system.
To represent the sentence tokens, we first represent each token as a concatenation of the token embedding and the last state of a character LSTM (Hochreiter and Schmidhuber, 1997):

x_i = [e_i ; charLSTM(e_{i,c_1}, ..., e_{i,c_k})]

where e_i is the i-th token embedding and e_{i,c_j} is the j-th character of the i-th token. These representations are then fed into a text-level biLSTM, resulting in the contextualized token representations t_i:

t_1, ..., t_n = biLSTM(x_1, ..., x_n)

Finally, the anchor, which may span several tokens, is represented as the average over its contextualized tokens.

We predict a score s(h, a) for every possible head-anchor pair, where h ∈ {c_1, ..., c_6, t_1, ..., t_n} and h is the corresponding vector. The pair is represented as the concatenation of the head, the anchor and their element-wise multiplication, and scored with a multi-layer perceptron:

s(h, a) = MLP([h ; a ; h ∘ a])

We normalize all of the scores using a softmax, and train to minimize the cross-entropy loss.

Pre-trained LM To take advantage of the recent success of pre-trained language models (Peters et al., 2018; Devlin et al., 2018), we also experiment with ELMo contextualized embeddings in place of the embedding-matrix and character-LSTM concatenation.
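The scoring head can be sketched in plain Python at toy dimensions (the actual model uses 100-dimensional biLSTM states, a 150-unit hidden layer and trained weights; the vectors and weights below are made-up placeholders). Each candidate, whether an implicit-class embedding or a contextualized token, is paired with the anchor as [h; a; h∘a], scored by a one-hidden-layer tanh MLP, and the scores are normalized with a softmax.

```python
import math

def mlp_score(pair, W1, b1, w2):
    """One hidden tanh layer followed by a scalar output."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, pair)) + b)
              for row, b in zip(W1, b1)]
    return sum(w * h for w, h in zip(w2, hidden))

def resolve(anchor, candidates, W1, b1, w2):
    """Score every head candidate h against anchor a with
    MLP([h; a; h*a]) and return softmax probabilities."""
    scores = []
    for h in candidates:
        pair = h + anchor + [hi * ai for hi, ai in zip(h, anchor)]
        scores.append(mlp_score(pair, W1, b1, w2))
    m = max(scores)                      # stabilized softmax
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# toy setup: 2-dim vectors, 3 hidden units, arbitrary fixed weights
anchor = [0.1, -0.2]
candidates = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # e.g. c_1, c_2, t_1
W1 = [[0.1, 0.2, -0.1, 0.0, 0.3, -0.2],
      [0.2, -0.1, 0.0, 0.3, -0.2, 0.1],
      [-0.3, 0.2, 0.1, 0.0, 0.1, -0.1]]
b1 = [0.0, 0.1, -0.1]
w2 = [0.5, -0.4, 0.3]
probs = resolve(anchor, candidates, W1, b1, w2)
```

The predicted head is the candidate with the highest probability; training would minimize cross-entropy against the gold head.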

Training Details
The character embedding size is 30 and the character LSTM dimension is 10. We use Google's pre-trained 300-dimension w2v embeddings (Mikolov et al., 2013) and keep the embeddings fixed during training. The text-level LSTM dimension is 50. The Implicit embedding size is the same as the biLSTM output, 100 units. The MLP has a single hidden layer of size 150 and uses tanh as the non-linear function. We use dropout of 0.2 on all hidden layers, internal representations and token representations. We train using the Adam optimizer (Kingma and Ba, 2015) and a learning rate of 0.001 with early stopping, based on the development set. We shuffle the training data before every epoch. The annotation allows more than one referent answer per anchor; in such cases, we take the one closest to the anchor as the answer for training, and allow either one when evaluating. The experiments using ELMo replace the pre-trained word embeddings and character LSTM; they use the default parameters in the AllenNLP framework (Gardner et al., 2017), with 0.5 dropout on the network and without gradient updates on the contextualized representations.

Experiments and Results
Dataset  splits.
Metrics We measure the performance of NFH head resolution using accuracy: for every example, we check whether the model predicted the correct label. We report two additional measurements: a binary classification accuracy that distinguishes only between the Reference and Implicit cases, and a multiclass (categorical) classification accuracy that measures class identification while treating all REFERENCE selections as a single decision, regardless of the chosen token.
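The three measurements can be sketched as follows, under our own illustrative label convention in which a prediction is either one of the six implicit classes (a string) or a REFERENCE pick (a tuple with a token index):

```python
def accuracies(golds, preds):
    """golds/preds: lists of labels, e.g. "YEAR" or ("REF", token_index).
    Returns (label accuracy, binary Reference/Implicit accuracy,
    categorical accuracy collapsing all REFERENCE selections)."""
    def is_ref(label):
        return isinstance(label, tuple) and label[0] == "REF"

    def category(label):
        return "REFERENCE" if is_ref(label) else label

    n = len(golds)
    label_acc = sum(g == p for g, p in zip(golds, preds)) / n
    binary_acc = sum(is_ref(g) == is_ref(p) for g, p in zip(golds, preds)) / n
    categorical_acc = sum(category(g) == category(p)
                          for g, p in zip(golds, preds)) / n
    return label_acc, binary_acc, categorical_acc
```

For instance, predicting the wrong REFERENCE token counts as an error for label accuracy but as correct for both the binary and categorical measurements.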

Results
We find that 91.8% of the Reference cases are nouns. To provide simple baselines for the task, we report accuracies solely on the Reference examples (ignoring the Implicit ones) when choosing one of the surrounding nouns. Choosing the first noun in the text, the last noun, or the noun closest to the anchor leads to scores of 19.1%, 20.3% and 39.2%, respectively. We conduct two more experiments to test our model on the two FH kinds, Reference and Implicit. In these experiments we assume an oracle that tells us the head type (Implicit or Reference) and restrict the candidate set to the correct kind during both training and testing. Table 5 summarizes the results of the oracle experiments as well as of the full model.
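The three noun-selection baselines can be sketched as follows (an illustrative implementation that assumes tokens already carry POS tags; ties in distance are broken toward the earlier noun, an assumption of ours):

```python
def noun_baselines(pos_tags, anchor_index):
    """Given the POS tag of each token and the anchor position, return the
    token indices chosen by the first-noun, last-noun and closest-noun
    baselines (None for each if the sentence contains no noun)."""
    nouns = [i for i, tag in enumerate(pos_tags) if tag.startswith("NN")]
    if not nouns:
        return None, None, None
    first = nouns[0]
    last = nouns[-1]
    closest = min(nouns, key=lambda i: abs(i - anchor_index))
    return first, last, closest
```

Even the strongest of the three, the closest-noun heuristic, reaches only 39.2% on the Reference examples, illustrating that proximity alone is a weak signal for this task.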
The final models' accuracies are summarized in Table 6. The complete model trained on the entire training data achieves 65.6% accuracy on the development set and 60.8% on the test set. The model with ELMo embeddings (Peters et al., 2018) provides a significant boost in performance, achieving 77.2% and 74.0% accuracy on the development and test sets, respectively.
The development-set binary separation with ELMo embeddings reaches 86.1% accuracy and the categorical separation 81.9%. This substantially outperforms all baselines, but still lags behind the oracle experiments (Reference-only and Implicit-only).
As the oracle experiments perform better on the individual Reference and Implicit classes, we experimented with adding an auxiliary objective that predicts the oracle decision (Implicit vs. Reference), realized as an additional loss term. However, this did not yield any performance improvement.
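The auxiliary objective amounts to a weighted extra loss term. The following NumPy sketch illustrates the idea; the weighting hyperparameter `lam` is hypothetical, as the paper does not report the exact formulation:

```python
import numpy as np

def cross_entropy(probs, gold_index):
    """Negative log-probability of the gold class."""
    return -np.log(probs[gold_index])

def combined_loss(head_probs, gold_head, binary_probs, gold_binary, lam=0.5):
    """Head-selection cross-entropy plus a weighted auxiliary term for the
    Implicit-vs-Reference decision. (lam is a hypothetical hyperparameter;
    this variant gave no improvement in the reported experiments.)"""
    return (cross_entropy(head_probs, gold_head)
            + lam * cross_entropy(binary_probs, gold_binary))
```

Both terms would share the same encoder, so the auxiliary head acts purely as a regularizing signal on the shared representations.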
We also experimented with linear models, using features based on previous work on antecedent determination (Ng et al., 2005; Liu et al., 2016), such as the POS tags and dependency labels of the candidate head and whether the head is the noun closest to the anchor. We added features specific to the Implicit category, for example a binarization of the anchor based on its magnitude (e.g. <1, <10, <1600, <2100) and whether another currency mention appears in the text. None of these attempts surpassed 28% accuracy on the development set. For more details on these experiments, see Appendix A.
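A few of these hand-crafted features can be sketched as follows (an illustrative extraction; the exact feature set, bin boundaries aside, and the currency word list are our own assumptions):

```python
def magnitude_bin(value):
    """Binarize the anchor's numeric value using the bins described above,
    chosen to roughly separate small counts, ages and years."""
    for bound in (1, 10, 1600, 2100):
        if value < bound:
            return f"<{bound}"
    return ">=2100"

def linear_features(value, pos_tags, anchor_index, tokens):
    """Extract a small illustrative feature dictionary for a linear model."""
    currency_words = {"$", "dollars", "euros", "pounds"}  # illustrative list
    nouns = [i for i, t in enumerate(pos_tags) if t.startswith("NN")]
    closest = min(nouns, key=lambda i: abs(i - anchor_index)) if nouns else None
    return {
        "magnitude_bin": magnitude_bin(value),
        "has_currency_mention": any(t.lower() in currency_words for t in tokens),
        "closest_noun_pos": pos_tags[closest] if closest is not None else None,
    }
```

Features of this kind capture surface regularities (four-digit numbers as years, currency context) but, as the 28% ceiling suggests, fall well short of the neural models.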

Analysis
The base model's results are relatively low, but improve substantially with the addition of contextualized embeddings. We perform an error analysis on the ELMo version, which highlights the challenges of the task. Figure 5 shows the confusion matrix of our model, and Table 7 lists some errors from the development set.
Pattern-Resolvable Error-cases The first three examples in Table 7 demonstrate error cases that can be solved based on text-internal cues and "complex-pattern-matching" techniques. These can likely be improved with a larger training set or improved neural models.
The errors in Examples 1 and 2 may have been caused by multi-sentence patterns that are absent from the training data. Another possible explanation is a magnitude bias: in Example 1, One at the beginning of a sentence usually refers to PEOPLE, whereas in Example 2, Five is more likely to refer to an AGE.
In Example 3, the model has to consider several cues in the text, such as the phrase "a hundred dollars", which contains the actual head and is of a similar magnitude to the anchor. In addition, the phrase "it was more around" gives a strong hint of a previous reference.

Inference/Common Sense Errors
Another category of errors is those that are less likely to be resolved with pattern-based techniques and more data. These require common sense and/or more sophisticated inferences to get right, and will likely require a more sophisticated family of models to solve.
In Example 4, one refers to dad, but the model chose sisters. These are the only nouns in this example, and in the absence of any obvious pattern, a model needs to understand the semantics of the text to identify the missing head correctly.
Example 5 also requires understanding the semantics of the text, as well as some of its discourse dynamics: a conversation takes place between the two speakers, with Krank replying to L'oncle Irvin, which the model missed.
In Example 6, the model fails to collect the cues in the text that refer to an unmentioned person; the correct answer is PEOPLE, but the model predicts OTHER.
Finally, in Example 7 we observe an interesting case of overfitting, which likely originates from the word-character encoding. As the anchor, 1991, is a four-digit number, and such numbers usually describe YEARs, its representation receives a strong signal for this label, even though the few words preceding it (a shiny new) are unlikely to describe a YEAR.

Related Work
The FH problem has not been directly studied in the NLP literature. However, several works dealt with overlapping components of it. Sense Anaphora The first and most related is the line of work by Gardiner (2003); Ng et al. (2005); Recasens et al. (2016), which dealt with sense-anaphoric pronouns ("Am I a suspect? -you act like one", c.f. Example 4). Sense anaphora, sometimes also referred to as identity-of-sense anaphora, are expressions that inherit the sense from their antecedent but do not denote the same referent (as opposed to coreference). The sense anaphora phenomenon also covers numerals and significantly overlaps with many of our NFH cases. However, it covers neither the Implicit NFH cases nor cases where the target is part of a co-referring expression ("I met Alice and Bob. The two seem to get along well.").
In terms of computational modeling, the sense anaphora task is traditionally split into two subtasks: (i) identifying anaphoric targets and disambiguating their sense; and (ii) resolving the target to an antecedent. Gardiner (2003) and Ng et al. (2005) perform both tasks, but restrict themselves to one-anaphora cases and their noun-phrase antecedents. Recasens et al. (2016), on the other hand, addressed a wider variety of sense anaphors (e.g. one, all, another, few, most, etc., a total of 15 different senses, including numerals). They annotated a corpus of a third of the English OntoNotes (Weischedel et al., 2011) with sense-anaphoric pronouns and their antecedents. Based on this dataset, they introduce a system for distinguishing anaphoric from non-anaphoric usages. However, they do not attempt to resolve any target to its antecedent. The non-anaphoric examples in their work combine both our Implicit class and other non-anaphoric examples indistinguishably, and are therefore not relevant for our work.
In the current work, we restrict ourselves to numbers and so cover only part of the sense-anaphora cases handled in Recasens et al. (2016). However, in the categories we do cover, we do not limit ourselves to anaphoric cases (e.g. Ex. 3, 4) but also include non-anaphoric cases that occur in FH constructions (e.g. Ex. 1, 2) and are interesting in their own right. Furthermore, our models not only identify the anaphoric cases but also attempt to resolve them to their antecedents.
Zero Reference In zero reference, the argument of a predicate is missing but can easily be understood from context (Hangyo et al., 2013). For example, in the sentence "There are two roads to eternity, a straight and narrow φ, and a broad and crooked φ", the omitted elements have a zero-anaphoric relationship to "two roads to eternity" (Iida et al., 2006). This phenomenon is usually discussed in the context of zero pronouns, where what is missing is a pronoun. It occurs mainly in pro-drop languages such as Japanese, Chinese and Italian, but has also been observed in English, mainly in conversational interactions (Oh, 2005). Some, but not all, zero-anaphora cases result in FH or NFH instances. Similarly to FH, the omitted element can appear in the text, as in our Reference definition (zero endophora), or outside of it, as in our Implicit definition (zero exophora). Identification and resolution of this phenomenon has attracted a lot of interest, mainly in Japanese (Nomoto and Nitta, 1993; Hangyo et al., 2013; Iida et al., 2016) and Chinese (Chen and Ng, 2016; Yin et al., 2018a,b), but also in other languages (Ferrández and Peral, 2000; Yeh and Chen, 2001; Han, 2004; Kong and Zhou, 2010; Mihăilă et al., 2010; Kopeć, 2014). However, most of these works considered only the zero endophora phenomenon, and even those that did consider zero exophora (Hangyo et al., 2013) only considered author/reader mentions, e.g. "liking pasta, (φ) eats (φ) every day" (translated from Japanese). In this study, we consider a wider set of possibilities. Furthermore, to the best of our knowledge, we are the first to tackle (a subset of) zero anaphora in English.
Coreference The coreference task is to find, within a document (or multiple documents), all the coreferring spans that form clusters of mentions of the same entity (the anaphoric cases described above). The FH resolution task, apart from the non-anaphoric cases, is to find the correct anaphoric reference of the target span. The span identification component of our task overlaps with the coreference one (see Ng (2010) for a thorough summary of noun-phrase coreference resolution and Sukthanker et al. (2018) for a comparison between coreference and anaphora). Despite the resemblance in span search, the key conceptual distinction is that FH allows the anaphoric span to be non-coreferring. Recent work on coreference resolution (Lee et al., 2017) proposes an end-to-end neural architecture that achieves state-of-the-art performance. The works of Peters et al. (2018) and Zhang et al. (2018) further improve on these scores with pre-training, refined span representations and a biaffine attention model for mention detection and clustering. While these models cannot be applied to the NFH task directly, we propose a solution based on the model of Lee et al. (2017), which we adapt to incorporate the implicit cases.
Ellipsis The most studied type of ellipsis is Verb Phrase Ellipsis (VPE). Although the following refers to this line of work, the task and its resemblance to the NFH task hold for the other types of ellipsis as well (e.g. Gapping (Lakoff and Ross, 1970), Sluicing (John, 1969), Nominal Ellipsis (Lobeck et al., 1995), etc.). VPE is the anaphoric process in which a verbal constituent is partially or totally unexpressed but can be resolved through an antecedent from context (Liu et al., 2016). For example, in the sentence "His wife also works for the paper, as did his father", the verb did represents the verb phrase works for the paper. The VPE resolution task is to detect the target word that creates the ellipsis and the anaphoric verb phrase it stands for. Recent work (Liu et al., 2016; Kenyon-Dean et al., 2016) tackled this problem by dividing it into two main parts: target detection and antecedent identification.
Semantic Graph Representations Several semantic graph representations cover some of the cases we consider. Abstract Meaning Representation (AMR) is a graph-based semantic representation for language (Pareja-Lora et al., 2013). It covers a wide range of concepts and relations. Five of these concepts, year, age, monetary-quantity, time and person, correspond to our implicit classes YEAR, AGE, CURRENCY, TIME and PEOPLE, respectively.
The UCCA semantic representation (Abend and Rappoport, 2013) explicitly marks missing information, including the REFERENCE NFH cases, but not the IMPLICIT ones.

Conclusions
Empty elements are pervasive in text, yet do not receive much research attention. In this work, we tackle a common phenomenon that has not received previous treatment. We introduce the FH identification and resolution tasks and focus on a common and important FH subtype: the NFH. We demonstrate that the NFH is a common phenomenon, covering over 40% of the number appearances in a large dialog-based corpus and a substantial share in other corpora as well (>20%). We create datasets for the NFH identification and resolution tasks, provide an accurate method for identifying NFH constructions, and present a neural baseline for the resolution task. The resolution task proves challenging, requiring further research. We make the code and datasets available to facilitate such research (github.com/yanaiela/num_fh).