Ultra-fine Entity Typing with Indirect Supervision from Natural Language Inference

The task of ultra-fine entity typing (UFET) seeks to predict diverse and free-form words or phrases that describe the appropriate types of entities mentioned in sentences. A key challenge for this task lies in the large number of types and the scarcity of annotated data per type. Existing systems formulate the task as a multi-way classification problem and train directly or distantly supervised classifiers. This causes two issues: (i) the classifiers do not capture the type semantics because types are often converted into indices; (ii) systems developed in this way are limited to predicting within a pre-defined type set, and often fall short of generalizing to types that are rarely seen or unseen in training. This work presents LITE🍻, a new approach that formulates entity typing as a natural language inference (NLI) problem, making use of (i) the indirect supervision from NLI to infer type information meaningfully represented as textual hypotheses and alleviate the data scarcity issue, as well as (ii) a learning-to-rank objective to avoid pre-defining a type set. Experiments show that, with limited training data, LITE obtains state-of-the-art performance on the UFET task. In addition, LITE demonstrates strong generalizability: it not only yields the best results on other fine-grained entity typing benchmarks but, more importantly, a pre-trained LITE system also works well on new data containing unseen types.¹

¹ https://github.com/luka-group/lite

Introduction
Entity typing, inferring the semantic types of the entity mentions in text, is a fundamental and long-lasting research problem in natural language understanding. The resulting type information can help with grounding human language components to real-world concepts (Chandu et al., 2021), and provide valuable prior knowledge for natural language understanding tasks such as entity linking (Ling et al., 2015; Onoe and Durrett, 2020), question answering (Yavuz et al., 2016), and information extraction (Koch et al., 2014). Prior studies have mainly formulated the task as a multi-way classification problem (Wang et al., 2021; Zhang et al., 2019; Chen et al., 2020a; Hu et al., 2020).
However, earlier efforts for entity typing are far from enough for representing real-world scenarios, where types of entities can be extremely diverse. Accordingly, the community has recently paid much attention to more fine-grained modeling of types for entities. One representative work is the Ultra-fine Entity Typing (UFET) benchmark created by Choi et al. (2018). The task seeks to search for the most appropriate types for an entity among over ten thousand free-form type candidates. The drastic increase in the number of types forces us to question whether the multi-way classification framework is still suitable for UFET. In this context, two main issues are noticed from prior work. First, prior studies have not tried to understand the target types, since most classification systems convert all types into indices. Without knowing the semantics of types, it is hard to match an entity mention to a correct type, especially when there is not sufficient annotated data for each type. Second, existing entity typing systems are far behind the desired capability in real-world applications in which any open-form types can appear. Specifically, pre-trained multi-way classifiers cannot recognize types that are unseen in training, especially when there is no reasonable mapping from existing types to unseen type labels, unless the classifiers are retrained to include those new types.
To alleviate the aforementioned challenges, we propose a new learning framework that seeks to enhance ultra-fine entity typing with indirect supervision from natural language inference (NLI) (Dagan et al., 2006). Specifically, our method LITE (Language Inference based Typing of Entities) treats each entity-mentioning sentence as a premise in NLI. Using simple, template-based generation techniques, a candidate type is transformed into a textual description and is treated as the hypothesis in NLI. Based on the premise sentence and a hypothesis description of a candidate type, the entailment score given by an NLI model is regarded as the confidence of the type. On top of the pre-trained NLI model, LITE optimizes a learning-to-rank objective, which aims at scoring hypotheses of positive types higher than the hypotheses of sampled negative types. Finally, the label candidates whose hypotheses obtain scores above a threshold are given as predictions by the model.
Technically, LITE benefits ultra-fine entity typing from three perspectives. First, the inference ability of a pre-trained NLI model can provide effective indirect supervision to improve the prediction of type information. Second, the hypothesis, as a type description, also provides a semantically rich representation of the type, which further benefits few-shot learning with insufficient labeled data. Third, to handle the dependency of type labels in different granularities, we also utilize the inference ability of the NLI model to learn that the finer label hypothesis of an entity mention entails its general label hypothesis. Experimental results on the UFET benchmark (Choi et al., 2018) show that LITE drastically outperforms the recent state-of-the-art (SOTA) systems (Dai et al., 2021; Onoe et al., 2021; Liu et al., 2021) without needing the distantly supervised data that they rely on. In addition, LITE also yields the best performance on traditional (less) fine-grained entity typing tasks. What's more, since we adopt a learning-to-rank objective to optimize the inference ability of LITE rather than classification on a specified label space, it is feasible to apply the trained model across different typing data sets. We therefore test its transferability by training on UFET and evaluating on traditional fine-grained benchmarks, obtaining promising results. We also examine the time efficiency of LITE and discuss the trade-off between training and inference costs in comparison with prior methods.
To summarize, the contributions of our work are threefold. First, to our knowledge, this is the first work that uses NLI formulation and NLI supervision to handle entity typing. As a result, our system is able to keep the labels' semantics and encode the label dependency effectively. Second, our system offers SOTA performance on both ultra-fine entity typing and regular fine-grained typing tasks, being particularly strong at predicting zero-shot and few-shot cases. Finally, we show that our system, once trained, can also work on different test sets which are free to have unseen types.

Related Work
Entity Typing. Traditional entity typing was introduced and thoroughly studied by Ling and Weld (2012). One main challenge that earlier efforts have focused on was to obtain sufficient training data to develop the typing model. To do so, automatic annotation has been commonly used in a series of works (Gillick et al., 2014; Ling and Weld, 2012; Yogatama et al., 2015). Later works were developed for further improvement by modeling the label dependency with a hierarchy-aware loss (Ren et al., 2016; Xu and Barbosa, 2018). External knowledge from knowledge bases has also been introduced to capture the semantic relations or relatedness of type information (Jin et al., 2019; Dai et al., 2019; Obeidat et al., 2019). Ding et al. (2021) adopt prompts to model the relationship between entities and type labels, which is similar to our template-based type description generation. However, their prompts are intended for label generation from masked language models, while our templates realize the supervision from NLI.
More recently, Choi et al. (2018) proposed the ultra-fine entity typing (UFET) task, which involves free-form type labeling to realize an open-domain label space with much more comprehensive coverage of types. As the UFET task poses nontrivial learning and inference problems, several methods have been explored to model the structure of the label space more effectively. Xiong et al. (2019) utilized a graph propagation layer to impose label-relation bias in order to capture type dependencies implicitly. Onoe and Durrett (2019) trained a filtering and relabeling model with the human-annotated data to denoise the automatically generated data for training. Onoe et al. (2021) introduced box embeddings (Vilnis et al., 2018) to represent the dependency among multiple levels of type labels as a topology of axis-aligned hyperrectangles (boxes). To further cope with insufficient training data, Dai et al. (2021) used a pre-trained language model for augmenting (noisy) training data with masked entity generation. Different from their strategy of augmenting training data, our approach generates type descriptions to leverage indirect supervision from NLI, which requires no additional data samples.
Natural Language Inference and Its Applications. Early approaches towards NLI problems were based on studying lexical semantics and syntactic relations (Dagan et al., 2006). Subsequent research then introduced deep-learning methods into this task to capture contextual semantics. Parikh et al. (2016) utilize Bi-LSTM (Hochreiter and Schmidhuber, 1997) to encode the input tokens and use an attention mechanism to capture substructures of input sentences. Most recent works develop end-to-end trained NLI models that leverage pre-trained language models (Devlin et al., 2019; Liu et al., 2019) for sentence pair representation and large learning resources (Bowman et al., 2015; Williams et al., 2018) for training.
Specifically, since pre-trained NLI models benefit generalizable logical inference, current literature has also proposed to leverage NLI models to improve prediction tasks with insufficient training labels, including zero-shot and few-shot text classification (Yin et al., 2019). Shen et al. (2021) adopted RoBERTa-large-MNLI (Liu et al., 2019) to calculate the document similarity for document multi-class classification. Chen et al. (2021) proposed to verify the output of a QA system with NLI models by converting the question and answer into a hypothesis and extracting textual evidence from the reference document as the premise.
Recent works by Yin et al. (2020) and White et al. (2017) are particularly relevant to this topic. They utilize NLI as a unified solver for several text classification tasks, such as co-reference resolution and multiple-choice QA, in a few-shot or fully supervised manner. Our work, in contrast, handles a learning-to-rank objective for inference in a large candidate space, which not only enhances learning under a data-hungry condition, but can also be freely adapted to infer new labels that are unseen in training. Yin et al. (2020) also proposed an approach to transform the co-reference resolution task into an NLI problem, and we modified it as one of our template generation methods, which is discussed in §3.2.

Method
In this section, we introduce the proposed method for (ultra-fine) entity typing with NLI. We start with the preliminary of problem definition and the overview of our NLI-based entity typing framework (§3.1), followed by technical details of type

Contextual Explanation
In this context, career at a company is referring to duration.
Premise: "No one expects a career at a company any more, . . ."
Hypothesis: "In this context, career at a company is referring to duration."

Label Substitution
Musician knows how to make a hip-hop record sound good.
Premise: "He knows how to make a hip-hop record sound good."
Hypothesis: "Musician knows how to make a hip-hop record sound good."

Table 1: Type description instances of three templates. Entity mentions are boldfaced and underlined while label words are only boldfaced.

Preliminaries
Problem Definition. The input of an entity typing task is a sentence s and an entity mention of interest e ∈ s. This task aims at typing e with one or more type labels from the label space L.
For instance, in "Jay is currently working on his Spring 09 collection, which is being sponsored by the YKK Group.", the entity "Jay" should be labeled as person, designer or creator instead of organization or location.
The structure of the label space L can vary. For example, in some benchmarks like OntoNotes (Gillick et al., 2014), labels are provided in canonical form and strictly depend on their ancestor types. In this case, a type label bridge appears as /location/transit/bridge. However, in benchmarks like FIGER (Ling and Weld, 2012), some labels have a dependency with their ancestors while the others are free-form and uncategorized. For instance, the label film is given as /art/film but currency appears as a single word. For our primary task, ultra-fine entity typing, the UFET benchmark (Choi et al., 2018) provides no ontology of the labels, and the label vocabulary consists of free-form words only. In this case, film star and person can appear independently in an annotation set with no dependency information provided.
Overview of LITE. Given a sentence with at least one entity mention, LITE treats the sentence as the premise in NLI, and then learns to type the entity in three consecutive steps (Fig. 1). First, LITE employs a simple, low-cost template-based technique to generate a natural language description for a type candidate. This type description is treated as the hypothesis in NLI. For this step, we explore three different description generation templates (§3.2). Second, to capture label dependency, whether or not the type ontology is provided, LITE consistently generates type descriptions for any ancestors of the original type label for the given sentence and learns their logical dependencies (§3.3). These two steps create positive cases of type descriptions for the entity mention in that sentence. Last, LITE fine-tunes a pre-trained NLI model with a learning-to-rank objective that ranks the positive case(s) over negative-sampled type descriptions according to the entailment score (§3.5). During the inference phase, given another sentence that mentions an entity to be typed, our model predicts the type that leads to the hypothetical type description with the highest entailment score. In this way, LITE can effectively leverage indirect supervision signals of a (pre-trained) NLI model to infer the type information of a mentioned entity.
We hereby describe the technical details of training and inference steps of LITE in the rest of the section.

Type Description Generation
Given each sentence s with an annotated entity mention e, LITE first generates a natural language type description T(a) for the type label annotation a. The description will later act as a hypothesis in NLI. Specifically, we consider several generation techniques to obtain such type descriptions, which are described as follows.
• Taxonomic statement. The first template directly connects the entity mention and the type label with an "is-a" statement, i.e., "[ENTITY] is a [LABEL]".
• Contextual explanation. The second template generates a declarative sentence which adds a context-related connective. The generated type description is in the form of "In this context, [ENTITY] is referring to [LABEL]".
• Label substitution. Yin et al. (2020) proposed to transform the co-reference resolution problem into an NLI problem by replacing the pronoun mentions with candidate entities. Inspired by their transformation, this technique directly replaces the [ENTITY] in the original sentence with [LABEL]. Therefore, the NLI model will treat the modified sentence with a "type mention" as the hypothesis of the original sentence with the entity mention.
As shown in Tab. 1, each template provides a semantically meaningful way to connect the entity with the label. In this way, the inference ability of an NLI model can be leveraged to capture the relationship between entity and label, given the original entity-mentioning sentence as the premise.
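The three templates can be illustrated with a minimal Python sketch. The function names, the trailing period, and the plain substring replacement are assumptions made for illustration only; they are not taken from the released implementation, and details such as choosing "a" versus "an" or locating the mention by character span are deliberately left out.

```python
def taxonomic_statement(entity: str, label: str) -> str:
    """'[ENTITY] is a [LABEL]' (the trailing period is an assumption)."""
    return f"{entity} is a {label}."


def contextual_explanation(entity: str, label: str) -> str:
    """'In this context, [ENTITY] is referring to [LABEL]'."""
    return f"In this context, {entity} is referring to {label}."


def label_substitution(sentence: str, entity: str, label: str) -> str:
    """Replace the first occurrence of the entity mention with the label.

    Assumption: the mention is an exact substring of the sentence.
    """
    if entity not in sentence:
        raise ValueError("entity mention not found in sentence")
    if sentence.startswith(entity):
        # Keep sentence-initial capitalisation, as in the Table 1 example.
        label = label[0].upper() + label[1:]
    return sentence.replace(entity, label, 1)


premise = "He knows how to make a hip-hop record sound good."
print(taxonomic_statement("London", "city"))
print(contextual_explanation("career at a company", "duration"))
print(label_substitution(premise, "He", "musician"))
```

In each case the original sentence serves as the premise and the returned string as the hypothesis passed to the NLI model.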
In addition, we have also tried the automatic template generation method proposed by Gao et al. (2021), which led to the adoption of the contextual explanation template. This technique uses the pre-trained text-to-text Transformer T5 (Raffel et al., 2020) to generate prompt sentences for fine-tuning language models. In our case, T5 mask tokens are added between the sentence, the entity and the label. Since T5 is trained to fill in the blanks within its input, the output tokens can be used as the template for our type description. For example, given the sentence "Anyway, Nell is their new singer, and I would never interrupt her show.", the entity Nell and the annotations (singer, musician, person), we can formulate the input to T5 as "Anyway, Nell is their new singer, and I would never interrupt her show. <X> Nell <Y> singer <Z>". T5 will then fill in the placeholders <X>, <Y>, <Z> and output "...I would never interrupt her show. In fact, Nell is a singer." We observe that most of the templates generated by T5 take the form of a prepositional phrase (e.g., in fact, in this context, in addition) followed by a statement such as "[ENTITY] is a [LABEL]" or "[ENTITY] became [LABEL]". Accordingly, we select the above contextual explanation template, which is the most representative pattern observed in the generations.
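As a rough illustration of how such a T5 query could be assembled, the sketch below builds the input string from the example above. The helper names and the single-space joining are hypothetical. The mapping to `<extra_id_0>`, `<extra_id_1>`, `<extra_id_2>` reflects the sentinel-token naming used by common T5 tokenizers; whether the original experiments used exactly these tokens is an assumption.

```python
# Hypothetical helpers; not part of the released LITE code.
PLACEHOLDERS = ("<X>", "<Y>", "<Z>")
T5_SENTINELS = ("<extra_id_0>", "<extra_id_1>", "<extra_id_2>")


def build_template_query(sentence: str, entity: str, label: str) -> str:
    """Interleave mask placeholders with the sentence, entity and label."""
    x, y, z = PLACEHOLDERS
    return f"{sentence} {x} {entity} {y} {label} {z}"


def to_t5_sentinels(query: str) -> str:
    """Swap the paper-style placeholders for T5-style sentinel tokens."""
    for placeholder, sentinel in zip(PLACEHOLDERS, T5_SENTINELS):
        query = query.replace(placeholder, sentinel)
    return query


sentence = "Anyway, Nell is their new singer, and I would never interrupt her show."
query = build_template_query(sentence, "Nell", "singer")
print(query)
print(to_t5_sentinels(query))
```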
In the training process, we use one of the three templates to generate the hypotheses, and the same template is also used to obtain the candidate hypotheses in inference. According to our preliminary results on the dev set, the taxonomic statement template generally gives better performance than the others under most settings; the analysis is presented in §4.3. Thus, the main experiments are reported for the configuration where LITE uses type descriptions based on the taxonomic statement.

Modeling Label Dependency
The rich entity type vocabulary may form hierarchies that enforce logical dependency among labels of different specificity. Hence, we extend the generation process of type descriptions to better capture such label dependency. In detail, for a specific type label for which LITE has generated a type description, if there are ancestor types, we not only generate descriptions for each of the ancestor types, but also conduct learning among these type descriptions. The descendant type description acts as the premise and the ancestor type description acts as the hypothesis. For instance, in OntoNotes (Gillick et al., 2014) or FIGER (Ling and Weld, 2012), suppose a sentence mentions the entity London, which is labeled as /location/city. If the taxonomic statement based description generation is used, LITE will yield descriptions for both levels of types, i.e., "London is a city" and "London is a location". In such a case, the more fine-grained type description "London is a city" can act as the premise of the more coarse-grained description "London is a location", so as to help capture the dependency between the two labels "city" and "location". Such paired type descriptions are added to training and are captured by the dependency loss L_d described in §3.4.
This technique to capture label dependency can be easily adapted to tasks where a type ontology is unavailable, but each instance is directly annotated with multiple type labels of different specificity. Particularly for the UFET task (Choi et al., 2018), while no ontology is provided for the label space, the task separates the type label vocabulary into different levels of specificity, i.e., general, fine and ultra-fine ones. Since its annotation of an entity in a sentence includes multiple labels of different specificity, we can still utilize the aforementioned dependency modeling method. For instance, an entity Mike Tyson may be simultaneously labeled as person (general), sportsman (fine), and boxer (ultra-fine). Similar to using an ontology, each pair of descendant and ancestor descriptions among the three generations "Mike Tyson is a person", "Mike Tyson is a sportsman" and "Mike Tyson is a boxer" is also added to training.
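The pairing of descendant and ancestor descriptions can be sketched as follows. This is a hedged illustration: the helper names are hypothetical, the taxonomic statement template is assumed, and the labels are assumed to form a single chain ordered from general to specific (real UFET annotations may contain several labels per granularity level).

```python
from itertools import combinations
from typing import List, Tuple


def labels_from_path(label_path: str) -> List[str]:
    """'/location/city' -> ['location', 'city'], ordered general to specific."""
    return [part for part in label_path.split("/") if part]


def dependency_pairs(entity: str, labels: List[str]) -> List[Tuple[str, str]]:
    """Return (premise, hypothesis) pairs: descendant description entails ancestor."""
    descriptions = [f"{entity} is a {label}." for label in labels]
    pairs = []
    for i, j in combinations(range(len(labels)), 2):
        # i < j, so labels[i] is the ancestor and labels[j] the descendant.
        pairs.append((descriptions[j], descriptions[i]))
    return pairs


print(dependency_pairs("London", labels_from_path("/location/city")))
print(dependency_pairs("Mike Tyson", ["person", "sportsman", "boxer"]))
```

For a chain of three labels this produces three premise-hypothesis pairs, one for every descendant-ancestor combination.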

Learning Objective
Let L be the type vocabulary. The learning objective of LITE is to conduct learning-to-rank on top of the NLI model. Given a sentence s with mentioned entity e, we use P to denote all true type labels of e, which may include the original label and any induced ancestor labels as described in §3.3. Then, for each label p ∈ P whose type description is generated as H(p) by one of the techniques in §3.2, the NLI model calculates the entailment score ε(s, H(p)) ∈ [0, 1] for the premise s and hypothesis H(p). Meanwhile, negative sampling randomly selects a false label p′ ∈ L \ P. Following the same procedure above, the entailment score ε(s, H(p′)) is obtained for the premise s and the negative-sample hypothesis H(p′). The margin ranking loss L_t for an annotated training case is then defined as

$$\mathcal{L}_t = \left[ \varepsilon(s, H(p')) - \varepsilon(s, H(p)) + \gamma \right]_+ ,$$

where [x]_+ denotes the positive part of the input x (i.e., max(x, 0)) and γ is a non-negative constant.
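A minimal sketch of the negative sampling and the hinge-style margin ranking loss described here, assuming the loss takes the form max(ε(s, H(p′)) − ε(s, H(p)) + γ, 0). The function names and the margin value 0.1 are illustrative assumptions, and plain floats stand in for the NLI model's entailment scores.

```python
import random
from typing import Iterable


def sample_negative(vocabulary: Iterable[str], true_labels: Iterable[str],
                    rng: random.Random) -> str:
    """Randomly pick a false label from the vocabulary minus the true labels."""
    candidates = sorted(set(vocabulary) - set(true_labels))
    if not candidates:
        raise ValueError("no negative labels available")
    return rng.choice(candidates)


def margin_ranking_loss(pos_score: float, neg_score: float,
                        gamma: float = 0.1) -> float:
    """Positive part of (negative score - positive score + margin)."""
    return max(neg_score - pos_score + gamma, 0.0)


rng = random.Random(0)
negative = sample_negative(["person", "city", "boxer"], ["person"], rng)
print(negative)
print(margin_ranking_loss(pos_score=0.9, neg_score=0.2))  # well separated
print(margin_ranking_loss(pos_score=0.4, neg_score=0.6))  # violates the margin
```

The loss is zero once the positive hypothesis outscores the negative one by at least the margin, so only poorly ranked pairs contribute to the gradient.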
We similarly define a ranking loss to model the label dependency. Still given the above annotated sentence s and the set of all true type labels P, as described in §3.3, for any existing pair of ancestor type p_an and descendant type p_de from P, the training phase also captures the entailment relation between their descriptions. This process regards H(p_de) as the premise and H(p_an) as the hypothesis, and the NLI model therefore yields an entailment score ε(H(p_de), H(p_an)). The label dependency loss L_d is then defined as the following ranking loss

$$\mathcal{L}_d = \left[ \varepsilon(H(p_{de}), H(p'_{an})) - \varepsilon(H(p_{de}), H(p_{an})) + \gamma \right]_+ ,$$

where p′_an is a negative-sampled type label.
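The eventual learning objective is to optimize the following joint loss:

$$\mathcal{L} = \frac{1}{|S|} \sum_{s \in S} \frac{1}{|P_s|} \sum_{p \in P_s} \left( \mathcal{L}_t + \lambda \mathcal{L}_d \right),$$

where L_t is the margin ranking loss, L_d is the label dependency loss, S denotes the dataset containing sentences with typed entities, and P_s denotes the set of true labels on an entity of the sentence instance s. In this way, all annotations of each entity mention are involved in training. Here λ is a non-negative hyper-parameter that controls the influence of dependency modeling.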
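The aggregation can be sketched in a few lines of Python, under the assumption that the joint loss averages the typing loss plus λ times the dependency loss over the true labels of each sentence and then over all sentences. The function name, the input layout, and the λ value are hypothetical; labels without an ancestor are assumed to contribute a dependency loss of zero.

```python
from typing import List, Tuple


def joint_loss(per_sentence_losses: List[List[Tuple[float, float]]],
               lam: float = 0.05) -> float:
    """Average (typing loss + lam * dependency loss) over labels, then sentences.

    per_sentence_losses[i] holds one (typing_loss, dependency_loss) pair
    for every true label of sentence i.
    """
    if not per_sentence_losses:
        raise ValueError("empty dataset")
    total = 0.0
    for label_losses in per_sentence_losses:
        if not label_losses:
            raise ValueError("sentence without true labels")
        inner = sum(l_t + lam * l_d for l_t, l_d in label_losses)
        total += inner / len(label_losses)   # average over the labels of a sentence
    return total / len(per_sentence_losses)  # average over sentences


toy = [[(0.2, 0.0), (0.4, 1.0)], [(0.0, 0.0)]]
print(joint_loss(toy, lam=0.5))
```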

Inference
The inference phase of LITE performs ranking on descriptions for all type labels from the vocabulary. For any given sentence s mentioning an entity e, LITE accordingly generates a type description for each candidate type label. Then, taking the sentence s as the premise, the fine-tuned NLI model ranks the hypothetical type descriptions according to their entailment scores. Finally, LITE selects the type label whose description receives the highest entailment score, or predicts with a threshold of entailment scores in cases where multi-label prediction is required.
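The ranking-then-thresholding procedure can be sketched as below. The scorer is passed in as a plain callable; in practice it would wrap the fine-tuned NLI model, but here a toy lookup table stands in for it, so the scores, function names, and threshold value are all illustrative assumptions.

```python
from typing import Callable, List, Optional, Tuple


def rank_types(sentence: str, entity: str, candidates: List[str],
               entail_score: Callable[[str, str], float]) -> List[Tuple[str, float]]:
    """Score one taxonomic-statement hypothesis per candidate and sort descending."""
    scored = [(label, entail_score(sentence, f"{entity} is a {label}."))
              for label in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def predict_types(ranked: List[Tuple[str, float]],
                  threshold: Optional[float] = None) -> List[str]:
    """Top-1 label if no threshold is given, else all labels scoring above it."""
    if not ranked:
        return []
    if threshold is None:
        return [ranked[0][0]]
    return [label for label, score in ranked if score > threshold]


# Toy stand-in for an NLI model (hypothetical scores).
TOY_SCORES = {"person": 0.95, "designer": 0.80, "location": 0.05}


def toy_entail_score(premise: str, hypothesis: str) -> float:
    for label, score in TOY_SCORES.items():
        if hypothesis.endswith(f" is a {label}."):
            return score
    return 0.0


ranked = rank_types("Jay is currently working on his Spring 09 collection.",
                    "Jay", ["location", "designer", "person"], toy_entail_score)
print(predict_types(ranked))                 # single-label prediction
print(predict_types(ranked, threshold=0.5))  # multi-label prediction
```

Because every candidate label requires one premise-hypothesis pass, the cost of inference grows linearly with the size of the type vocabulary.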

Experiment
In this section, we present the experimental evaluation of the LITE framework, based on both UFET (§4.1) and traditional (less) fine-grained entity typing tasks (§4.2). In addition, we conduct comprehensive ablation studies to understand the effectiveness of the incorporated techniques (§4.3).

Ultra-Fine Entity Typing
We use the UFET benchmark created by Choi et al. (2018) for evaluation. The UFET dataset consists of two parts: (i) human-labeled data (L): 5,994 instances split into train/dev/test by 1:1:1 (1,998 for each); (ii) distant supervision data (D): 5.2M instances that are automatically labeled by linking entities to a KB, and 20M instances generated by headword extraction. We follow the original design of the benchmark to evaluate loose macro-averaged precision (P), recall (R) and F1.
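For readers unfamiliar with the metric, the following is a small sketch of loose macro-averaged precision, recall and F1. It assumes the common convention that precision is averaged over examples with at least one prediction, recall over examples with at least one gold label, and F1 is the harmonic mean of the two averages; the official scorer of the benchmark may differ in such details, and the function name is hypothetical.

```python
from typing import Iterable, List, Tuple


def loose_macro_prf(gold: List[Iterable[str]],
                    pred: List[Iterable[str]]) -> Tuple[float, float, float]:
    """Per-example overlap ratios, averaged across examples."""
    p_sum = r_sum = 0.0
    n_pred = n_gold = 0
    for g, p in zip(gold, pred):
        g_set, p_set = set(g), set(p)
        overlap = len(g_set & p_set)
        if p_set:  # precision only counts examples with predictions
            p_sum += overlap / len(p_set)
            n_pred += 1
        if g_set:  # recall only counts examples with gold labels
            r_sum += overlap / len(g_set)
            n_gold += 1
    precision = p_sum / n_pred if n_pred else 0.0
    recall = r_sum / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = [["person", "singer"], ["city"]]
pred = [["person"], ["city", "location"]]
print(loose_macro_prf(gold, pred))
```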
Training Data. In our approach, the supervision can come from the MNLI data (NLI) (Williams et al., 2018), distant supervision data (D) and the human-labeled data (L). Therefore, we investigate the best combination of training data by exploring the following training pipelines:
• LITE NLI: Pre-train on MNLI, then predict directly, without any tuning on D or L;
• LITE L: Only fine-tune on L;
• LITE NLI+L: Pre-train on MNLI, then fine-tune on L;
• LITE D+L: Pre-train on D, then fine-tune on L;
• LITE NLI+D+L: First pre-train on MNLI, then on D, and finally fine-tune on L.
Model Configurations. Our system is first initialized as RoBERTa-large (Liu et al., 2019) and AdamW (Loshchilov and Hutter, 2018)
• UFET-biLSTM (Choi et al., 2018) represents words using the GloVe embedding (Pennington et al., 2014) and captures semantic information of sentences, entities as well as labels with a bi-LSTM and a character-level CNN. It also learns a type label embedding matrix whose inner product with the context and mention representation is used for classification.
• LabelGCN (Xiong et al., 2019) improves UFET-biLSTM by stacking a GCN layer on the top to capture the latent label dependency.
• LDET (Onoe and Durrett, 2019) applies ELMo embeddings (Peters et al., 2018) for word representation and adopts LSTM as its sentence and mention encoders. Similar to UFET-biLSTM, it learns a matrix to compute the inner product with each input representation for classification. Besides, LDET also trains a filter and relabeler to fix the label inconsistency in the distant supervision training data.
Overall, LITE NLI+L demonstrates SOTA performance over other baselines, outperforming the prior top system MLMET (Dai et al., 2021) with 1.5% absolute improvement on F1. Whereas MLMET builds a multi-way classifier on its newly collected distant supervision data and the human-labeled data, LITE optimizes a textual entailment scheme on the entailment data (i.e., MNLI) and the human-labeled entity typing data. This comparison verifies the effectiveness of using the entailment scheme and the indirect supervision from NLI.
[Table fragment] Hierarchy-Typing (Chen et al., 2020b): 73.0 68.1 83.0 79.8; Box4Types (Onoe and Durrett, 2020): 77.3 70.9 79.4 75.0; DSAM (Hu et al., 2020): 83
The bottom block in Tab. 2 further explores the best combination of available training data. First, training on MNLI alone (i.e., LITE NLI) does not provide promising results. This may be because MNLI by itself does not generalize well to the UFET task. LITE L removes the supervision from NLI as compared to LITE NLI+L, causing a noticeable performance drop. In addition, the comparison between LITE NLI+L and LITE D+L illustrates that the MNLI data, as an out-of-domain resource, provides even more beneficial supervision than the distant annotations. To our knowledge, this is the first work to show that, rather than relying on gathering distant supervision data in the (entity-mentioning context, type) style, it is possible to find more effective supervision from other tasks (e.g., from entailment data) to boost the performance. However, when we incorporate the distant supervision data (D) into LITE NLI+L, the new system LITE NLI+D+L performs worse. We present more detailed analyses in §4.3.
In addition, we also investigate the contribution of label dependency modeling by removing it from LITE NLI+L. As shown in Tab. 2, incorporating label dependency helps improve the recall by a large margin (from 46.6 to 48.9) despite a minor drop in precision, leading to a notable overall improvement in F1.

Fine-grained Entity Typing
In addition to UFET, we are also interested in (i) the effectiveness of LITE on entity typing tasks with much fewer types, and (ii) whether the LITE model learned from the ultra-fine task can be used for inference on other entity typing tasks, which often have unseen types, even without further tuning. To this end, we evaluate LITE on OntoNotes (Gillick et al., 2014) and FIGER (Ling and Weld, 2012), two popular fine-grained entity typing benchmarks.
OntoNotes contains 3.4M automatically labeled entity mentions for training and 11k manually annotated instances that are split into 8k for the dev set and 2k for the test set. Its label space consists of 88 types plus one additional other type. In inference, LITE outputs other if none of the 88 types is scored over the threshold described in §3.5. FIGER contains 2M data samples labeled with 113 types. The dev set and test set include 1,000 and 562 samples respectively. Within its label space, 82 types have a dependency relation with their ancestor or descendant types while the other 30 types are uncategorized free-form words.
Results. Tab. 3 reports baseline results as well as results of two variants of LITE: one is pre-trained on UFET and directly transferred to predict on the two target benchmarks; the other conducts task-specific training on the target benchmark after pre-training on MNLI. The task-specific training variant outperforms the respective prior SOTA on both benchmarks (OntoNotes: 86.4 vs. 85.4 in macro-F1, 80.9 vs. 80.4 in micro-F1; FIGER: 86.7 vs. 84.9 in macro-F1, 83.3 vs. 81.5 in micro-F1).
An interesting advantage of LITE lies in its transferability across benchmarks. Tab. 3 demonstrates that LITE (pre-trained on UFET) offers competitive performance on both OntoNotes and FIGER even with only zero-shot transfer (it even exceeds the "task-specific training" version on OntoNotes). Although there are disjoint type labels between these two datasets and UFET, there exist manually crafted mappings from UFET labels to them (e.g., "musician" to "/person/artist/music"). In this way, traditional multi-way classifiers still work across the datasets after type mapping, though we do not prefer human involvement in real-world applications. To further test the transferability of LITE, a more challenging experimental setting for zero-shot type prediction is conducted and analyzed in §4.3.

Analysis
Through the following analyses, we try to answer the following questions: (i) Why did the distant supervision data not help (as Tab. 2 indicates)? (ii) How effective is each type description template (Tab. 1)? (iii) With the NLI-style formulation and the indirect supervision, does LITE generalize better for zero-shot and few-shot prediction? Is a trained LITE transferable to new benchmarks with unseen types? (iv) On which entity types does our model perform better, and which ones remain challenging? (v) How efficient is LITE?
Distant Supervision Data. As Tab. 2 indicates, adding distant supervision data in LITE NLI+D+L even leads to a drop of 3.2% absolute score in F1 from LITE NLI+L. This is likely because the distant supervision data (D) are overall noisy (Onoe and Durrett, 2019). Tab. 4 lists some frequent and typical problems that exist in D based on entity linking and head-word extraction. In general, they lead to two problems.
On the one hand, a large number of false positive types are introduced. Considering example (a) in Tab. 4, the state Connecticut is labeled as author, cemetery and person. In example (c), hash brown is labeled as brown, turning the concept of food into color. Additionally, the headword method falls short in capturing the semantics. In example (d), number is falsely extracted as the type for a number of short stories because of the preposition "of".
On the other hand, such distant supervision may not comprehensively recall positive types. For instance, examples (b) and (e) are both about the entity "film", where the recalled types are correct. However, in the human-annotated data, the entity "film" may also be labeled as ("film", "art", "movie", "show", "entertainment", "creation"). In this situation, those missed positive types (i.e., "movie", "show", "entertainment" and "creation") will be selected by the negative sampling process of LITE and therefore negatively influence the performance. The comparison between LITE NLI+L and LITE D+L further justifies the superiority of the indirect supervision from NLI over that from the distant supervision data.
Type Description Templates. Tab. 5 reveals how template choices affect the typing performance. The taxonomic statement outperforms the other two under all of the three training settings. The contextual explanation template yields close but slightly worse results, whereas label substitution leads to a more noticeable F1 drop. This may result from the absence of the entity mention in the hypothesis produced by label substitution. For instance, in "Soft eye shields are placed on the babies to protect their eyes.", LITE with label substitution generates related but incorrect type labels such as treatment, attention or tissue.
Few- & Zero-shot Prediction. In §4.2, we discussed transferring LITE trained on UFET to other fine-grained entity typing benchmarks. Nevertheless, since UFET labels still cover the labels of those benchmarks after mapping, we conducted a further experiment in which portions of UFET training labels are randomly filtered out so that 40% of the testing labels are unseen in training. We then investigated the performance of LITE NLI+L on test types that have zero or a few labeled examples in the training set. Fig. 2 shows the results of LITE NLI+L and the strongest baseline, MLMET. Note that while the held-out set of type labels is completely unseen to LITE, the full type vocabulary is still provided to MLMET during its LM-based data augmentation process in this experiment.
As shown in the results, the performance on more frequent labels is, as expected, better than on rare labels. LITE NLI+L outperforms MLMET at all the listed label frequencies, which reveals the strong low-shot prediction performance of our model. In particular, on the extremely challenging zero-shot labels, LITE NLI+L drastically exceeds MLMET (32.9% vs. 10.8% in F1). Hence, NLI-based entity typing succeeds in more reliably representing and inferring rare and unseen entity types.
The main difference between the NLI framework and multi-way classifiers is that NLI makes use of the semantics of the input text as well as the label text; conventional classifiers, however, only model the semantics of the input text. Encoding the semantics on the label side is particularly beneficial when the type set is very large and many types lack training data. When some test labels are filtered out of the training process, LITE still performs well thanks to its inference-based formulation, whereas classifiers (like MLMET) fail to recognize the semantics of unseen labels from their features alone. In this way, LITE maintains high performance when transferring across benchmarks with disjoint type vocabularies.
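The contrast can be made concrete with a minimal sketch of NLI-style typing, in which every candidate label is turned into a textual hypothesis and scored against the sentence. Everything here is a stand-in: `entail_score` represents a fine-tuned NLI model but is implemented as a hand-written lookup table, the threshold of 0.5 is an assumed value, and the premise and scores are invented for illustration.

```python
from typing import Callable, List, Tuple

def rank_types(
    premise: str,
    mention: str,
    labels: List[str],
    entail_score: Callable[[str, str], float],
    threshold: float = 0.5,  # assumed cut-off, not a value from the paper
) -> List[Tuple[str, float]]:
    """Score each label as the hypothesis "[ENTITY] is a [LABEL]" and keep confident ones."""
    scored = [(label, entail_score(premise, f"{mention} is a {label}")) for label in labels]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(label, score) for label, score in scored if score >= threshold]

# Toy stand-in for an NLI model: invented entailment scores keyed by hypothesis text.
TOY_SCORES = {
    "mechanical desks is a desk": 0.95,
    "mechanical desks is a furniture": 0.80,
    "mechanical desks is a person": 0.02,
}

def toy_entail_score(premise: str, hypothesis: str) -> float:
    return TOY_SCORES.get(hypothesis, 0.0)

premise = "He collected mechanical desks from the 18th century."  # invented sentence
# The label list can contain types never seen in training: they are just text.
predictions = rank_types(premise, "mechanical desks", ["person", "furniture", "desk"], toy_entail_score)
print(predictions)
```

Because labels enter the model only as text, adding a new type means adding a string to `labels`, whereas an index-based classifier would need a new output unit and training examples for it.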
Case Study. We randomly sampled 100 labels on which LITE improves over MLMET by at least 50% in F1, and identified the following typical patterns:
• ... 50-meter backstroke gold medal", LITE successfully types her as swimmer in addition to athlete, which is given by MLMET.
• Coreference (20%): In case (b), LITE correctly resolves the pronoun entity it to "apology", but MLMET merely captures the local information "tv network airing" to obtain the label words event, message.
• Hypernym (19%): In case (c), even though there is no mention of furniture in the text, LITE gives a high confidence score to this type, which is a hypernym of mechanical desks. In contrast, MLMET only gets trivial answers such as desk, object.
On the other hand, we also sampled 100 labels on which MLMET performs better, and found that LITE falls short mainly in the following scenarios:
• Multiple nominal words (30%): In sample (d) of Tab. 6, due to the ambiguous meaning of the type hypothesis "basketball and baseball is a basketball", LITE fails to predict the ground-truth label basketball.
• Clause (28%): Instance (e) illustrates a common situation in which clauses are included in the entity mention and the effectiveness of type descriptions is harmed. The clausal information distracts LITE from focusing on the key part of the entity.
Prediction on Different Categories of Entity Mentions. We also investigated the predictions of LITE on three different categories of entity mentions from the UFET test data: named entities, pronouns and nominals. For each category of mentions, we randomly sampled 100 instances, and the performance comparison against MLMET is reported in Tab. 7. According to the results, LITE consistently outperforms MLMET on all three categories, and the improvement on nominal phrases (46.2% vs. 43.5% in F1) is the most significant. This partly aligns with the capability of making inferences based on noun hypernyms, as discussed in the Case Study. Meanwhile, typing on nominals seems to be more challenging than on the other two categories, which, from our observation, is mainly due to two reasons. First, nominal phrases with multiple words are in general more difficult for the language model to capture. Second, nominals are sometimes less concrete than pronouns and named entities, hence LITE also generates more abstract type labels. For example, LITE has labeled the drink in an instance as substance, which is too abstract and is not recognized by human annotators.
Time Efficiency. In general, LITE has a much lower training cost, around 40 hours, than the previous strongest (data-augmentation-based) model MLMET, which requires over 180 hours on the UFET task. During inference, our model takes about 35 seconds per new sentence with a fixed type vocabulary of over 10,000 different labels, while a common multi-way classifier requires only around 0.2 seconds. This large difference in inference cost results from encoding longer texts and from encoding the same text multiple times. Inference can be accelerated by modifying the structure of the encoding model, which will be discussed in §5. However, LITE is much more efficient with a dynamic type vocabulary: it requires almost no re-calculation when new, unmappable labels are added to an existing type set, whereas multi-way classifiers need re-training with an extended classifier every time (e.g., over 180 hours for the previous SOTA).
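The reported costs can be put side by side with a short back-of-the-envelope calculation. The inputs are the approximate figures quoted above ("around", "over", "about"), so the derived ratios are rough estimates only; the constant names are hypothetical, and the per-pair cost assumes one premise-hypothesis pair per label in the vocabulary.

```python
# Approximate figures quoted in the text (treated as point values for illustration).
TRAIN_HOURS_LITE = 40        # "around 40 hours"
TRAIN_HOURS_MLMET = 180      # "over 180 hours"
INFER_SEC_LITE = 35          # "about 35 seconds per new sentence"
INFER_SEC_CLASSIFIER = 0.2   # "around 0.2 seconds"
VOCAB_SIZE = 10_000          # "over 10,000 different labels"

# One premise-hypothesis pair is scored per candidate label.
sec_per_pair = INFER_SEC_LITE / VOCAB_SIZE
inference_slowdown = INFER_SEC_LITE / INFER_SEC_CLASSIFIER
training_speedup = TRAIN_HOURS_MLMET / TRAIN_HOURS_LITE

print(f"per-pair inference cost: about {sec_per_pair * 1000:.1f} ms")
print(f"inference slowdown vs. classifier: about {inference_slowdown:.0f}x")
print(f"training speedup vs. MLMET: about {training_speedup:.1f}x")
```

Since the vocabulary is "over" 10,000 labels, the per-pair figure is an upper estimate: each individual NLI call is cheap, and the total cost comes from repeating it once per label.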

Conclusion and Future Work
We propose a new model, LITE, that leverages indirect supervision from NLI to type entities in text. Through template-based type hypothesis generation, LITE formulates the entity typing task as a language inference task, and the semantically rich hypotheses help remedy the data scarcity problem in the UFET benchmark. In addition, the learning-to-rank objective further helps LITE generalize its predictions across benchmarks with disjoint type sets. Our experimental results show that LITE achieves SOTA performance on UFET, OntoNotes and FIGER, and yields strong performance on zero-shot and few-shot types. LITE pretrained on UFET also shows strong transferability, outperforming SOTA baselines when directly making predictions on OntoNotes and FIGER.
For future research, as mentioned in §4.3, we first plan to investigate ways to accelerate LITE by utilizing a late-binding cross-encoder (Pang et al., 2020) for linear-complexity NLI, and by incorporating high-dimensional indexing techniques such as ball trees in inference. To be specific, the premise and hypotheses can first be encoded separately, and the resulting representations can later be used to evaluate the confidence score of premise-hypothesis representation pairs through a trained network. With little expected loss in performance, LITE can still maintain its strong transferability and zero-shot prediction. In addition, we plan to extend NLI-based indirect supervision to information extraction tasks such as relation extraction and event extraction. Incorporating abstention-awareness (Dhamija et al., 2018) for handling unknown types is another meaningful direction. Besides, Poliak et al. (2018) recast diverse types of reasoning datasets, including NER, relation extraction and sentiment analysis, into the NLI structure, which we plan to incorporate as extra indirect supervision for LITE to further enhance the robustness of entity typing.

Figure 1: Entity typing by LITE with indirect supervision from NLI.
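The late-binding idea named in the future-work discussion (encode the premise and the hypotheses separately, then score the pairs of representations) can be sketched as follows. This is only an illustration of the control flow: `toy_encode` is a hashed bag-of-words stand-in for a language-model encoder, `pair_score` is a sigmoid of a dot product standing in for the trained scoring network, and the example sentence is invented. None of these names come from the paper or from Pang et al. (2020).

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]

def toy_encode(text: str, dim: int = 8) -> Vector:
    """Hypothetical stand-in encoder: hashed bag of words, not a real language model."""
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[sum(map(ord, token)) % dim] += 1.0
    return vec

def pair_score(premise_vec: Vector, hypothesis_vec: Vector) -> float:
    """Stand-in for the trained scoring network: sigmoid of a dot product."""
    dot = sum(p * h for p, h in zip(premise_vec, hypothesis_vec))
    return 1.0 / (1.0 + math.exp(-dot))

def rank_late_binding(
    premise: str,
    hypotheses: List[str],
    encode: Callable[[str], Vector] = toy_encode,
) -> List[Tuple[str, float]]:
    """Encode the premise once, encode each short hypothesis on its own, then score pairs."""
    premise_vec = encode(premise)  # the long text is encoded a single time
    scored = [(h, pair_score(premise_vec, encode(h))) for h in hypotheses]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

premise = "She won the 50-meter backstroke gold medal."  # invented sentence
hypotheses = [f"She is a {label}" for label in ("swimmer", "athlete", "city")]
for hypothesis, score in rank_late_binding(premise, hypotheses):
    print(f"{score:.3f}  {hypothesis}")
```

In a joint cross-encoder the sentence is re-encoded together with every hypothesis, whereas here the sentence encoding is shared across all labels and only the lightweight pairwise scoring is repeated, which is where the expected speed-up would come from.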

Table 2: Results on the ultra-fine entity typing task. LITE series are equipped with the Taxonomic Statement template. "w/o label dependency" is applied to the "NLI+L" setting. The F1 result by LITE NLI+L is statistically significant (p-value < 0.01 in t-test) in comparison with the best baseline result by MLMET.
problems. It uses BERT-large-uncased (Devlin et al., 2019) as the backbone and projects the hidden classification vector to a hyperrectangular (box) space. Each type from the label space is also represented as a box and the
Results. Tab. 2 compares LITE with baselines, in which LITE adopts the taxonomic statement template (i.e., "[ENTITY] is a [LABEL]").

Table 3: Results for fine-grained entity typing. All LITE model results are statistically significant (p-value < 0.05 in t-test) in comparison with the best baseline results by MLMET on OntoNotes and by SEPREM on FIGER.

Table 4: Examples of two sources of distant supervision data (one from entity linking, the other from head word extraction). In the right "Labels" column, correct types are boldfaced while incorrect ones are in grey.
(b) Once Upon Andalasia is a video game based on the film of the same name. Labels: art, film
(c) You can also use them in casseroles and they can be grated and fried if you want to make hash browns.
(d) ... Labels: number
(e) Despite obvious parallels and relationships, video art is not film. Labels: film

Table 5: Behavior of different type description templates under three training settings.

Table 6: Case study of labels on which LITE improves over MLMET or MLMET outperforms LITE. Correct predictions are in blue and * indicates the representative label words for the discussed pattern.

Table 7: Performance comparison of LITE and the prior SOTA, MLMET, on named entities, pronouns and nominal entities respectively.

Figure 2: Performance comparison of our system LITE and the prior SOTA system, MLMET, on the filtered version of UFET for zero-shot and few-shot typing. The zero-shot labels correspond to the 40% of test set type labels that are unseen in training. We also report the performance on other few-shot type labels.