Revisiting Few-shot Relation Classification: Evaluation Data and Classification Schemes

We explore Few-Shot Learning (FSL) for Relation Classification (RC). Focusing on the realistic scenario of FSL, in which a test instance might not belong to any of the target categories (none-of-the-above, aka NOTA), we first revisit the recent popular dataset structure for FSL, pointing out its unrealistic data distribution. To remedy this, we propose a novel methodology for deriving more realistic few-shot test data from available datasets for supervised RC, and apply it to the TACRED dataset. This yields a new challenging benchmark for FSL RC, on which state-of-the-art models show poor performance. Next, we analyze classification schemes within the popular embedding-based nearest-neighbor approach for FSL, with respect to the constraints they impose on the embedding space. Triggered by this analysis, we propose a novel classification scheme in which the NOTA category is represented as learned vectors, shown empirically to be an appealing option for FSL.


Introduction
We consider relation classification (RC), an important sub-task of relation extraction (RE), in which one is interested in determining, given a text with two marked entities, whether the entities conform to one of a set of pre-determined relations or to none of them. While supervised methods for this task exist and work relatively well (Baldini Soares et al., 2019; Zhang et al., 2018; Wang et al., 2016; Miwa and Bansal, 2016), they require large amounts of training data, which is hard to obtain in practice.
We are therefore interested in a data-lean scenario in which users provide only a handful of training examples for each relation they are interested in. This has been formalized in the ML community as Few-Shot Learning (FSL) (§2).
FSL for Relation Classification has been recently addressed by the work of Han et al. (2018) and Gao et al. (2019), who introduced the FewRel 1.0 and, shortly after, the FewRel 2.0 challenges, in which researchers are provided with a large labeled dataset of background relations, and are tasked with producing strong few-shot classifiers: classifiers that will work well given a few labeled examples of relations not seen in the training set. The task became popular, with scores on FewRel 1.0 reaching an accuracy of 93.9% (Baldini Soares et al., 2019), surpassing the human-level performance of 92.2%. Results on FewRel 2.0 are lower, at 80.3% for the best system, but are still very high considering the difficulty of the task.
Figure 1: Distribution of relations across episodes in our newly derived Few-Shot TACRED and in the existing FewRel 2.0 RC task. On the left, we show the relation distribution in Few-Shot TACRED episodes, which follows a real-world distribution. On the right, we show the relation distribution in FewRel 2.0, which is synthetic. The y-axis of both plots is in log scale. The proportion of NOTA in Few-Shot TACRED is 97.5%, while in FewRel 2.0 it is 50%.
Is few-shot relation classification solved? We show that this is far from being the case. We argue that the evaluation protocol in FewRel 1.0 is based on highly unrealistic assumptions on how the models will be used in practice, and while FewRel 2.0 tried to amend it, its evaluation setup remains highly unrealistic (§3.1). Therefore, we propose a methodology to transform supervised datasets into corresponding realistic few-shot evaluation scenarios (§3.2). We then apply our transformation to the supervised TACRED dataset (Zhang et al., 2017) to create such a new few-shot evaluation set (§3.3). Our experiments (§6.2) reveal that, indeed, when moving to this realistic setup, the performance of existing State-Of-The-Art (SOTA) models drops considerably, from scores of around 80 F1 (as well as accuracy) to around 30.
A core factor in a realistic few-shot setup is the NOTA (none-of-the-above) option, which allows for the case where a particular test instance does not conform to any of the predefined target relations. Following an analysis of possible decision rules for handling the NOTA category (§5), we propose a novel enhancement which models NOTA by an explicit set of vectors in the embedding space (§5.2). This explicit "NOTA as vectors" approach achieves new SOTA performance on the FewRel 2.0 dataset, and outperforms other models on our new dataset (§6). Yet, the realistic scenario of our TACRED-derived dataset remains far from being solved, calling for substantial future research. We release our models, data, and, more importantly, our data conversion procedure, to encourage such future work.

Relation Classification
The relation extraction (RE) task takes as input a set of documents and a list of pre-specified relations, and aims to extract tuples of the form $(e_1, e_2, r)$, where $e_1$ and $e_2$ are entities and $r$ is a relation that holds between them ($r$ belongs to a pre-specified list of relations of interest). This task is often approached by a pipeline that generates candidate $(e_1, e_2, s)$ triplets and classifies each one to a relation (or indicates that there is no relation). The classification task from such triplets to an expressed relation is called relation classification (RC). It is often isolated and addressed on its own, and is also the focus of the current work. Zhang et al. (2017) demonstrate that improvements in RC carry over to improvements in RE.
In the RC task, each input $x_i = (e_1, e_2, s)_i$ consists of a sentence $s$ with an ordered pair of marked entities (each entity is a span over $s$), and the output is one of $|R| + 1$ classes, indicating that the entities in $s$ conform to one of the relations in a set $R$ of target relations, or to none of them. We refer to a triplet $x_i$ as a relation instance. For example, if the target relations are $R$ = {Owns, WorksFor}, the relation instance "Wired reports that in a surprising reshuffle at [Microsoft]$_{e_2}$, [Satya Nadella]$_{e_1}$ has taken over as the managing director of the company." should be classified as WorksFor. The same sentence with the entity pair $e_1$ = Satya Nadella and $e_2$ = Wired should be classified as "NoRelation" (NOTA).

The Few-Shot N-Way K-Shot Setup
As supervised datasets are often hard and expensive to obtain, there is a growing interest in the few-shot scenario, where the user is interested in $|R|$ target relations, but can provide only a few labeled instances for each relation. In this work, we follow the increasingly popular N-Way K-Shot setup of Few-Shot Learning (FSL), proposed by Vinyals et al. (2016) and Snell et al. (2017). This setup was adapted to relation classification, resulting in the FewRel and FewRel 2.0 datasets (Han et al., 2018). We further discuss the datasets in §3.
The N-Way K-Shot setup assumes the user is interested in $N$ target relations ($R_{target} = \{c_1, ..., c_N\}$), and has access to $K$ instances (typically few) of each one, called the support set for class $c_j$:
$$\sigma_{c_j} = \{x_1, ..., x_K \mid r(x_i) = c_j\}, \qquad \sigma = \{\sigma_{c_1}, ..., \sigma_{c_N}\}$$
where $r(x)$ is the gold relation of instance $x$; $\sigma_{c_j}$ is the support set for relation $c_j$; and $\sigma$ is the support set for all $N$ relations in $R_{target}$.
A set of target relations and the corresponding support sets is called a scenario. Given a scenario $S = (R_{target}, \sigma)$, our goal is to create a decision function $f_S(x) : x \to R_{target} \cup \{\perp\}$, where $\perp$ indicates "none of the relations in $R_{target}$". Let $X = x_1, ..., x_m$ be a set of instances with corresponding true labels $r(x_1), ..., r(x_m)$. Our aim is to minimize the average cumulative evaluation loss
$$\frac{1}{m} \sum_{i=1}^{m} \ell(f_S(x_i), r(x_i)),$$
where $\ell$ denotes a per-instance loss.
The performance of an N-Way K-Shot FSL algorithm on a dataset $X$ is highly dependent on the specific scenario $S$: both the choice of the $N$ relations that need to be distinguished and the choice of the specific $K$ examples for each relation can greatly influence the results. In a real-life scenario, the user is interested in a specific set of relations and examples, but when developing and evaluating FSL algorithms, we are concerned with the expected performance of a method given an arbitrary set of categories and examples:
$$\mathbb{E}_{S}\left[\frac{1}{m} \sum_{i=1}^{m} \ell(f_S(x_i), r(x_i))\right],$$
which can be approximated by averaging the losses for several random scenarios $S_j$, each varying the relation set and the example set. In a practical evaluation, the number of N-Way K-Shot scenarios that can be considered is limited, relative to the combinatorial number of possible scenarios. To maximize the number of considered scenarios, we re-write the loss to consider expectations also over the data points:
$$\mathbb{E}_{S, (x, r(x))}\left[\ell(f_S(x), r(x))\right].$$
This gives rise to an evaluation protocol that considers the loss over many episodes, where each episode is composed of: (1) a random choice of $N$ target relations $R_{target}$; (2) a corresponding random support set $\sigma = \{\sigma_{c_1}, ..., \sigma_{c_N}\}$ of $N \times K$ instances ($K$ instances in each $\sigma_{c_j}$); and (3) a single randomly chosen labeled example considered as a query, $(x, r(x))$, which does not appear in the support set. To summarize, an evaluation set for N-Way K-Shot FSL is a set of episodes, each consisting of $N$ target relations, $K$ supporting examples for each relation, and a query.
For each episode, the algorithm should classify the query instance to one of the relations in the support set, or none of them.
In practice, the episodes in an evaluation set are obtained by sampling episodes from a labeled dataset. As we discuss in the following section, the specifics of the labeled dataset and the sampling procedure can greatly influence the realism of the evaluation, and the difficulty of the task.
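The episode-based protocol described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Episode` container, the `mean_zero_one_loss` helper, the use of `None` for the NOTA label, and the choice of zero-one loss are all hypothetical and not taken from any released code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence

NOTA = None  # assumption: None stands in for the "none of the above" label


@dataclass
class Episode:
    """One N-Way K-Shot episode (hypothetical container for illustration)."""
    support: Dict[str, List[str]]  # target relation -> its K support instances
    query: str                     # the single query instance
    gold: Optional[str]            # gold relation of the query, or NOTA


def mean_zero_one_loss(
    episodes: Sequence[Episode],
    classify: Callable[[Dict[str, List[str]], str], Optional[str]],
) -> float:
    """Approximate the expected loss by averaging a 0/1 loss over episodes."""
    if not episodes:
        raise ValueError("need at least one episode")
    errors = sum(classify(ep.support, ep.query) != ep.gold for ep in episodes)
    return errors / len(episodes)


# Usage: a trivial baseline that always answers NOTA.
episodes = [
    Episode({"WorksFor": ["s1"], "Owns": ["s2"]}, "q1", "WorksFor"),
    Episode({"WorksFor": ["s1"], "Owns": ["s2"]}, "q2", NOTA),
]
always_nota_loss = mean_zero_one_loss(episodes, lambda support, query: NOTA)
```

Each episode contributes a single query, so averaging over many episodes varies the relation set, the support examples, and the query at the same time.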

Low-resource Relation Classification -Related Work
Other than FSL, several setups for investigating RC under low-resource settings have been proposed. Obamuyide and Vlachos (2019) experimented with limited supervision settings on TACRED. Their setting, however, differs from the transfer-based few-shot setting addressed in our paper. In most of their experiments the number of training instances per relation is much higher, which does not fit the ad-hoc nature of the few-shot setting. Further, they train a model on all classes, and do not address inference on new class types at test time.
Distant supervision is another approach for handling low-resource RC (Mintz et al., 2009). This approach leverages noisy labels for training a model, produced by aligning relation instances to a knowledge-base. Particularly, it considers sentences containing a pair of entities holding a known relation as instances of that relation. For example, a sentence containing the entities 'Barack Obama', and 'Hawaii' will be labeled as an instance of the born_in relation between these entities, even though that sentence might describe, for example, a later visit of Obama to Hawaii.
Finally, another line of work is the Zero-Shot setup, where the RC task is reduced to another inference task, leveraging trained models for that task. Specifically, Levy et al. (2017) proposed a method that leverages reading comprehension models, while Obamuyide and Vlachos (2018) suggest using textual entailment models.

Desired Versus Existing Few-Shot Relation Classification Datasets
An FSL system is intended to be used in a real-life scenario. Thus, evaluation procedures for FSL should attempt to mimic the conditions under which the FSL system will be applied in practice. In a realistic FSL scenario, the user has a set of relations of interest ("target relations"), and can come up with a handful of examples for each. The relations in the set are often related to each other. The user may potentially have access to a labeled dataset of a different set of relations ("background relations"), which they may want to use to train, or to improve, their FSL system. The resulting classifier will then be applied to unlabeled data aiming to detect new target relations, in which, realistically:
(a) some relations are rarer than others.
(b) most instances do not correspond to a target relation.
(c) many instances may not correspond to a background relation either.
(d) relation instances may include named entities, as well as pronouns and common noun entities.
Ideally, the episodes in an FSL evaluation should be chosen in a way that reflects (a)-(d) above. The first characteristic (a) naturally follows from the non-uniform distribution of relation types in a (non-artificial) text collection. The second point (b) stems from the fact that natural text refers to a broad, inherently unbounded range of relation types, while in an RC setting, particularly for FSL, there is typically a restricted set of target relations. Similarly, regarding (c), while available RC training sets (for the supervised setting) may annotate more relation types than in a typical few-shot setting, they still contain a limited number of relation types in comparison to the full range of relations expressed in the corpus. This is prototypically evidenced in the naturally distributed RC dataset TACRED (§3.3), where 78.56% of the labels are NOTA (Table 1). Finally, naturally occurring textual relations may be used to relate named entities as well as common nouns or pronouns (d); therefore, we expect the annotated RC dataset entities to include all such entity types.
As we show below, existing FSL-RC datasets do not conform to these properties, resulting in artificial (and substantially easier) classification tasks. This in turn leads to inflated accuracy numbers that are not reflective of the real potential performance of a system. We propose a refined sampling procedure that adheres to the realistic setting, and results in a substantially more realistic evaluation set, while conforming to the same N-Way K-Shot protocol. As we show in the experiments section (§6), this setup proves to be substantially more challenging for existing algorithms. We propose to use this procedure for future evaluation of FSL-RC algorithms, and release the corresponding code and data.

Existing FSL RC Datasets
An N-Way K-Shot RC dataset, called FewRel 1.0, was introduced by Han et al. (2018). The dataset became popular, yet proved to be rather easy: the current best leaderboard entry, by Baldini Soares et al. (2019), obtains 93.86% accuracy for 5-way 1-shot, above the 92% accuracy of human performance. The dataset was then updated to FewRel 2.0, using an updated episode sampling procedure (see below), with the current best system obtaining a 5-way 1-shot score of 80.31.
Underlying labeled data Both FewRel versions are based on the same underlying labeled dataset containing 100 distinct relations, with 700 instances per relation, totaling 70,000 labeled instances. The sentences are based on Wikipedia, and the entities and relation labels are assigned automatically using Wikidata, followed by a human verification step.
Note that, while the dataset is extensive, each relation type contains the same number of instances, regardless of its actual frequency in a corpus. This results in a highly synthetic dataset, contradicting the realistic assumption (a) above. In contrast, instances in supervised RC datasets such as TACRED and DocRED (Zhang et al., 2017; Yao et al., 2019) do respect the relation distribution in a real corpus.
Finally, FewRel target entities are mostly named entities, and do not include important entity types such as pronouns and common nouns, which are present in supervised RC datasets (including TACRED), thus contradicting assumption (d).
Train/Dev/Test splits The 100 relations are split into three disjoint sets, $R_{train}$, $R_{dev}$, and $R_{test}$, consisting of 64, 16 and 20 relations, respectively. The relations in $R_{train}$ and their corresponding instances are used as the labeled corpus of background relations, while evaluation episodes consist of relations in either $R_{dev}$ or $R_{test}$. We refer to this set (either test or dev) as $R_{eval}$. Each episode consists of a random subset $R_{target} \subset R_{eval}$.
Sampling procedures The episode sampling procedure of FewRel 1.0 works by sampling $N$ relations from $R_{eval}$, resulting in a target set $R_{target}$, sampling a corresponding size-$K$ support set $\sigma_{c_j}$ for each $c_j \in R_{target}$, and then sampling a query example $q$ for which $r(q) \in R_{target}$. That is, the query in each episode is guaranteed to be in $R_{target}$. This setup is artificial, negating realistic condition (b) above, which explains the high performance on FewRel 1.0.
NOTA Following the aforementioned observation, the FewRel 2.0 work introduced a none-of-the-above (NOTA) scenario. Here, after sampling the target relation set $R_{target} \subset R_{eval}$, the query class $r$ is sampled from $R_{target}$ with probability $p$ and from $R_{eval} \setminus R_{target}$ with probability $1 - p$.
That is, a fraction $1 - p$ of the episodes contain a query for which the answer does not correspond to any support set, in which case the answer is NOTA.
While a step in the right direction (indeed, results in this setup drop from over 90% to around 80%), this setup is still highly unrealistic: not only are all the NOTA instances guaranteed to be valid relations, but they also always come from the same small set, contradicting assumption (c). In a realistic setup, we would expect the vast majority of test instances to be NOTA, and the set of NOTA instances to vary greatly: some of them will correspond to relations from the background relations, some of them will correspond to unseen relations, and many will not correspond to any concrete relation. Furthermore, some of the NOTA cases will appear in sentences that do contain a target relation, but between different entities. Supervised relation extraction and relation classification datasets reflect this situation, and we argue that FSL evaluation sets should do so as well.
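The two FewRel sampling procedures discussed above can be sketched as a single function. This is a minimal illustration, not the FewRel code: the function name, argument names, and the use of `None` for NOTA are assumptions, and every relation is assumed to have at least K + 1 instances. Setting `p_in_target=1.0` corresponds to the FewRel 1.0 procedure, where the query class is always a target relation.

```python
import random


def sample_fewrel_style_episode(instances_by_relation, n_way, k_shot,
                                p_in_target, rng):
    """Sample one episode; the query class is in R_target with prob. p."""
    relations = sorted(instances_by_relation)        # R_eval
    targets = rng.sample(relations, n_way)           # R_target
    others = [r for r in relations if r not in targets]

    # Query class: from R_target with probability p, otherwise from
    # R_eval \ R_target (the FewRel 2.0 NOTA case).
    if others and rng.random() >= p_in_target:
        query_relation = rng.choice(others)
    else:
        query_relation = rng.choice(targets)

    support, query = {}, None
    for rel in targets:
        extra = 1 if rel == query_relation else 0
        picked = rng.sample(instances_by_relation[rel], k_shot + extra)
        support[rel] = picked[:k_shot]
        if extra:
            query = picked[k_shot]  # held out, so it is never a support item
    if query is None:
        query = rng.choice(instances_by_relation[query_relation])

    gold = query_relation if query_relation in targets else None  # None = NOTA
    return support, query, gold


# Usage on toy data: 4 relations with 5 instances each, 2-way 3-shot.
toy = {f"rel{i}": [f"rel{i}_sent{j}" for j in range(5)] for i in range(4)}
support, query, gold = sample_fewrel_style_episode(
    toy, n_way=2, k_shot=3, p_in_target=0.5, rng=random.Random(0))
```

Note that in this scheme a NOTA query always carries a valid relation drawn from the small set of evaluation relations, which is exactly the property criticized in the text.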

Better FSL-RC Evaluation Sets
We propose a methodology for transforming a supervised RC dataset into a few-shot RC dataset, while attempting to maintain properties (a)-(d) of the realistic evaluation scenario. This methodology can be applied to existing and future supervised datasets, thus reducing the need to collect new dedicated FSL datasets.

Realistic underlying labeled data
We assume a given supervised dataset, with $C$ categories, divided into train, dev and test sections, where each section contains all $C$ categories, with distinct instances in each section (the typical setting for supervised multi-class classification). Some instances (in all sections) may be labeled with "None-of-the-above" (also known as "other" in the classic supervised setting, or "no relation" in TACRED terminology), hereafter NOTA, meaning these instances do not belong to any of the $C$ categories.
Transformation We transform the supervised dataset into an FSL dataset containing (as in FewRel) a set of background relations for training and a disjoint set of relations for evaluation. To perform this transformation, we begin by choosing $M$ categories as $R_{eval}$. The remaining $C - M$ categories are designated as background relations $R_{train}$. We now keep the same instance-level train/dev/test splits of the original supervised dataset, but relabel the instances in each section: train set instances whose labels are in $R_{train}$ retain their original labels, while all other training instances are labeled as NOTA. The test and dev splits are relabeled similarly, with respect to their own relation sets. This results in sets where each set has distinct labels, but some of the NOTA instances in one set correspond to labels in other sets.
Multiple splits The choice of relations for each set influences the resulting dataset: some relations are more similar to each other than others, and splits that put several similar relations in an eval set are harder than splits in which similar relations are split between the train and eval sets. Moreover, as the number of labeled instances for each relation differs, splitting by relation results in different numbers of train/dev/test instances. We thus repeat the process several times, each time with a distinct set of eval relations.
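The relabeling step of the transformation can be sketched as follows. This is an illustrative sketch only: `relabel_split`, the `(sentence, label)` tuple format, and the example relation names are assumptions made for the example and are not the released conversion code.

```python
def relabel_split(instances, kept_relations, nota_label="NOTA"):
    """Keep labels that belong to this split's relation set; map the rest
    (including the original no-relation label) to NOTA."""
    kept = set(kept_relations)
    return [(text, label if label in kept else nota_label)
            for text, label in instances]


# Usage: the same supervised instances seen from the train side and from
# the eval side of one hypothetical relation split.
supervised = [
    ("sentence 1", "per:title"),
    ("sentence 2", "org:founded"),
    ("sentence 3", "no_relation"),
]
r_train = {"per:title"}    # background relations (hypothetical choice)
r_eval = {"org:founded"}   # evaluation relations (hypothetical choice)

as_train = relabel_split(supervised, r_train)
as_eval = relabel_split(supervised, r_eval)
```

In practice the function would be applied to the train section with the background relations and to the dev and test sections with their own relation sets, once per relation split.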

Realistic episode sampling
To create an episode, we first sample the $N \times K$ instances which constitute the $N$ support sets, as in previous episodic sampling: we sample $N$ out of the $M$ relations, and then sample $K$ instances for each relation from the underlying eval set. However, the query for the episode is then sampled uniformly from all remaining instances in the eval set. If the label of the instance chosen as query is not one of the $N$ target relations in the episode, the query is labeled as NOTA. This query sampling procedure maintains both the label distribution and the NOTA rate of the underlying supervised dataset.
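The sampling procedure above might be sketched as follows, assuming the eval section is a list of `(sentence, label)` pairs that has already been relabeled. The function name and the index-based return format are hypothetical choices made for this illustration, and each eval relation is assumed to have at least K instances.

```python
import random


def sample_realistic_episode(eval_instances, eval_relations, n_way, k_shot,
                             rng, nota_label="NOTA"):
    """Sample support sets by relation, then draw the query uniformly from
    all remaining eval instances (which are mostly NOTA in natural data)."""
    by_relation = {rel: [] for rel in eval_relations}
    for idx, (_, label) in enumerate(eval_instances):
        if label in by_relation:
            by_relation[label].append(idx)

    targets = rng.sample(sorted(by_relation), n_way)
    support = {rel: rng.sample(by_relation[rel], k_shot) for rel in targets}

    used = {i for idxs in support.values() for i in idxs}
    remaining = [i for i in range(len(eval_instances)) if i not in used]
    query_idx = rng.choice(remaining)  # uniform over everything left

    label = eval_instances[query_idx][1]
    gold = label if label in targets else nota_label
    return support, query_idx, gold


# Usage on toy data: 3 eval relations and a large NOTA majority.
labels = ["a"] * 3 + ["b"] * 3 + ["c"] * 3 + ["NOTA"] * 20
toy_eval = [(f"sentence {i}", lab) for i, lab in enumerate(labels)]
support, query_idx, gold = sample_realistic_episode(
    toy_eval, ["a", "b", "c"], n_way=2, k_shot=2, rng=random.Random(1))
```

Because the query is drawn from all remaining instances, an instance of an eval relation that is not among the N targets of the episode is also labeled NOTA, which is why the NOTA rate grows as N shrinks.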

Few-Shot TACRED: Realistic Few-Shot Relation Classification
We apply our transformation methodology to the TACRED Relation Classification dataset (Zhang et al., 2017). The TACRED dataset was collected from a news corpus, with the purpose of extracting relations involving 100 target entities. Accordingly, each sentence containing a mention of one of these target entities was used to generate candidate relation instances for the RC task. The relation label was annotated as one of 41 pre-defined relation categories, when appropriate, or as an additional "no_relation" category. The "no_relation" category corresponds to cases where some other relation type holds between the two arguments, as well as cases in which no relation holds between them; we consider both types of cases as falling under our NOTA category.
We choose $M = 10$ of the 41 relations for the test set, and divide the remaining 31 relations into 25 and 6 for training and development, respectively, and release this split for future research. Table 1 lists the respective number of train/dev/test instances in our Few-Shot TACRED, along with the resulting NOTA rate in the test instances, as well as the corresponding numbers for the original TACRED dataset. As expected, in a typical few-shot setting over natural text (as in Few-Shot TACRED, unlike FewRel), where the number of targeted classes (N-way) is small, most instances correspond to the NOTA case. This is indeed illustrated in Table 1, where the original TACRED dataset includes 41 target classes, vs. 10 in Few-Shot TACRED, and hence has a lower NOTA rate (conversely, in a 5-way setting, the NOTA rate is even higher, see Figure 2).
Evaluation sets For evaluation, we consider sets of 150,000 episodes, sampled according to the procedure above. For robustness, we create 5 evaluation sets of 30,000 episodes each, and report the mean and STD scores over the 5 sets. Figure 1 (shown in §1) presents the distribution differences between Few-Shot TACRED and FewRel 2.0 episodes. As we show in Section 6, the Few-Shot TACRED evaluation set proves to be a substantial challenge for few-shot algorithms.
In the nearest neighbor approach, classification is done via a scoring function $\mathrm{score}(q, c_i)$, which assigns a score for a query instance, $q$, and a target class, $c_i$. Since the class is represented by its support set, $\sigma_{c_i}$, the scoring function can be a similarity function between the query and the class's support set:
$$\mathrm{score}(q, c_i) = \mathrm{sim}(q, \sigma_{c_i})$$
Most often, an embedding-based approach is taken to compute similarity, decomposing the process into two separate components (Snell et al., 2017; Baldini Soares et al., 2019). First, instances are embedded into an explicit, typically dense, vector space, by an embedding function. Then, query-support similarity is measured over embedded vectors. Specifically, the prototypical network of Snell et al. (2017) represents a target class $c_i$ by a class prototype vector $\mu_i$, which is the average embedding of the $K$ instances in the support set of the class. The similarity between the query and each support set, $\mathrm{sim}(q, \sigma_{c_i})$, is then measured as the similarity between the query and the corresponding prototype vector, assuming some similarity function between vectors in the embedding space:
$$\mathrm{sim}(q, \sigma_{c_i}) = \mathrm{sim}(q, \mu_i)$$
This approach was adopted in the state-of-the-art method (Baldini Soares et al., 2019) for few-shot Relation Classification (FewRel 1.0, excluding the NOTA category), as well as by several other works on FSL in NLP (Bao et al., 2020a).
Nearest-neighbor classification rule Similarity is computed between a test instance and each support set, selecting the nearest class:
$$f_S(q) = \arg\max_{c_i \in R_{target}} \mathrm{sim}(q, \sigma_{c_i})$$
Instance representation Baldini Soares et al. (2019) further conducted an empirical analysis of embedding functions for few-shot Relation Classification. Their most effective embedding method augments the sentence with special tokens at the beginning and at the end of each of the two entities of the relation instance. The instance representation is then obtained by concatenating the representations of the two corresponding start tokens from BERT's last layer (Devlin et al., 2019). In our experiments, we adopt this embedding function, denoted BERT$_{EM}$ (BERT-based Entity Marking), as well as the use of dot product as the vector similarity function (after reassessing its effectiveness as well).
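The prototype-based nearest-neighbor rule can be illustrated with a small, dependency-free sketch. The toy two-dimensional vectors below stand in for learned embeddings (they are not BERT-based representations), and the helper names are hypothetical; dot product is used as the similarity function, as in the text.

```python
def dot(u, v):
    """Dot-product similarity between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def prototype(vectors):
    """Class prototype: the average of the K support embeddings."""
    k = len(vectors)
    return [sum(column) / k for column in zip(*vectors)]


def nearest_class(query_vec, support_vecs):
    """Return the class whose prototype is most similar to the query,
    together with the per-class similarity scores."""
    scores = {rel: dot(query_vec, prototype(vecs))
              for rel, vecs in support_vecs.items()}
    return max(scores, key=scores.get), scores


# Usage: a 2-way 2-shot episode with toy embeddings.
support_vecs = {
    "Owns": [[1.0, 0.0], [0.8, 0.2]],
    "WorksFor": [[0.0, 1.0], [0.2, 0.8]],
}
predicted, scores = nearest_class([0.9, 0.1], support_vecs)
```

Note that this rule always returns one of the target classes, so on its own it has no way of answering NOTA.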

FewRel 2.0 BERT sentence-pair model
The FewRel 2.0 work presented a model for the NOTA setting, which skips the embedding learning phase. Instead, it utilizes the embeddings-based next sentence prediction score of BERT (Devlin et al., 2019) as the similarity score between a query and each support set instance. Then, similarly to the approach described above, a nearest-neighbor criterion is applied over the average similarity score between the query and all support instances of each class. A parallel scoring mechanism is implemented to decide whether the NOTA category should be chosen.

Related FSL Classification Models
In this section we first review some prominent Few-Shot Learning work addressing other machine-learning tasks. Additionally, we compare the notions of Out-Of-Domain detection and NOTA detection.
In a recent work on FSL, Tseng et al. (2020) aim to improve generalization abilities by providing supervision for the category transfer phase. In their learning setting, the classes of each training episode are divided into two subsets: the first acts as the "typical" training set while the second simulates the test set. To improve generalization, they add an additional encoding layer which is optimized to maximize performance on the simulated test categories.
Another recent FSL work, addressing text classification, suggests weighting words by their frequency over the training set (Bao et al., 2020b). The model employs two components to classify the given text into one of the episode's categories. The first component computes the inverse frequency of each support set token over the training set. The second component estimates the inductive level of support set tokens with respect to classification. Finally, the output of these two components is used to train a linear classifier, by which the query is classified.
Out-Of-Domain detection The essence of the NOTA category resembles Out-Of-Domain (OOD) detection, as in both cases the goal is to detect instances not falling under the known categories. Tan et al. (2019) define the OOD classes as the set of all classes which were not part of the training classes (vs. NOTA, which means that none of the given support classes in an episode is present). In their work, the authors suggest a representation learning approach for OOD detection in text classification. Their method combines hinge loss with the classic cross-entropy loss function. The former is used to push away the representation of the OOD instances, while the latter is used to learn correct classification within the in-domain classes.

Classification Rules: Analysis and Extension
In this section, we provide an analytic perspective on the bias that different nearest-neighbor classification rules impose on the learned embedding space. We start with an analysis of the classification rule for the basic few-shot RC setting, without the NOTA category, as was applied in prior work (Section 4). This analysis directly follows the constraint presented in the influential work of Weinberger and Saul (2009), and utilized in subsequent work (e.g., Shen et al., 2010; Dhillon et al., 2010). We then extend this analysis to the setting which does include the NOTA category. First, we analyze the straightforward threshold-based approach for this setting. Then, inspired by this analysis, we propose an alternative approach, with a corresponding constraint, which represents the NOTA category by one or more explicit learned vectors. As shown in subsequent sections, this new approach performs consistently better than other methods on both the FewRel 2.0 and our new Few-Shot TACRED benchmarks, and is thus suggested as an appealing approach for few-shot Relation Classification.

Constraints Imposed by Nearest-neighbor Classification
Classification without NOTA As described earlier, the nearest neighbor approach assigns a query instance to the class of its nearest support set. We start our analysis by adapting inequality (10) in Weinberger and Saul (2009), which was introduced to formulate the training goal for metric learning in k-nearest neighbor classification. To this end, we adapt the original inequality to our nearest-neighbor few-shot classification setting (Section 4). The obtained inequality below specifies the necessary and sufficient constraints that the embedding space, along with the similarity function over it, should satisfy in order to reach perfect classification over all possible episodes in a given dataset (see footnote 6). For every possible query instance $q$, a support set $\sigma_{r(q)}$ from the same class as $q$, and a support set $\sigma_{\neg r(q)}$ for a different class, the following constraint should hold:
$$\forall\, q, \sigma_{r(q)}, \sigma_{\neg r(q)}: \quad \mathrm{sim}(q, \sigma_{r(q)}) > \mathrm{sim}(q, \sigma_{\neg r(q)}) \qquad (1)$$
That is, to achieve perfect classification, each possible relation instance $q$ imposes that support sets of different classes should be positioned further away from it (being less similar) than the most distant support set it might have from its own class. Generally speaking, the nearest neighbor classification rule implies that instances that are rather close to their class mates may also be rather close to other classes, while instances that are far from their class mates should also be positioned at least as far from other classes.
In the few-shot setting, the embedding function is learned during training, over the training categories. As the learning process tries to optimize classification on the training set, it effectively attempts to learn an embedding function that would satisfy the above constraint as much as possible. Indeed, we often observed almost perfect performance over the training data, indicating that, for the training instances, this constraint is mostly satisfied by the learned embedding function. Yet, while it is hoped that the embedding function would also properly separate instances of new, previously unseen classes, in practice this holds to a lesser degree, as indicated by lower test performance.
Footnote 6: Notice that we drop the margin element in the adapted inequality, as it is not needed for the analytic purpose of our constraint.
Thresholded classification with NOTA When the NOTA option is present, the nearest neighbor classification rule can be naturally augmented by assigning the NOTA category to test queries whose similarity to all of the target classes does not surpass a predetermined (possibly learned) threshold, $\theta$. Extending our analysis to such a classification rule, to achieve perfect classification, the embedding space must fulfil the following necessary and sufficient constraint, whose left-hand side is relevant only for episodes that include a support set for the query's class:
$$\forall\, q, \sigma_{r(q)}, \sigma_{\neg r(q)}: \quad \mathrm{sim}(q, \sigma_{r(q)}) > \theta > \mathrm{sim}(q, \sigma_{\neg r(q)}) \qquad (2)$$
Since the same threshold is applied to all queries, to achieve perfect classification in this setting $\theta$ should be smaller than all within-class similarities, for any possible pair of a query $q$ and a support set of its class $\sigma_{r(q)}$. Concurrently, it should be larger than all cross-class similarities, for any possible query $q$ and a support set of a different class $\sigma_{\neg r(q)}$.
We observe that Inequality (2) imposes a global constraint over the embedding space. It implies that the degree to which all classes should be separated from each other is imposed, globally, by those queries in the entire space which are the furthest away from their own class support sets. Accordingly, it requires all classes to be positioned equally far from each other, regardless of their own "compactness". This is a much harsher constraint, and a greater challenge for the embedding learning, than Inequality (1), which allows certain classes to be nearer if their within-class similarities are high.
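The thresholded rule can be sketched as below. It is a minimal illustration that assumes the per-class similarity scores have already been computed (for example, as dot products between the query and the class prototypes); the function name and the example values are hypothetical.

```python
def classify_with_threshold(scores, theta, nota_label="NOTA"):
    """Pick the most similar target class, unless even its similarity does
    not surpass the single global threshold theta."""
    best = max(scores, key=scores.get)
    return best if scores[best] > theta else nota_label


# Usage: the same scores under two different global thresholds.
scores = {"Owns": 0.82, "WorksFor": 0.18}
loose = classify_with_threshold(scores, theta=0.5)
strict = classify_with_threshold(scores, theta=0.9)
```

The same value of `theta` is applied to every query, which is the source of the global constraint expressed by Inequality (2).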

NOTA As a Vector (NAV)
Motivated by the last observation, we propose an alternative classification approach for few-shot classification with the NOTA category. In this approach, we represent the NOTA category by an explicit vector in the embedding space, denoted $V_N$, which is learned during training. At test time, the similarity between the query $q$ and this vector, $\mathrm{sim}(q, V_N)$, is computed and regarded as the similarity between the query and the NOTA category:

$$\mathrm{sim}(q, \mathrm{NOTA}) = \mathrm{sim}(q, V_N)$$

Then, $q$ is assigned to its nearest class, by the usual nearest-neighbor classification rule. Thus, the NOTA class is selected if $\mathrm{sim}(q, V_N)$ is higher than $q$'s similarity to all target classes. Effectively, this mechanism considers an individual NOTA classification threshold for each query, namely $\mathrm{sim}(q, V_N)$, which depends on $q$'s position in the embedding space relative to $V_N$. We term this approach "NOTA As a Vector" (NAV).
Classification under the NAV scheme implies the following constraint on the embedding space, considering perfect classification: 8

$$\forall\, q, \sigma_{r(q)}, \sigma_{\neg r(q)}: \quad \mathrm{sim}(q, \sigma_{r(q)}) > \mathrm{sim}(q, V_N) > \mathrm{sim}(q, \sigma_{\neg r(q)}) \tag{3}$$

This constraint implies that, to achieve perfect classification, the similarity between a query and the NOTA vector $V_N$ should be smaller than $q$'s similarity to all possible support sets of its own class, while being larger than its similarity to all support sets of other classes. In comparison to the prior classification rules, this approach does allow instances that are rather close to their class mates to be closer to other classes than instances that are positioned further from some of their class mates, similarly to the lighter constraint in Inequality (1). Yet, to enable such "geometry" of the embedding space, it is also required that instances would be positioned appropriately relative to the NOTA vector, in a way that satisfies the two constraints in Inequality (3). Using the NAV approach, it is hoped that the learning process would position the NOTA vector, and adjust the embedding parameters, such that these constraints would be mostly satisfied.

Overall, the NAV approach imposes different constraints on the similarity space than using a single global classification threshold for the NOTA category (as in Inequality (2)), and it is not clear a priori which approach would be more effective to learn. This question is investigated empirically in Section 6.

Multiple NOTA vectors
A natural extension of the NAV approach, denoted as MNAV, is to represent the NOTA category by multiple vectors, whose number is an empirically tuned hyper-parameter. During classification, the model picks the NOTA vector closest to the query as $V_N$, which accordingly defines $\mathrm{sim}(q, \mathrm{NOTA})$. Then, classification is determined as in the NAV method. Adding multiple NOTA vectors is expected to effectively ease the embedding space constraints.
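The NAV and MNAV decision rules can be sketched as follows, with NAV being the special case of a single NOTA vector. This is a minimal illustration that assumes dot-product similarity and precomputed class prototypes; the relation names and all vectors are invented for the example, and ties are resolved in favor of a target class because NOTA is selected only when its score is strictly higher.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]
NOTA = "NOTA"


def dot(u: Vector, v: Vector) -> float:
    """Dot-product similarity between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def classify_mnav(query: Vector, prototypes: Dict[str, Vector],
                  nota_vectors: List[Vector]) -> str:
    """Nearest-class rule where NOTA is scored by its closest learned vector."""
    scores = {name: dot(query, proto) for name, proto in prototypes.items()}
    # The most similar NOTA vector acts as V_N, i.e. a per-query threshold.
    scores[NOTA] = max(dot(query, v) for v in nota_vectors)
    return max(scores, key=scores.get)


# Hypothetical prototypes and NOTA vectors in a 2-d embedding space.
prototypes = {"born_in": [1.0, 0.0], "works_for": [0.0, 1.0]}
nota_vectors = [[-1.0, -1.0], [0.6, 0.6]]

print(classify_mnav([0.9, 0.1], prototypes, nota_vectors))    # close to one prototype
print(classify_mnav([0.4, 0.4], prototypes, nota_vectors))    # between classes
print(classify_mnav([-0.5, -0.2], prototypes, nota_vectors))  # far from both classes
```

The second and third queries are caught by different NOTA vectors, which is the intuition behind using more than one.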

Training Procedure
For training, we use the same episode sampling procedure that generated the dev/test sets, but where the target relations are sampled from a set of train relations, disjoint from the dev/test relations. We define an epoch to include a fixed number of episodes, considered a tuned hyper-parameter, independently sampling episodes for each epoch. We measure dev set performance after each epoch, and use early stopping.

For each episode $E = (R_{target}, \{\sigma_{c_1}, \ldots, \sigma_{c_N}\}, q)$, we encode the query using the BERT_EM encoding function (Baldini Soares et al., 2019), described in §4, $\mathbf{q} = \mathrm{BERT}_{EM}(q)$, and similarly for each item $x$ in each support set, obtaining for each $\sigma_{c_j}$ the corresponding average prototype vector:

$$\mu_j = \frac{1}{|\sigma_{c_j}|} \sum_{x \in \sigma_{c_j}} \mathrm{BERT}_{EM}(x)$$

We define the prototype of the NOTA class to be the learned NAV vector: $\mu_\perp = V_N$. Our loss term for each episode considers $\mathbf{q}$ and the prototype vectors $\mu_i$ and tries to optimize Inequality (3): $\mathrm{dot}(\mathbf{q}, \mu_{r(q)}) > \mathrm{dot}(\mathbf{q}, \mu_\perp) > \mathrm{dot}(\mathbf{q}, \mu_{\neg r(q)})$. Concretely, we use cross-entropy loss, as used in previous work (Baldini Soares et al., 2019):

$$-\log \frac{e^{\mathrm{dot}(\mathbf{q}, \mu_{r(q)})}}{\sum_{i \in R_{target} \cup \{\perp\}} e^{\mathrm{dot}(\mathbf{q}, \mu_i)}}$$

Note that this works towards satisfying the conditions in Inequality (3): in episodes where $r(q) \neq \perp$, the loss attempts to increase the first term in Inequality (3) (the similarity between the query and the prototypical vector of its class), while decreasing the similarity of the two other terms (the similarity between $q$ and all other prototypical vectors, including the NAV one). In particular, it drives towards satisfying $\mathrm{sim}(q, \sigma_{r(q)}) > \mathrm{sim}(q, V_N)$. In episodes where $r(q) = \perp$, the loss increases the second term, decreasing the similarity in the third term, driving towards satisfying $\mathrm{sim}(q, V_N) > \mathrm{sim}(q, \sigma_{\neg r(q)})$. Analogously, the same dynamics apply when the learned (scalar) threshold value determines the NOTA score.
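The per-episode cross-entropy term can be sketched in plain Python as below. This is only an illustration of the formula, not the training code: it takes already-computed embeddings as lists of floats, and it assumes that with several NOTA vectors the NOTA score is the maximum similarity over them, mirroring the test-time rule. All names and values are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]
NOTA = "NOTA"


def dot(u: Vector, v: Vector) -> float:
    """Dot-product similarity between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def episode_loss(query: Vector, prototypes: Dict[str, Vector],
                 nota_vectors: List[Vector], gold: str) -> float:
    """Cross-entropy over the target prototypes plus the NOTA score.

    `gold` is either a key of `prototypes` or NOTA.
    """
    scores = {name: dot(query, proto) for name, proto in prototypes.items()}
    scores[NOTA] = max(dot(query, v) for v in nota_vectors)  # assumed MNAV scoring
    top = max(scores.values())
    # Log-sum-exp with the maximum subtracted for numerical stability.
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores.values()))
    return log_norm - scores[gold]


prototypes = {"born_in": [2.0, 0.0], "works_for": [0.0, 2.0]}
nota_vectors = [[0.5, 0.5]]
query = [1.0, 0.0]

print(episode_loss(query, prototypes, nota_vectors, gold="born_in"))
print(episode_loss(query, prototypes, nota_vectors, gold=NOTA))
```

For this query the loss is lower when the gold label is the nearby class than when it is NOTA, so minimizing it pushes the query either towards its own prototype or towards the NOTA vector, depending on the label.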
Following Weinberger and Saul (2009), who derived a triplet loss objective, and similar to subsequent lines of work (e.g., Schroff et al., 2015; Hoffer and Ailon, 2015; Ming et al., 2017), we also experimented with adapted versions of triplet loss. Under this objective, instances not belonging to the same class are pushed away while same-class instances are pulled together, aiming to reach the desired ordering as in Inequalities (2) and (3). We tried multiple variants of this objective for FSL training, including objective versions with a margin element, but these experiments resulted in consistently lower results than the methods described above.
NOTA vectors initialization For the NAV method, we straightforwardly initialized the single NOTA vector randomly. With random initialization of the multiple NOTA vectors in MNAV, however, training converged to a single vector being dominantly picked as the NAV vector by the MNAV decision process. Consequently, results were very similar to the (single vector) NAV model. Presumably, this happened because a single random vector turned out to be closest to the sub-space initially populated by the pre-trained BERT_EM embedding function. To avoid this, we wish to scatter all the initial vectors within the initially populated subspace. To this end, we initialize a NOTA vector by sampling a relation and then averaging 10 random instances from that relation. We repeat this process for each NOTA vector.
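The initialization procedure described above can be sketched as follows. The sketch assumes that instance embeddings have already been computed with the pre-trained encoder and are grouped by relation in a dictionary; the function name, the `embeddings_by_relation` structure, and the fixed seed are illustrative choices, not details given in the text.

```python
import random
from typing import Dict, List


def init_nota_vectors(embeddings_by_relation: Dict[str, List[List[float]]],
                      num_vectors: int, k: int = 10, seed: int = 0) -> List[List[float]]:
    """Initialize each NOTA vector as the mean of k random instances of one sampled relation."""
    rng = random.Random(seed)
    relations = sorted(embeddings_by_relation)
    vectors = []
    for _ in range(num_vectors):
        relation = rng.choice(relations)            # sample a relation
        pool = embeddings_by_relation[relation]
        sample = rng.sample(pool, min(k, len(pool)))  # up to k random instances
        vectors.append([sum(column) / len(sample) for column in zip(*sample)])
    return vectors


# Toy data: two relations whose instances all share one embedding each.
toy = {"r1": [[1.0, 2.0]] * 12, "r2": [[3.0, 4.0]] * 12}
print(init_nota_vectors(toy, num_vectors=3))
```

Because every initial vector is an average of real instance embeddings, it starts inside the region that the encoder already populates rather than at an arbitrary random point.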

Experiments and Results
In this section, we assess our two main contributions. With respect to our Few-Shot TACRED dataset, we show that models that perform well on FewRel 2.0 perform poorly on this much more realistic setting, leaving a huge gap for improvement by future research. With respect to our proposed NOTA As Vectors modeling approach, we show that it is a viable, and advantageous, alternative to the threshold approach.
Implemented models We conduct our investigation in the framework of the common embedding-based approach to FSL, with respect to the MNAV, NAV, and threshold-based methods described in §5. These methods are implemented following the best-performing embedding and similarity methods identified for the state-of-the-art method on FewRel 1.0 (Baldini Soares et al., 2019), namely BERT_EM applied using BERT_BASE, and dot product similarity (§4). In addition, we train and evaluate the baseline Sentence-Pair model, described in §4.1.
To select the number of NOTA vectors in the MNAV model, we experimented with 5 different values, ranging from 1 to 20. In practice, the choice of the number of vectors had rather little impact on the results (less than one F1 point). We use the best performing value for this hyperparameter, which was 20.
In terms of memory utilization, as 5-way 5-shot episodes require feeding the 25 instances of the support set in addition to the query instances into BERT simultaneously, they often occupy nearly the entire 32GB of GPU memory. To leverage the memory taken by the support set instances, we include as many queries as we can fit into the GPU's memory. In our experiments, we construct 3 episodes for each sampled support set (by sampling 3 different queries for it), which fully utilizes the GPU capacity. Since these episodes occupy the entire GPU memory, we use a single episode per batch.
We further note that it may be possible to perform the N-way classification by transforming it into a pair-wise classification, repeated N times (both in training and evaluation). This technique would reduce memory usage but would increase run-time. As we managed to fit the entire episode into our GPU memory, we followed the standard N-way approach, for faster computation, as was done in previous work.
Test methodology and metrics Like prior work, evaluation is conducted over randomly sampled episodes from the test data, as described in §2. Prior results for FewRel 2.0 (and FewRel 1.0) were reported in terms of accuracy. However, in realistic, highly imbalanced relation classification datasets, like our Few-Shot TACRED, accuracy becomes meaningless. Hence, we propose micro F1 over the target relations as a more appropriate measure for future research. Accordingly, we report micro F1 for both datasets, as well as accuracy for the FewRel experiments, for compatibility. For both measures we report average values and standard deviation over 5 different random samples of episodes (Zhang et al., 2018, 2017). In all experiments, we train and evaluate five models and report the results of the median performing model. Unless otherwise mentioned, reported result differences are significant under a one-tailed t-test at the 0.05 significance level.
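The contrast between accuracy and micro F1 over the target relations can be seen in a small sketch. It assumes that NOTA predictions and NOTA gold labels are simply excluded from the positive counts, which is one common way to compute micro F1 over target classes; the label names and the toy episode mix are hypothetical.

```python
from typing import List


def micro_f1(gold: List[str], pred: List[str], nota: str = "NOTA") -> float:
    """Micro-averaged F1 over target relations, treating NOTA as the negative class."""
    true_pos = sum(1 for g, p in zip(gold, pred) if g == p and g != nota)
    pred_pos = sum(1 for p in pred if p != nota)
    gold_pos = sum(1 for g in gold if g != nota)
    precision = true_pos / pred_pos if pred_pos else 0.0
    recall = true_pos / gold_pos if gold_pos else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def accuracy(gold: List[str], pred: List[str]) -> float:
    """Fraction of queries whose predicted label equals the gold label."""
    return sum(1 for g, p in zip(gold, pred) if g == p) / len(gold)


# 40 toy queries with a 97.5% NOTA rate and a model that always answers NOTA.
gold = ["NOTA"] * 39 + ["born_in"]
pred = ["NOTA"] * 40
print(accuracy(gold, pred), micro_f1(gold, pred))
```

A degenerate model that always predicts NOTA reaches 97.5% accuracy on this toy sample while its micro F1 over the target relations is zero, which illustrates why accuracy is uninformative at realistic NOTA rates.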

FewRel 2.0 Results
We first confirm the appropriateness of our investigation by comparing performance on the prior FewRel 2.0 test data. Table 2 presents the figures on the two official (synthetic) test NOTA rates for this benchmark. We use 50% NOTA rate to train all our models, with 6,000 episodes per epoch. As shown, the MNAV model performs best across all FewRel settings, obtaining a new SOTA for this task. 9 We next turn to a more comprehensive comparison of the investigated embedding-based few-shot models, namely threshold-based, NAV, and MNAV, over the publicly available FewRel development set, with 50% NOTA rate. The results in Table 3 show that, here as well, the MNAV model outperforms the others in both settings. The gap between MNAV and the threshold model is significant for the two settings, while the gap relative to the NAV model is significant only in the 5-shot setting.

Few-Shot TACRED Results
We compare the MNAV, NAV, Sentence-Pair and threshold-based models over our more realistic Few-Shot TACRED test set (here, epoch size is 2000). As seen in Table 4, the MNAV model outperforms the others, as was the case over FewRel 2.0. Notably, performance is drastically lower over Few-Shot TACRED. We suggest that this indicates the much more challenging nature of a realistic setting, relative to the FewRel 2.0 setting, while indicating the limitation of all current models. We further analyze this performance gap in the next section.

9 Our MNAV results are also reported at the official FewRel 2.0 leader-board, as Anonymous Cat, at https://thunlp.github.io/2/fewrel2_nota.html. We note that the FewRel test set is kept hidden, where models are submitted to the FewRel authors, who produce (only) accuracy scores.

Differentiating characteristics of FewRel vs. Few-Shot TACRED
As seen in Tables 3 vs 4, the results on Few-Shot TACRED are drastically lower than those obtained for FewRel 2.0, by at least 50 points. Yet, the performance figures are difficult to compare due to several differences between the datasets, including training size, NOTA rate, and different entity types.
To analyze the possible impact of these differences, we control for each of them and observe performance differences. For brevity, we focus on the MNAV model (1-shot and 5-shot).
Training size We train the model on FewRel 2.0, taking the same number of training instances as in Few-Shot TACRED. Compared to full training, results dropped by five micro F1 points in the 1-shot setting and by 1.5 points for 5-shot, suggesting that the training size explains only a small portion of the performance gap between the two datasets.
NOTA rates We control for the unrealistic NOTA rate in FewRel 2.0 by training and evaluating our model on higher NOTA rates. The results in Figure 2 indicate that realistic higher NOTA rates are indeed much more challenging: moving from the original FewRel 50% NOTA rate to the 97.5% rate as in Few-Shot TACRED shrank the performance gap by 33 points in the 1-shot setting and by 35 for 5-shot.

Entity types In this experiment, we evaluate performance differences when including all entity types (named entities, common nouns and pronouns), as in Few-Shot TACRED, versus including only named entities, as in FewRel. To this end, we sampled two corresponding subsets of relation instances from Few-Shot TACRED, of the same size, with either all entity types or named entities only. 10 Further, we control for the distributions of relation types in the two subsets, making them equal, since, as discussed in Section 3, this distribution impacts performance in RC datasets. Apparently, the impact of entity composition was different in the 1-shot and 5-shot settings. For 1-shot, the named entities subset yielded slightly lower performance (6.65 vs. 9.03 micro F1), which is hard to interpret. For 5-shot, performance on the named entities subset was substantially higher than when including all entity types (33.48 vs. 18.74), possibly suggesting that a larger diversity of entity types is more challenging for the model. In any case, we argue that RC datasets should include all entity types, to reflect real-world corpora.
Summary Overall, the differences we analyzed account for much of the large performance gap between the two datasets, particularly in the more promising 5-shot setting. As argued earlier, we suggest that Few-Shot TACRED represents more realistic properties of few-shot RC, including realistic non-uniform distribution, "no_relation" instances and inclusion of all entity types, and hence should be utilized in future evaluations.

10 Entity types were automatically identified by SpaCy NER model (Honnibal and Montani, 2017), as well as certain fixed types included in FewRel, such as ranks and titles.

Few-Shot versus Supervised TACRED
We next analyze the impact of category transfer in Few-Shot TACRED. To this end, we apply the same MNAV model in a supervised (non-transfer) setting, termed Supervised MNAV, and compare it to the few-shot MNAV (FSL MNAV). Concretely, we trained the supervised MNAV model on the training instances of the same categories as those in the Few-Shot TACRED test data (vs. training on different background relations in the transfer-based FSL setting). The supervised model was then tested for 5-way 5-shot classification on Few-Shot TACRED, identically to the FSL MNAV 5-way 5-shot testing in Table 4. The results showed a 31 point gap, with the Supervised MNAV yielding 61.19 micro F1 while FSL MNAV scored 30.04, indicating the substantial challenge when moving from the supervised to the category transfer setting.

Qualitative Error Analysis
To obtain some insight on current performance, we manually analyzed 50 episodes for which the model predicted an incorrect support class (precision error) and 50 in which it missed identifying the right support class (recall error). We sampled 1-shot episodes since these can be more easily interpreted, examining a single support instance per class.
For the precision errors, we found a single prominent characteristic. Across all sampled episodes, both the query and the falsely selected support instance shared the same (ordered) pair of entity types. For instance, they may both share the entity types of person and location, albeit having different relations, such as city of death vs. state of residence, or having no meaningful relation for the query (no relation case). This behavior suggests that pre-training, together with fine tuning on the background relations, allowed the BERT-based model to learn to distinguish entity types, to realize their criticality for the RC task, and to successfully match entity types between a query and a support instance. On the other hand, the low overall performance suggests that the model does not recognize well the patterns indicating a target relation based on a small support set. Additional evidence for this conjecture is obtained when examining confused class pairs in the predictions' confusion matrices (1-shot and 5-shot settings). Out of 10 confused class pairs, 8 pairs have matching entity types; in the other two pairs, the location type is confused with organization in the context of school attended, which often carries a sense of location.
For the recall errors, manual inspection of the 50 episodes did not reveal any prominent insights. Therefore, we sampled 100,000 1-shot episodes over which we analyzed various statistics which may be related to recall errors. Of these, we present two analyses that seem to explain aspects of recall misses, in a statistically significant manner (one-tailed t-test at 0.01 significance level), though only to a partial extent.
The first analysis examines the impact of whether the relative order of the two marked argument entities flips between the query and support instance sentences. To that end, we examined the roughly 2,600 episodes in our sample in which the query belongs to one of the support classes. We found that for episodes in which argument order is consistent across the query and support instance, the model identified the correct class in 15.68% of the cases, while when the order is flipped only 10.95% of the episodes are classified correctly. This suggests that a flipped order makes it more challenging for the model to match the relation patterns across the query and support sentences.

The second analysis examines the impact of lexical overlap between the query and support instance. To that end, we compared 300 episodes in which the correct support class was successfully identified (true positive) and 300 in which it was missed (false negative). In each episode, we measured Intersection over Union (IoU) (aka Jaccard Index) for the two sets of lemmas in the query and support instance. As expected, the IoU value was significantly higher for the true positive set (0.17) than for the false negative set (0.12), suggesting that higher lexical match eases recognizing the correct support instance.
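The lexical-overlap measure used in the second analysis is straightforward to compute once lemmas are available. The sketch below assumes lemmatization has been done upstream by some tool and simply takes two lists of lemmas; the example sentences are invented.

```python
from typing import Iterable


def lemma_iou(query_lemmas: Iterable[str], support_lemmas: Iterable[str]) -> float:
    """Intersection over Union (Jaccard index) of two sets of lemmas."""
    a, b = set(query_lemmas), set(support_lemmas)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical, already-lemmatized query and support sentences.
query = ["he", "be", "bear", "in", "paris"]
support = ["she", "be", "bear", "in", "rome"]
print(lemma_iou(query, support))  # 3 shared lemmas out of 7 distinct ones
```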

Conclusions
In this work, we lay out several required criteria for realistic FSL datasets, while proposing a methodology to derive such benchmarks from available datasets designed for supervised learning. We then applied our methodology to the TACRED relation classification dataset, creating a challenging benchmark for future research. Indeed, previous models that achieved impressive results on FewRel, a synthetic dataset for FSL, failed miserably on our naturally distributed dataset. These results call for better models and loss functions for FSL, and indicate that we are far from having satisfying results on this setup. Our methodology may be further applied to additional datasets, enriching the availability of realistic datasets for FSL.
Next, we analyzed the constraints imposed on embedding functions by nearest-neighbor classification schemes, common for FSL. This analysis led us to derive a new method for representing the NOTA category as one or more explicit learned vectors, yielding a novel classification scheme, which achieves new state-of-the-art performance. We suggest that our analysis may further inspire additional innovations in few-shot learning.