We explore few-shot learning (FSL) for relation classification (RC). Focusing on the realistic scenario of FSL, in which a test instance might not belong to any of the target categories (none-of-the-above, NOTA), we first revisit the recent popular dataset structure for FSL, pointing out its unrealistic data distribution. To remedy this, we propose a novel methodology for deriving more realistic few-shot test data from available datasets for supervised RC, and apply it to the TACRED dataset. This yields a new challenging benchmark for FSL-RC, on which state-of-the-art models show poor performance. Next, we analyze classification schemes within the popular embedding-based nearest-neighbor approach for FSL, with respect to the constraints they impose on the embedding space. Triggered by this analysis, we propose a novel classification scheme in which the NOTA category is represented by learned vectors, shown empirically to be an appealing option for FSL.
1 Introduction
We consider relation classification—an important sub-task of relation extraction—in which one is interested in determining, given a text with two marked entities, whether the entities conform to one of a set of pre-determined relations or not. While supervised methods for this task exist and work relatively well (Baldini Soares et al., 2019; Zhang et al., 2018; Wang et al., 2016; Miwa and Bansal, 2016), they require large amounts of training data, which is hard to obtain in practice.
We are therefore interested in a data-lean scenario in which users provide only a handful of training examples for each relation they are interested in. This has been formalized in the machine learning community as few-shot learning (FSL) (§2).
FSL for RC has recently been addressed by Han et al. (2018) and Gao et al. (2019), who introduced the FewRel 1.0 and, shortly after, the FewRel 2.0 challenges, in which researchers are provided with a large labeled dataset of background relations and are tasked with producing strong few-shot classifiers: classifiers that will work well given a few labeled examples of relations not seen in the training set. The task became popular, with the best score on FewRel 1.0 reaching 93.9% accuracy (Baldini Soares et al., 2019), surpassing the human-level performance of 92.2%. Results on FewRel 2.0 are lower, at 80.3% for the best system (Gao et al., 2019), but are still very high considering the difficulty of the task.
Is few-shot relation classification solved? We show that this is far from being the case. We argue that the evaluation protocol in FewRel 1.0 is based on highly unrealistic assumptions about how the models will be used in practice, and while FewRel 2.0 tried to amend it, its evaluation setup remains highly unrealistic (§3.1). Therefore, we propose a methodology to transform supervised datasets into corresponding realistic few-shot evaluation scenarios (§3.2). We then apply our transformation to the supervised TACRED dataset (Zhang et al., 2017) to create such a new few-shot evaluation set (§3.3). Our experiments (§6.2) reveal that indeed, moving to this realistic setup, the performance of existing state-of-the-art (SOTA) models drops considerably, from scores of around 80 F1 (as well as accuracy) to around 30.
A core factor in a realistic few-shot setup is the NOTA (none-of-the-above) option, allowing for the case where a particular test instance does not conform to any of the predefined target relations. Triggered by an analysis of possible decision rules for handling the NOTA category (§5), we propose a novel enhancement that models NOTA by an explicit set of vectors in the embedding space (§5.2). This explicit “NOTA as vectors” approach achieves new SOTA performance on the FewRel 2.0 dataset, and outperforms other models on our new dataset (§6). Yet, the realistic scenario of our TACRED-derived dataset remains far from being solved, calling for substantial future research. We release our models, data, and, more importantly, our data conversion procedure, to encourage such future work.
2 Task Setup and Formulation
2.1 Relation Classification
The relation extraction (RE) task takes as input a set of documents and a list of pre-specified relations, and aims to extract tuples of the form (e1,e2,r) where e1 and e2 are entities and r is a relation that holds between them (r belongs to a pre-specified list of relations of interest). This task is often approached by a pipeline that generates candidate (e1,e2,s) triplets and classifies each one into a relation (or indicates that no relation holds). The task of classifying such triplets into an expressed relation is called relation classification (RC). It is often isolated and addressed on its own, and is also the focus of the current work. Zhang et al. (2017) demonstrate that improvements in RC carry over to improvements in RE.
In the RC task each input xi = (e1,e2,s)i consists of a sentence s with an (ordered) pair of marked entities (each entity is a span over s), and the output is one of |R| + 1 classes, indicating that the entities in s conform to one of the relations in a set R of target relations, or to none of them. We refer to a triplet xi as a relation instance. For example, if the target relations are R = {Owns, WorksFor}, the relation instance “Wired reports that in a surprising reshuffle at Microsoft, Satya Nadella has taken over as the managing director of the company.”, with e1 = Satya Nadella and e2 = Microsoft, should be classified as WorksFor. The same sentence with the entity pair e1 = Satya Nadella and e2 = Wired should be classified as “NoRelation” (NOTA).
2.2 The Few-Shot N-Way K-Shot Setup
As supervised datasets are often hard and expensive to obtain, there is a growing interest in the few-shot scenario, where the user is interested in |R| target-relations, but can provide only a few labeled instances for each relation. In this work, we follow the increasingly popular N-Way K-Shot setup of FSL, proposed by Vinyals et al. (2016) and Snell et al. (2017). This setup was adapted to relation classification, resulting in the FewRel and FewRel 2.0 datasets (Han et al., 2018; Gao et al., 2019). We further discuss the datasets in §3.
A set of target relations Rtarget together with the corresponding support sets is called a scenario S. Given a scenario S, our goal is to create a decision function fS that maps each instance x to a label in Rtarget ∪ {⊥}, where ⊥ indicates “none of the relations in Rtarget”. Let X = x1,…,xm be a set of instances with corresponding true labels r(x1),…,r(xm); our aim is to minimize the average cumulative evaluation loss (1/m) Σi ℓ(fS(xi), r(xi)), where ℓ is a per-instance loss function, usually zero-one loss.
When treating FSL as a transfer learning problem, as we do here, there is also a background set of relations Rbackground, disjoint from the target relation set, for which there is plenty of labeled data available. This data can also be used for constructing the decision function.
The performance of an N-Way K-Shot FSL algorithm on a dataset X is highly dependent on the specific scenario S: both the choice of the N relations that need to be distinguished and the choice of the specific K examples for each relation can greatly influence the results. In a real-life scenario, the user is interested in a specific set of relations and examples, but when developing and evaluating FSL algorithms, we are concerned with the expected performance of a method given an arbitrary set of categories and examples, which can be approximated by averaging the losses over several random scenarios S, each varying the relation set and the example set. In a practical evaluation, the number of N-Way K-Shot scenarios that can be considered is limited, relative to the combinatorial number of possible scenarios. To maximize the number of considered scenarios, we re-write the loss to consider expectations also over the data points: 𝔼S [ 𝔼x ℓ(fS(x), r(x)) ].
This gives rise to an evaluation protocol that considers the loss over many episodes, where each episode is composed of: (1) a random choice of N distinct target relations Rtarget = {c1,…,cN}, ci≠cj; (2) a corresponding random support set of N * K instances (K instances for each relation ci); and (3) a single randomly chosen labeled example considered as a query, (x,r(x)), which does not appear in the support set. To summarize, an evaluation set for N-Way K-Shot FSL is a set of episodes, each consisting of N target relations, K supporting examples for each relation, and a query. For each episode, the algorithm should classify the query instance into one of the relations in the support set, or none of them.
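To make the protocol concrete, the sketch below samples a single N-way K-shot episode in this style, where the query is guaranteed to belong to one of the target relations (the FewRel 1.0 convention). The data layout — a dict mapping each relation to its list of instances — and all names are illustrative, not taken from any released implementation.

```python
import random

def sample_episode(instances_by_relation, n_way=5, k_shot=1, rng=random):
    """Sample one N-way K-shot episode whose query is always drawn from
    one of the N target relations (the FewRel 1.0 convention)."""
    target_relations = rng.sample(list(instances_by_relation), n_way)
    support = {rel: rng.sample(instances_by_relation[rel], k_shot)
               for rel in target_relations}
    in_support = {id(x) for xs in support.values() for x in xs}
    # The query relation is one of the targets; the query instance is not in the support set.
    query_rel = rng.choice(target_relations)
    candidates = [x for x in instances_by_relation[query_rel] if id(x) not in in_support]
    query = rng.choice(candidates)
    return target_relations, support, (query, query_rel)
```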
In practice, the episodes in an evaluation set are obtained by sampling episodes from a labeled dataset. As we discuss in the following section, the specifics of the labeled dataset and the sampling procedure can greatly influence the realism of the evaluation, and the difficulty of the task.
2.3 Low-resource Relation Classification — Related Work
Other than FSL, several setups for investigating RC under low-resource settings have been proposed.
Obamuyide and Vlachos (2019) experimented with limited supervision settings on TACRED. Their setting is different from the transfer-based few-shot setting addressed in our paper, however. In most of their experiments the number of training instances per relation is much higher, not fitting the ad hoc nature of the few-shot setting. Further, they train a model on all classes, not addressing inference on new class types at test time.
Distant supervision is another approach for handling low-resource RC (Mintz et al., 2009). This approach leverages noisy labels for training a model, produced by aligning relation instances to a knowledge-base. Particularly, it considers sentences containing a pair of entities holding a known relation as instances of that relation. For example, a sentence containing the entities ‘Barack Obama’, and ‘Hawaii’ will be labeled as an instance of the born_in relation between these entities, even though that sentence might describe, for example, a later visit of Obama to Hawaii.
Finally, another line of work is the Zero-Shot setup, where the RC task is reduced to another inference task, leveraging trained models for that task. Specifically, Levy et al. (2017) proposed a method that leverages reading comprehension models, while Obamuyide and Vlachos (2018) suggest using textual entailment models.
3 Desired Versus Existing Few-Shot Relation Classification Datasets
A FSL system is intended to be used in a real-life scenario. Thus, evaluation procedures for FSL should attempt to mimic the conditions under which the FSL system will be applied in practice. In a realistic FSL scenario, the user has a set of relations of interest (“target relations”), and can come up with a handful of examples for each. The relations in the set are often related to each other. The user may potentially have access to a labeled dataset of a different set of relations (“background relations”), which they may want to use to train, or to improve, their FSL system.
The resulting classifier will then be applied to unlabeled data aiming to detect new target relations, in which, realistically:
- (a) some relations are rarer than others.
- (b) most instances do not correspond to a target relation.
- (c) many instances may not correspond to a background relation either.
- (d) relation instances may include named entities, as well as pronouns and common noun entities.
Ideally, the episodes in an FSL evaluation should be chosen in a way that reflects (a)–(d) above.1 The first characteristic (a) naturally follows from the non-uniform distribution of relation types in a (non-artificial) text collection. The second point (b) stems from the fact that natural text refers to a broad, inherently unbounded, range of relation types, while in an RC setting, particularly for FSL, there is typically a restricted set of target relations. Similarly, while available RC training sets (for the supervised setting) may annotate more relation types than in a typical few-shot setting, they still contain a limited number of relation types in comparison to the full range of relations expressed in the corpus. This is prototypically evidenced in the naturally distributed RC dataset TACRED (§3.3), where 78.56% of the labels are NOTA (Table 1). Finally, naturally occurring textual relations may relate named entities as well as common nouns or pronouns (d); therefore, we expect the annotated RC dataset entities to include all such entity types.
Table 1: Train/dev/test statistics for the original TACRED and our Few-Shot (FS) TACRED; the percentage in parentheses is the NOTA rate among test instances.

| Dataset | Train | Dev | Test |
|---|---|---|---|
| TACRED | 13,012 | 5,436 | 3,325 (78.56%) |
| FS TACRED | 8,163 | 633 | 804 (94.81%) |
As we show below, existing FSL-RC datasets do not conform to these properties, resulting in artificial—and substantially easier—classification tasks. This in turn leads to inflated accuracy numbers that are not reflective of the real potential performance of a system. We propose a refined sampling procedure that adheres to the realistic setting, and results in a substantially more realistic evaluation set, while conforming to the same N-Way K-Shot protocol. As we show in the experiments section (§6), this setup proves to be substantially more challenging for existing algorithms. We propose to use this procedure for future evaluation of FSL-RC algorithms, and release the corresponding code and data.
3.1 Existing FSL-RC Datasets
An N-Way K-Shot RC dataset was introduced by Han et al. (2018), called FewRel 1.0. The dataset became popular, yet proved to be rather easy: the current best leaderboard entry by Baldini Soares et al. (2019) obtains 93.86% accuracy for 5-way 1-shot, above human performance of 92.2% accuracy. The dataset was then updated to FewRel 2.0 (Gao et al., 2019), using an updated episode sampling procedure (see below), with the current best system obtaining a 5-way 1-shot score of 80.31 (Gao et al., 2019).
Underlying Labeled Data
Both FewRel versions are based on the same underlying labeled dataset containing 100 distinct relations, with 700 instances per relation, totaling 70,000 labeled instances. The sentences are based on Wikipedia, and the entities and relation labels are assigned automatically using Wikidata, followed by a human verification step.
Note that while extensive, each relation type contains the same number of instances, regardless of the true relation distribution in a corpus, resulting in a highly synthetic dataset that contradicts the realistic assumption (a) above. In contrast, instances in supervised RC datasets such as TACRED and DocRED (Zhang et al., 2017; Yao et al., 2019) do respect the relation distribution in a real corpus.
Finally, FewRel target entities are mostly named entities, not including important entity types such as pronouns and common nouns, which are present in supervised RC datasets (including TACRED), thus contradicting assumption (d).
Train/Dev/Test Splits
The 100 relations are split into three disjoint sets, Rtrain, Rdev, and Rtest, consisting of 64, 16, and 20 relations, respectively. The relations in Rtrain and their corresponding instances are used as the labeled corpus of background relations, while evaluation episodes consist of relations in either Rdev or Rtest. We refer to this set (either test or dev) as Reval. Each episode consists of a random subset Rtarget ⊂ Reval.
Sampling Procedures
The episode sampling procedure of FewRel 1.0 works by sampling N relations from Reval, resulting in a target set Rtarget, sampling a corresponding size-K support set for each cj ∈ Rtarget, and then sampling a query example q for which r(q) ∈ Rtarget. That is, the query in each episode is guaranteed to belong to Rtarget. This setup is artificial, negating realistic condition (b) above, and explains the high performance on FewRel 1.0.
NOTA
Following the aforementioned observation, the FewRel 2.0 work introduced a NOTA scenario. Here, after sampling the target relation set Rtarget ⊂ Reval, the query's relation is sampled from Rtarget with probability p and from Reval ∖ Rtarget with probability 1 − p. That is, a fraction 1 − p of the episodes contain a query that does not correspond to any support class, in which case the correct answer is NOTA.
Although a step in the right direction (indeed, results in this setup drop from over 90% to around 80%), this setup is still highly unrealistic: not only are all the NOTA instances guaranteed to express valid relations, they also always come from the same small set, contradicting assumption (c). In a realistic setup, we would expect the vast majority of test instances to be NOTA, but the set of NOTA instances is expected to vary greatly: some of them will correspond to relations from the background relations, some will correspond to unseen relations, and many will not correspond to any concrete relation. Furthermore, some of the NOTA cases will appear in sentences that do contain a target relation, but between different entities. Supervised relation extraction and relation classification datasets reflect this situation, and we argue that FSL evaluation sets should do so as well.
3.2 Better FSL-RC Evaluation Sets
We propose a methodology for transforming a supervised RC dataset into a few-shot RC dataset, while attempting to maintain properties (a)–(d) of the realistic evaluation scenario. This methodology can be applied to existing and future supervised datasets, thus reducing the need to collect new dedicated FSL datasets.
3.2.1 Realistic Underlying Labeled Data
We assume a given supervised dataset, with C categories, divided into train and test sections, where each section contains all C categories, with distinct instances in each section (the typical setting for supervised multi-class classification). Some instances (in all sections) may be labeled with “None-of-the-above” (also known as “other” in the classic supervised setting, or “no relation” in TACRED terminology), hereafter NOTA, meaning these instances do not belong to any of the C categories.
Transformation
We transform the supervised dataset into an FSL dataset containing (as in FewRel) a set of background relations for training and a disjoint set of relations for evaluation. To perform this transformation, we begin by choosing M categories as Reval.2 The remaining C − M categories are designated as background relations Rtrain.3 We now keep the same instance-level train/dev/test splits of the original supervised dataset, but relabel the instances in each section: train set instances whose labels are in Rtrain retain their original labels, while all other training instances are labeled as NOTA; similarly for the test and dev splits. This yields train, dev, and test sets with disjoint label sets, where some of the NOTA instances in one set correspond to labels of the other sets.
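As an illustration, the following sketch performs this relabeling, assuming each split is a list of (instance, label) pairs and that NOTA is written as "no_relation"; the function and variable names are ours, not taken from the released code.

```python
NOTA = "no_relation"

def relabel_split(split, allowed_relations):
    """Keep an instance's original label only if it belongs to this split's
    relation set; every other instance (including originally NOTA-labeled ones)
    becomes NOTA."""
    return [(x, label if label in allowed_relations else NOTA)
            for x, label in split]

# The original instance-level splits are kept; each is relabeled with respect
# to its own disjoint relation set, e.g.:
#   few_shot_train = relabel_split(train, R_train)
#   few_shot_dev   = relabel_split(dev,   R_dev)
#   few_shot_test  = relabel_split(test,  R_test)
```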
Multiple Splits
The choice of relations for each set influences the resulting dataset: some relations are more similar to each other than others, and splits that put several similar relations in an eval set are harder than splits in which similar relations are divided between the train and eval sets. Moreover, as the number of labeled instances differs across relations, splitting by relation results in different numbers of train/dev/test instances. We thus repeat the process several times, each time with a distinct set of eval relations.
3.2.2 Realistic Episode Sampling
To create an episode, we first sample the N * K instances constituting the N support sets, as in previous episodic sampling: sample N out of the M relations, and then sample K instances for each relation from the underlying eval set. However, the query for the episode is then sampled uniformly from all remaining instances in the eval set. If the label of the instance chosen as query differs from the N target relations in the episode, it is labeled as NOTA. This query sampling procedure maintains both the label distribution and the NOTA rate of the underlying supervised dataset.
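The sketch below illustrates the modified sampling; support sets are drawn as before, and the uniform query draw over all remaining eval instances is what preserves the natural label distribution and NOTA rate. As above, the data layout and names are illustrative only.

```python
import random

def sample_realistic_episode(eval_instances, instances_by_relation,
                             n_way=5, k_shot=1, rng=random):
    """Support sets are sampled as in the standard protocol, but the query is
    drawn uniformly from *all* remaining eval instances, so most queries are NOTA."""
    target_relations = rng.sample(list(instances_by_relation), n_way)
    support = {rel: rng.sample(instances_by_relation[rel], k_shot)
               for rel in target_relations}
    in_support = {id(x) for xs in support.values() for x in xs}
    # eval_instances: list of (instance, gold_relation) pairs over the entire eval set.
    remaining = [(x, r) for x, r in eval_instances if id(x) not in in_support]
    query, gold = rng.choice(remaining)
    label = gold if gold in target_relations else "NOTA"
    return target_relations, support, (query, label)
```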
3.3 Few-Shot TACRED: Realistic Few-Shot Relation Classification
We apply our transformation methodology to the TACRED RC dataset (Zhang et al., 2017). The TACRED dataset was collected from a news corpus, with the purpose of extracting relations involving 100 target entities. Accordingly, each sentence containing a mention of one of these target entities was used to generate candidate relation instances for the RC task. The relation label was annotated as one of 41 pre-defined relation categories, when appropriate, or as an additional “no_relation” category. The “no_relation” category corresponds to cases where some other relation type holds between the two arguments, as well as cases in which no relation holds between them; we consider both types of cases as falling under our NOTA category.
We choose M = 10 of the 41 relations for the test set, divide the remaining 31 relations into 25 and 6 for training and development, respectively, and release this split for future research. Table 1 lists the respective numbers of train/dev/test instances in our Few-Shot TACRED, along with the resulting NOTA rate in the test instances, as well as the corresponding numbers for the original TACRED dataset. As expected, in a typical few-shot setting over natural text (as in Few-Shot TACRED, unlike FewRel), where the number of targeted classes (N-way) is small, most instances correspond to the NOTA case. This is indeed illustrated in Table 1: the original TACRED dataset includes 41 target classes, vs. 10 in Few-Shot TACRED, and hence has a lower NOTA rate (conversely, in a 5-way setting, the NOTA rate is even higher, see Figure 2).
Evaluation Sets
For evaluation, we sample a total of 150,000 episodes according to the procedure above. For robustness, we split these into 5 evaluation sets of 30,000 episodes each, and report the mean and standard deviation over the 5 sets. Figure 1 (shown in §1) presents the distribution differences between Few-Shot TACRED and FewRel 2.0 episodes. As we show in Section 6, the Few-Shot TACRED evaluation set proves to be a substantial challenge for few-shot algorithms.
4 Background: Prior Few-Shot RC Models
As mentioned earlier, only a handful of examples are provided for the target classes in the Few-Shot setting. It is therefore quite challenging to utilize these examples effectively for learning or updating model parameters. Consequently, quite a few existing few-shot models, in the machine learning literature as well as in NLP (Vinyals et al., 2016; Ravi and Larochelle, 2017; Baldini Soares et al., 2019), perform a representation learning phase (typically known as embedding learning or metric learning), followed by nearest neighbor classification. Here, model parameters are first learned over the background classes, for which substantial training is available. Then, classification of test instances is based on the trained model, with the hope that this model would generalize reasonably well for the target classes.4
Nearest-neighbor Classification Rule

In this approach, the query and the support instances are mapped into an embedding space by the learned embedding function; each class is represented by the average of its support instance embeddings (its prototype, denoted σc), and the query is assigned to the class whose prototype is most similar to it under a chosen vector similarity function.
Instance Representation
Baldini Soares et al. (2019) further conducted an empirical analysis of embedding functions for few-shot RC. Their most effective embedding method augments the sentence with special tokens at the beginning and at the end of each of the two entities of the relation instance. The instance representation is then obtained by concatenating the representations of the two corresponding entity-start tokens from BERT’s last layer (Devlin et al., 2019). In our experiments, we adopt this embedding function, denoted BERTEM (BERT-based Entity Marking), as well as the use of dot product as the vector similarity function (after reassessing its effectiveness as well).
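A minimal sketch of this nearest-neighbor rule, assuming an embedding function (e.g., BERTEM) has already produced one vector per instance; prototypes are support-set averages and dot product serves as the similarity, as in our experiments. The function name and data layout are ours.

```python
import torch

def classify_query(query_vec, support_vecs_by_class):
    """Average each class's support embeddings into a prototype and return the
    class whose prototype has the highest dot-product similarity with the query."""
    best_class, best_sim = None, float("-inf")
    for cls, vecs in support_vecs_by_class.items():
        prototype = torch.stack(vecs).mean(dim=0)
        sim = torch.dot(query_vec, prototype).item()
        if sim > best_sim:
            best_class, best_sim = cls, sim
    return best_class, best_sim
```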
4.1 FewRel 2.0 BERT Sentence-pair Model
The FewRel 2.0 work presented a model for the NOTA setting, which skips the embedding learning phase (Gao et al., 2019). Instead, it utilizes the embeddings-based next sentence prediction score of BERT (Devlin et al., 2019), as the similarity score between a query and each support set instance. Then, similarly to the approach described above, a nearest-neighbor criterion is applied over the average similarity score between the query and all support instances of each class. A parallel scoring mechanism is implemented to decide whether the NOTA category should be chosen.
4.2 Related FSL Classification Models
In this section we first review some prominent FSL work addressing other machine-learning tasks. Additionally, we compare the notions of Out-Of-Domain (OOD) detection and NOTA detection.
In a recent work on FSL, Tseng et al. (2020) aim to improve generalization abilities by providing supervision for the category transfer phase. In their learning setting, the classes of each training episode are divided into two subsets: the first acts as the “typical” training set while the second simulates the test set. To improve generalization, they add an additional encoding layer that is optimized to maximize performance on the simulated test categories.
Another recent FSL work, addressing text classification, suggests weighing words by their frequency over the training set (Bao et al., 2020b). The model uses two components to classify the given text into one of the episode’s categories. The first component computes the inverse frequency of each support set token over the training set. The second component estimates the inductive level of support set tokens with respect to classification. Finally, the output of these two components is used to train a linear classifier, by which the query is classified.
Out-Of-Domain Detection
The essence of the NOTA category resembles OOD detection, as in both cases the goal is to detect instances not falling under the known categories. Tan et al. (2019) define the OOD classes as the set of all classes that were not part of the training classes (vs. NOTA, which means that the query belongs to none of the support classes given in an episode). In their work, the authors suggest a representation learning approach for OOD detection in text classification. Their method combines hinge loss with the classic cross-entropy loss function. The former is used to push away the representation of the OOD instances, while the latter is used to learn correct classification within the in-domain classes.
5 Classification Rules: Analysis and Extension
In this section, we provide an analytic perspective on the bias that different nearest-neighbor classification rules impose on the learned embedding space. We start with an analysis of the classification rule for the basic few-shot RC setting, without the NOTA category, as applied in prior work (Section 4). This analysis directly follows the constraint presented in the influential work of Weinberger and Saul (2009) and utilized in subsequent work (e.g., Shen et al., 2010; Dhillon et al., 2010). We then extend this analysis to the setting that does include the NOTA category. First, we analyze the straightforward threshold-based approach for this setting. Then, inspired by this analysis, we propose an alternative approach, with a corresponding constraint, which represents the NOTA category by one or more explicit learned vectors. As shown in subsequent sections, this new approach performs consistently better than other methods on both the FewRel 2.0 and our new Few-Shot TACRED benchmarks, and is thus suggested as an appealing approach for few-shot relation classification.
5.1 Constraints Imposed by Nearest-neighbor Classification
Classification without NOTA
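In the notation also used in the Appendix — σr(q) for the prototype of the query's own class and σ¬r(q) for the prototype of any other class in the support set — Inequality (1), the constraint underlying perfect nearest-neighbor classification without NOTA, plausibly takes the following form, for every query q and every possible sampling of the support sets:

sim(q, σr(q)) > sim(q, σ¬r(q)).   (1)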
In the few-shot setting, the embedding function is learned during training, over the training categories. As the learning process tries to optimize classification on the training set, it effectively attempts to learn an embedding function that satisfies the above constraint as much as possible. Indeed, we often observed almost perfect performance over the training data, indicating that, for the training instances, this constraint is mostly satisfied by the learned embedding function. Yet, while it is hoped that the embedding function would properly separate instances of new, previously unseen, classes as well, in practice this holds to a lesser degree, as indicated by lower test performance.
Thresholded Classification with NOTA
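Using the same notation, and consistent with the Appendix proof (which bounds sim(q, σr(q)) from below and sim(q, σ¬r(q)) from above by a single global threshold θ), Inequality (2) for the threshold-based rule plausibly reads, for every query q and every possible sampling of the support sets:

sim(q, σr(q)) > θ > sim(q, σ¬r(q)).   (2)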
We observe that Inequality (2) imposes a global constraint over the embedding space. It implies that the degree to which all classes should be separated from each other is determined, globally, by those queries in the entire space that are the furthest away from their own class support sets. Accordingly, it requires all classes to be positioned equally far from each other, regardless of their own “compactness”. This constitutes a much harsher constraint, and a greater challenge for embedding learning, than Inequality (1), which allows certain classes to be nearer to each other if their within-class similarities are high.
5.2 NOTA As a Vector (NAV)
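With VN denoting the learned NOTA vector, the NAV constraint (Inequality (3)) plausibly reads, for every query q and every possible sampling of the support sets:

sim(q, σr(q)) > sim(q, VN) > sim(q, σ¬r(q)).   (3)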
This constraint implies that, to achieve perfect classification, the similarity between a query and the NOTA vector VN should be smaller than q’s similarity to all possible support sets of its own class, while being larger than its similarity to all support sets of other classes. In comparison to the prior classification rules, this approach does allow instances that are rather close to their class mates to be closer to other classes than instances that are positioned further from some of their class mates, similarly to the lighter constraint in Inequality (1). Yet, to enable such “geometry” of the embedding space, it is also required that instances be positioned appropriately relative to the NOTA vector, in a way that satisfies the two constraints in Inequality (3). Using the NAV approach, it is hoped that the learning process would position the NOTA vector, and adjust the embedding parameters, such that these constraints are mostly satisfied. Overall, the NAV approach imposes different constraints on the similarity space than using a single global classification threshold for the NOTA category (as in Inequality (2)), and it is not clear a priori which approach would be more effective to learn. This question is investigated empirically in Section 6.
5.3 Multiple NOTA Vectors
A natural extension of the NAV approach, denoted MNAV, is to represent the NOTA category by multiple vectors, whose number is treated as an empirically tuned hyper-parameter. During classification, the model picks the vector closest to the query as VN, which accordingly defines sim(q,NOTA); classification then proceeds as in the NAV method. Adding multiple NOTA vectors is expected to effectively ease the embedding space constraints.
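A sketch of the MNAV decision rule under these definitions; the learned NOTA vectors and the class prototypes are assumed to live in the same embedding space, and the tensor layout and names are ours.

```python
import torch

def mnav_classify(query_vec, prototypes, nota_vectors):
    """prototypes: dict mapping each target relation to its support-set prototype.
    nota_vectors: tensor of shape (num_nota_vectors, dim) with the learned NOTA
    vectors. The closest NOTA vector defines sim(q, NOTA); the query is assigned
    to whichever of {target relations, NOTA} is most similar."""
    sims = {rel: torch.dot(query_vec, proto).item()
            for rel, proto in prototypes.items()}
    sims["NOTA"] = torch.max(nota_vectors @ query_vec).item()
    return max(sims, key=sims.get)
```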
5.4 Training Procedure
For training, we use the same episode sampling procedure that generated the dev/test sets, except that the target relations are sampled from a set of train relations, disjoint from the dev/test relations. We define an epoch to include a fixed number of episodes, treated as a tuned hyper-parameter, and independently sample episodes for each epoch. We measure dev set performance after each epoch, and use early stopping. For each episode, we encode the query using the BERTEM encoding function (Baldini Soares et al., 2019), described in §4, and similarly encode each item x in each support set, obtaining for each support set the corresponding average prototype vector σc.
Following Weinberger and Saul (2009), who derived a triplet loss objective, and similar to subsequent lines of work (e.g., Schroff et al., 2015; Hoffer and Ailon, 2015; Ming et al., 2017), we also experimented with adapted versions of the triplet loss. Under this objective, instances not belonging to the same class are pushed apart while same-class instances are pulled together, aiming to reach the desired ordering as in Inequalities (2) and (3). We tried multiple variants of this objective for FSL training, including versions with a margin element, but these experiments yielded consistently lower results than the methods described above.
NOTA Vector Initialization
For the NAV method, we straightforwardly initialize the single NOTA vector randomly. Random initialization of the multiple NOTA vectors in MNAV led to a single vector being dominantly picked as the NOTA vector by the MNAV decision process. Consequently, results were very similar to the (single vector) NAV model. Presumably, this happened because a single random vector turned out to be closest to the sub-space initially populated by the pre-trained BERTEM embedding function. To avoid this, we wish to scatter all the initial vectors within the initially populated subspace. To this end, we initialize a NOTA vector by sampling a relation and then averaging 10 random instances from that relation. We repeat this process for each NOTA vector.
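A sketch of this initialization, assuming access to BERTEM embeddings of the training instances grouped by relation; the sampling sizes follow the description above, while the function name and data layout are ours.

```python
import random
import torch

def init_nota_vectors(embeddings_by_relation, num_vectors, rng=random):
    """Initialize each NOTA vector as the average of 10 randomly chosen instance
    embeddings from a randomly sampled training relation, so that all vectors start
    inside the subspace populated by the pre-trained encoder."""
    vectors = []
    for _ in range(num_vectors):
        rel = rng.choice(list(embeddings_by_relation))
        sample = rng.sample(embeddings_by_relation[rel], 10)
        vectors.append(torch.stack(sample).mean(dim=0))
    return torch.nn.Parameter(torch.stack(vectors))  # learned jointly with the encoder
```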
6 Experiments and Results
In this section, we assess our two main contributions. With respect to our Few-Shot TACRED dataset, we show that models that perform well on FewRel 2.0 perform poorly on this much more realistic setting, leaving a huge gap for improvement by future research. With respect to our proposed NAV modeling approach, we show that it is a viable, and advantageous, alternative to the threshold approach.
Implemented Models
We conduct our investigation in the framework of the common embedding-based approach to FSL, with respect to the MNAV, NAV, and threshold-based methods described in §5. These methods are implemented following the best-performing embedding and similarity methods identified for the state-of-the-art method on FewRel 1.0 (Baldini Soares et al., 2019), namely, BERTEM applied using BERTBASE, and dot product similarity (§4). In addition, we train and evaluate the baseline Sentence-Pair model, described in §4.1.
To select the number of NOTA vectors in the MNAV model, we experimented with 5 different values, ranging from 1 to 20. In practice, the choice of the number of vectors had rather little impact on the results (less than one F1 point). We use the best performing value for this hyperparameter, which was 20.
In terms of memory utilization, 5-way 5-shot episodes require feeding the 25 instances of the support set, in addition to the query instances, into BERT simultaneously, and thus often occupy nearly the entire 32GB of GPU memory. To make the most of the memory already consumed by the support set instances, we include as many queries as we can fit into the GPU’s memory. In our experiments, we construct 3 episodes for each sampled support set (by sampling 3 different queries for it), which fully utilizes the GPU capacity. Since these episodes jointly occupy the entire GPU memory, each batch contains a single sampled support set with its queries.
We further note that it may be possible to perform the N-way classification by transforming it into a pair-wise classification, repeated N times (both in training and evaluation). This technique would reduce memory usage but would increase run-time. As we managed to fit the entire episode into our GPU memory, we followed the standard N-way approach, for faster computation, as previously done by Gao et al. (2019).
Test Methodology and Metrics
As in prior work, evaluation is conducted over randomly sampled episodes from the test data, as described in §2. Prior results for FewRel 2.0 (and FewRel 1.0) were reported in terms of accuracy. However, in realistic, highly imbalanced, relation classification datasets, like our Few-Shot TACRED, accuracy becomes meaningless. Hence, we propose micro F1 over the target relations as a more appropriate measure for future research. Accordingly, we report micro F1 for both datasets, as well as accuracy for the FewRel experiments, for compatibility. For both measures we report average values and standard deviation over 5 different random samples of episodes (Zhang et al., 2018, 2017). In all experiments, we train and evaluate five models and report the results of the median performing model. Unless otherwise mentioned, reported result differences are significant under a one-tailed t-test at the 0.05 significance level.
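For concreteness, the sketch below shows one straightforward way to compute micro F1 over the target relations, treating NOTA as the negative class; this reflects our reading of the metric, and the exact scoring implementation may differ in details.

```python
def micro_f1(gold_labels, pred_labels, nota="NOTA"):
    """Micro F1 over the target relations: NOTA never counts as a true positive,
    only as a missed gold relation or a spurious prediction."""
    tp = sum(1 for g, p in zip(gold_labels, pred_labels) if g == p != nota)
    pred_pos = sum(1 for p in pred_labels if p != nota)
    gold_pos = sum(1 for g in gold_labels if g != nota)
    precision = tp / pred_pos if pred_pos else 0.0
    recall = tp / gold_pos if gold_pos else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```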
6.1 FewRel 2.0 Result
We first confirm the appropriateness of our investigation by comparing performance on the prior FewRel 2.0 test data. Table 2 presents accuracy figures for the two official (synthetic) test NOTA rates of this benchmark. We use a 50% NOTA rate to train all our models, with 6,000 episodes per epoch. As shown, the MNAV model performs best across all FewRel settings, obtaining a new SOTA for this task.8
Table 2: Accuracy (%) on the FewRel 2.0 test set, for the two official NOTA rates.

| Model | 5-way 1-shot, 15% NOTA | 5-way 1-shot, 50% NOTA | 5-way 5-shot, 15% NOTA | 5-way 5-shot, 50% NOTA |
|---|---|---|---|---|
| Sentence-Pair | 77.67 | 80.31 | 84.19 | 86.06 |
| Threshold | 63.41 | 76.48 | 65.43 | 78.95 |
| NAV | 77.17 | 81.47 | 82.97 | 87.08 |
| MNAV | 79.06 | 81.69 | 85.52 | 87.74 |
We next turn to a more comprehensive comparison of the investigated embedding-based few-shot models, namely, threshold-based, NAV, and MNAV, over the publicly available FewRel development set, with a 50% NOTA rate. The results in Table 3 show that, here as well, the MNAV model outperforms the others in both settings. The gap between MNAV and the threshold model is significant in both settings, while the gap relative to the NAV model is significant only in the 5-shot setting.
Table 3: Results on the FewRel 2.0 development set (50% NOTA rate).

| Model | Metric | 5-way 1-shot | 5-way 5-shot |
|---|---|---|---|
| Sentence-Pair | Accuracy | 75.48 ± 0.33% | 78.43 ± 0.25% |
| | F1 | 71.85 ± 0.44% | 75.43 ± 0.31% |
| Threshold | Accuracy | 76.32 ± 0.12% | 80.30 ± 0.09% |
| | F1 | 73.34 ± 0.25% | 78.89 ± 0.11% |
| NAV | Accuracy | 78.54 ± 0.08% | 80.44 ± 0.11% |
| | F1 | 75.00 ± 0.22% | 79.20 ± 0.14% |
| MNAV | Accuracy | 78.23 ± 0.13% | 81.25 ± 0.18% |
| | F1 | 75.22 ± 0.19% | 80.06 ± 0.11% |
6.2 Few-Shot TACRED Results
We compare the MNAV, NAV, Sentence-Pair, and threshold-based models over our more realistic Few-Shot TACRED test set (here, epoch size is 2,000). As seen in Table 4, the MNAV model outperforms the others, as was the case over FewRel 2.0.
Table 4: Micro F1 on the Few-Shot TACRED test set.

| Model | 5-way 1-shot | 5-way 5-shot |
|---|---|---|
| Sentence-Pair | 10.19 ± 0.81% | − |
| Threshold | 6.87 ± 0.48% | 13.57 ± 0.46% |
| NAV | 8.38 ± 0.80% | 18.38 ± 2.01% |
| MNAV | 12.39 ± 1.01% | 30.04 ± 1.92% |
Notably, performance is drastically lower on Few-Shot TACRED. We suggest that this indicates the much more challenging nature of a realistic setting, relative to the FewRel 2.0 setting, while also highlighting the limitations of all current models. We further analyze this performance gap in the next section.
7 Analysis
7.1 Differentiating Characteristics of FewRel vs. Few-Shot TACRED
As seen by comparing Tables 3 and 4, the results on Few-Shot TACRED are drastically lower than those obtained on FewRel 2.0, by at least 50 points. Yet, the performance figures are difficult to compare due to several differences between the datasets, including training size, NOTA rate, and entity types. To analyze the possible impact of these differences, we control for each of them and observe the performance differences. For brevity, we focus on the MNAV model (1-shot and 5-shot).
Training Size
We train the model on FewRel 2.0, taking the same amount of training instances as in Few-Shot TACRED. Compared to full training, results dropped by five micro F1 points in the 1-shot setting and by 1.5 points for 5-shot, suggesting that the training size explains only a small portion of the performance gap between the two datasets.
NOTA Rates
We control for the unrealistic NOTA rate in FewRel 2.0 by training and evaluating our model on higher NOTA rates. The results in Figure 2 indicate that realistic higher NOTA rates are indeed much more challenging: Moving from the original FewRel 50% NOTA rate to the 97.5% rate as in Few-Shot TACRED shrank the performance gap by 33 points in the 1-shot setting and by 35 for 5-shot.
Entity Types
In this experiment, we evaluate performance differences when including all entity types (named entities, common nouns, and pronouns), as in Few-Shot TACRED, versus including only named entities, as in FewRel. To this end, we sampled two corresponding subsets of relation instances from Few-Shot TACRED, of the same size, with either all entity types or named entities only.9 Further, we control for the distributions of relation types in the two subsets, making them equal, since, as discussed in Section 3, this distribution impacts performance in RC datasets.
The impact of entity composition differed between the 1-shot and 5-shot settings. For 1-shot, the named entities subset yielded slightly lower performance (6.65 vs. 9.03 micro F1), which is hard to interpret. For 5-shot, performance on the named entities subset was substantially higher than when including all entity types (33.48 vs. 18.74), possibly suggesting that a larger diversity of entity types is more challenging for the model. In any case, we argue that RC datasets should include all entity types, to reflect real-world corpora.
Summary
Overall, the differences we analyzed account for much of the large performance gap between the two datasets, particularly in the more promising 5-shot setting. As argued earlier, we suggest that Few-Shot TACRED represents more realistic properties of few-shot RC, including a realistic non-uniform relation distribution, “no_relation” instances, and inclusion of all entity types, and hence should be utilized in future evaluations.
7.2 Few-Shot versus Supervised TACRED
We next analyze the impact of category transfer in Few-Shot TACRED. To this end, we apply our same MNAV model in a supervised (non-transfer) setting, termed Supervised MNAV, and compare it to the few-shot MNAV (FSL MNAV). Concretely, we trained the supervised MNAV model on the training instances of the same categories as those in the Few-Shot TACRED test data (vs. training on different background relations in the transfer-based FSL setting). The supervised model was then tested for 5-way 5-shot classification on Few-Shot TACRED, identically to the FSL MNAV 5-way 5-shot testing in Table 4. The results showed a 31 point gap, with the Supervised MNAV yielding 61.19 micro F1 while FSL MNAV scored 30.04, indicating the substantial challenge when moving from the supervised to the category transfer setting.
7.3 Qualitative Error Analysis
To obtain some insight on current performance, we manually analyzed 50 episodes for which the model predicted an incorrect support class (precision error) and 50 in which it missed identifying the right support class (recall error). We sampled 1-shot episodes since these can be more easily interpreted, examining a single support instance per class.
For the precision errors, we found a single prominent characteristic. Across all sampled episodes, both the query and the falsely selected support instance shared the same (ordered) pair of entity types. For instance, they may both share the entity types of person and location, albeit having different relations, such as city of death vs. state of residence, or having no meaningful relation for the query (no relation case). This behavior suggests that pre-training, together with fine tuning on the background relations, allowed the BERT-based model to learn to distinguish entity types, to realize their criticality for the RC task, and to successfully match entity types between a query and a support instance. On the other hand, the low overall performance suggests that the model does not recognize well the patterns indicating a target relation based on a small support set. Additional evidence for this conjecture is obtained when examining confused class pairs in the predictions’ confusion matrices (1-shot and 5-shot settings). Out of 10 confused class pairs, 8 pairs have matching entity types; in the other two pairs, the location type is confused with organization in the context of school attended, which often carries a sense of location.
For the recall errors, manual inspection of the 50 episodes did not reveal any prominent insights. Therefore, we sampled 100,000 1-shot episodes over which we analyzed various statistics that may be related to recall errors. Of these, we present two analyses that seem to explain aspects of recall misses in a statistically significant manner (one-tailed t-test at the 0.01 significance level), though only to a partial extent.
The first analysis examines the impact of whether the relative order of the two marked argument entities flips between the query and support instance sentences. To that end, we examined the roughly 2,600 episodes in our sample in which the query belongs to one of the support classes. We found that for episodes in which the argument order is consistent across the query and support instance, the model identified the correct class in 15.68% of the cases, while when the order is flipped only 10.95% of the episodes are classified correctly. This suggests that a flipped order makes it more challenging for the model to match the relation patterns across the query and support sentences. The second analysis examines the impact of lexical overlap between the query and support instance. To that end, we compared 300 episodes in which the correct support class was successfully identified (true positive) and 300 in which it was missed (false negative). In each episode, we measured Intersection over Union (IoU, aka Jaccard Index) between the two sets of lemmas in the query and the support instance. As expected, the IoU value was significantly higher for the true positive set (0.17) than for the false negative set (0.12), suggesting that higher lexical overlap makes it easier to recognize the correct support instance.
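The lexical-overlap measure used in the second analysis is the standard Jaccard index over lemma sets; a minimal sketch (with names of our choosing):

```python
def lemma_iou(query_lemmas, support_lemmas):
    """Intersection over Union (Jaccard index) of the two lemma sets."""
    q, s = set(query_lemmas), set(support_lemmas)
    return len(q & s) / len(q | s) if q | s else 0.0
```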
8 Conclusions
In this work, we propose several criteria required of realistic FSL datasets, along with a methodology to derive such benchmarks from available datasets designed for supervised learning. We then applied our methodology to the TACRED relation classification dataset, creating a challenging benchmark for future research. Indeed, previous models that achieved impressive results on FewRel, a synthetic dataset for FSL, failed miserably on our naturally distributed dataset. These results call for better models and loss functions for FSL, and indicate that we are far from having satisfying results in this setup. Our methodology may be further applied to additional datasets, enriching the availability of realistic datasets for FSL.
Next, we analyzed the constraints imposed on embedding functions by nearest-neighbor classification schemes, common in FSL. This analysis led us to derive a new method for representing the NOTA category as one or more explicit learned vectors, yielding a novel classification scheme, which achieves new state-of-the-art performance. We suggest that our analysis may further inspire additional innovations in few-shot learning.
Notes
An additional concern of a realistic setup, which we do not consider in this work, is the accuracy of the entity extractor that marks entity boundaries and assigns entity types, prior to the RC setup.
In practice we have MT categories for test and MD for dev; we refer to both as eval for brevity.
To perform the N-Way K-Shot setup, M is required to be larger than N; in case the original data has a NOTA label, M may be equal to N.
While in the remainder of this paper we focus on this similarity-based approach, it is worth noting that there exist other approaches for FSL, which further utilize the few labeled support examples. These include data augmentation methods, which generate additional examples based on the few initial ones, as well as optimization-based methods (Ravi and Larochelle, 2017; Finn et al., 2017), where the model does utilize the small support sets of the target classes for parameter learning. Integrating our contributions with these approaches is left for future work.
Notice that we drop the margin element in the adapted inequality, as it is not needed for the analytic purpose of our constraint.
Proof provided in the Appendix.
Our MNAV results are also reported at the official FewRel 2.0 leader-board, as Anonymous Cat, at https://thunlp.github.io/2/fewrel2_nota.html. We note that the FewRel test set is kept hidden, where models are submitted to the FewRel authors, who produce (only) accuracy scores.
Entity types were automatically identified by the SpaCy NER model (Honnibal and Montani, 2017), as well as certain fixed types included in FewRel, such as ranks and titles.
References
A Appendix
A.1 Proof of Inequality 2
We prove that the two sides of the inequality are necessary and sufficient to guarantee perfect classification by the threshold-based classification rule, over all possible episodes in a given dataset.
We first prove necessity. As the LHS refers to σr(q), it is relevant only for episodes where q belongs to one of the support classes. If it is violated for some episode, then that episode cannot be classified to r(q) (the correct class) by the threshold-based classification rule. As for RHS necessity, consider an episode in which sim(q,σ¬r(q)) > θ, violating the RHS. Without loss of generality, we can construct a possible episode with the same q and σ¬r(q) whose correct classification is NOTA (making sure to exclude r(q) from the support classes). This episode cannot be correctly classified as NOTA by the classification rule, since q’s similarity to at least one class, ¬r(q), surpasses θ.
To prove sufficiency, we consider the two cases where an episode’s correct classification is either NOTA or one of the support classes. If the correct classification is NOTA, then r(q) is not within the support set. The RHS then guarantees that sim(q,σ¬r(q)) < θ for all support classes, implying a correct NOTA classification. Otherwise, the correct classification is r(q), one of the support classes. In this case, the LHS guarantees excluding a NOTA classification, while the RHS excludes classification to any category other than r(q). QED.
A.2 Sentence-Pair High GPU Demand
The Sentence-Pair model (Gao et al., 2019) requires at least twice the GPU memory of a standard embedding learning model, such as the threshold model (described in §5). The higher memory demand arises from feeding BERT with the concatenation of each support instance with each query instance. This concatenation effectively doubles the average input sentence length. Due to the Transformer architecture, doubling the input sentence length requires considerably more GPU memory; in particular, the self-attention layers require roughly four times more memory when fed with a double-length sequence. Hence, representation learning models, which embed a single instance into an embedded vector space, are more memory efficient than the sentence-pair model.
As mentioned in Section 6.2, we could not train the sentence-pair model for the Few-Shot TACRED 5-shot setting, due to memory limitations, even though we used an NVIDIA TESLA V100-32GB GPU. This stems from the fact that the average sentence length in Few-Shot TACRED is higher than in FewRel; combined with the higher memory consumption of the sentence-pair model, the episodes did not fit into our server’s memory.