Abstract
Interpretable rationales for model predictions are crucial in practical applications. We develop neural models that possess an interpretable inference process for dependency parsing. Our models adopt instance-based inference, where dependency edges are extracted and labeled by comparing them to edges in a training set. The training edges are explicitly used for the predictions; thus, it is easy to grasp the contribution of each edge to the predictions. Our experiments show that our instance-based models achieve accuracy competitive with standard neural models and provide reasonably plausible instance-based explanations.
1 Introduction
While deep neural networks have improved prediction accuracy in various tasks, the rationales underlying their predictions remain difficult for humans to understand (Lei et al., 2016). In practical applications, interpretable rationales play a crucial role in driving humans’ decisions and promoting human–machine cooperation (Ribeiro et al., 2016). From this perspective, the utility of instance-based learning (Aha et al., 1991), a traditional machine learning method, has attracted renewed attention (Papernot and McDaniel, 2018).
Instance-based learning is a method that learns similarities between training instances and infers a value or class for a test instance on the basis of its similarities to the training instances. Standard neural models encode all the knowledge in their parameters, making it challenging to determine what knowledge is stored and used for predictions (Guu et al., 2020). By contrast, models with instance-based inference explicitly use training instances for predictions and can exhibit the instances that contribute most to each prediction. These instances serve as an answer to the question: why did the model make such a prediction? This type of explanation is called an instance-based explanation (Caruana et al., 1999; Baehrens et al., 2010; Plumb et al., 2018); it facilitates users’ understanding of model predictions and allows users to make decisions with higher confidence (Kolodneer, 1991; Ribeiro et al., 2016).
It is not trivial to combine neural networks with instance-based inference processes while maintaining high prediction accuracy. Recent studies in image recognition seek to develop such methods (Wang et al., 2014; Hoffer and Ailon, 2015; Liu et al., 2017; Wang et al., 2018; Deng et al., 2019); this paradigm is called deep metric learning. Compared to image recognition, there are far fewer studies on deep metric learning in natural language processing (NLP). As a few exceptions, Wiseman and Stratos (2019) and Ouchi et al. (2020) developed neural models with an instance-based inference process for sequence labeling tasks and reported that their models offer high explainability without sacrificing prediction accuracy.
As a next step from targeting consecutive tokens, we study instance-based neural models for relations between discontinuous elements. To correctly recognize relations, systems need to capture associations between elements. As an example of relation recognition, we address dependency parsing, where systems seek to recognize binary relations between tokens (hereafter edges). Traditionally, dependency parsers have been a useful tool for text analysis. An unstructured text of interest is parsed, and its structure leads users to a deeper understanding of the text. By successfully introducing instance-based models to dependency parsing, users can extract dependency edges along with similar edges as a rationale for the parse, which further helps the process of text analysis.
In this paper, we develop new instance-based neural models for dependency parsing, equipped with two inference modes: (i) explainable mode and (ii) fast mode. In the explainable mode, our models make use of similarities between the candidate edge and each edge in a training set. By looking at the similarities, users can quickly check which training edges contribute significantly to the prediction. In the fast mode, our models run as fast as standard neural models, whereas general instance-based models are much slower because their time complexity depends on the number of training instances. The fast mode is motivated by a practical observation: in many cases, users want only predictions, and only when a prediction seems suspicious do they want to check its rationale. Thus, the fast mode does not offer rationales but instead enables faster parsing that outputs exactly the same predictions as the explainable mode. Users can freely switch between the two modes according to their purposes. This property is realized by exploiting the linearity of score computation in our models, which avoids comparing a candidate edge to each training edge one by one at test time (see Section 4.4 for details).
Our experiments on multilingual datasets show that our models achieve accuracy competitive with standard neural models. In addition, we shed light on the plausibility of instance-based explanations, which has been underinvestigated in dependency parsing. We verify whether our models meet a minimal requirement related to plausibility (Hanawa et al., 2021). Additional analysis reveals the existence of hubs (Radovanovic et al., 2010), a small number of specific training instances that frequently appear as nearest neighbors, and shows that hubs severely degrade the plausibility. Our main contributions are as follows:
This is the first work to develop and study instance-based neural models1 for dependency parsing (Section 4);
Our empirical results show that our instance-based models achieve competitive accuracy with standard neural models (Section 6.1);
Our analysis reveals that L2-normalization of edge representations suppresses the occurrence of hubs and, as a result, improves the plausibility of instance-based explanations (Sections 6.2 and 6.3).
2 Related Work
2.1 Dependency Parsing
There are two major paradigms for dependency parsing (Kübler et al., 2009): (i) the transition- based paradigm (Nivre, 2003; Yamada and Matsumoto, 2003) and (ii) the graph-based paradigm (McDonald et al., 2005). Recent literature often adopts the graph-based paradigm and achieves high accuracy (Dozat and Manning, 2017; Zhang et al., 2017; Hashimoto et al., 2017; Clark et al., 2018; Ji et al., 2019; Zhang et al., 2020). The first-order edge-factored models under this paradigm factorize the score of a dependency tree into independent scores of single edges (McDonald et al., 2005). The score of each edge is computed on the basis of its edge feature. This decomposable property is preferable for our work because we want to model similarities between single edges. Thus, we adopt the basic framework of the first-order edge-factored models for our instance-based models.
2.2 Instance-Based Methods in NLP
Traditionally, instance-based methods (memory-based learning) have been applied to a variety of NLP tasks (Daelemans and Van den Bosch, 2005), such as part-of-speech tagging (Daelemans et al., 1996), NER (Tjong Kim Sang, 2002; De Meulder and Daelemans, 2003; Hendrickx and van den Bosch, 2003), partial parsing (Daelemans et al., 1999; Sang, 2002), phrase-structure parsing (Lebowitz, 1983; Scha et al., 1999; Kübler, 2004; Bod, 2009), word sense disambiguation (Veenstra et al., 2000), semantic role labeling (Akbik and Li, 2016), and machine translation (MT) (Nagao, 1984; Sumita and Iida, 1991).
Nivre et al. (2004) proposed an instance-based (memory-based) method for transition-based dependency parsing, in which the next action of a transition-based parser is selected at each step by comparing the current parser configuration to the configurations in the training set. Each parser configuration is treated as an instance and serves as a rationale for predicted actions, but not for predicted edges. Because parser configurations do not map one-to-one to predicted edges, it is difficult to interpret which configurations contribute significantly to each edge prediction. By contrast, since we adopt the graph-based paradigm, our models can naturally treat each edge as an instance and exhibit similar edges as rationales for edge predictions.
2.3 Instance-Based Neural Methods in NLP
Most of the studies above were published before the current deep learning era. Very recently, instance-based methods have been revisited and combined with neural models in language modeling (Khandelwal et al., 2019), MT (Khandelwal et al., 2020), and question answering (Lewis et al., 2020). They augment a main neural model with a non-parametric sub-module that retrieves auxiliary objects, such as similar tokens and documents. Guu et al. (2020) proposed to parameterize and learn the sub-module for a target task.
These studies assume a different setting from ours. There is no ground-truth supervision signal for retrieval in their setting, so they adopt non-parametric approaches or indirectly train the sub-module to help a main neural model through the supervision signal of the target task. In our setting, the main neural model itself performs retrieval and is directly trained with ground-truth objects (annotated dependency edges). Thus, our findings and insights are orthogonal to theirs.
2.4 Deep Metric Learning
Our work can be categorized into deep metric learning research in terms of the methodological perspective. Although the origins of metric learning can be traced back to some earlier work (Short and Fukunaga, 1981; Friedman et al., 1994; Hastie and Tibshirani, 1996), the pioneering work is Xing et al. (2002).2 Since then, many methods using neural networks for metric learning have been proposed and studied.
Deep metric learning methods can be categorized into two classes from the training loss perspective (Sun et al., 2020): (i) learning with class-level labels and (ii) learning with pair-wise labels. Given class-level labels, the first one learns to classify each training instance into its target class with a classification loss, for example, Neighbourhood Component Analysis (NCA) (Goldberger et al., 2005), L2-constrained softmax loss (Ranjan et al., 2017), SphereFace (Liu et al., 2017), CosFace (Wang et al., 2018), and ArcFace (Deng et al., 2019). Given pair-wise labels, the second one learns pair-wise similarity (the similarity between a pair of instances), for example, contrastive loss (Hadsell et al., 2006), triplet loss (Wang et al., 2014; Hoffer and Ailon, 2015), N-pair loss (Sohn, 2016), and multi-similarity loss (Wang et al., 2019). Our method is categorized into the first group because it adopts a classification loss (Section 4).
2.5 Neural Models Closely Related to Ours
Among the metric learning methods above, NCA (Goldberger et al., 2005) shares the same spirit as our models. In this framework, models learn to map instances with the same label to the neighborhood in a feature space. Wiseman and Stratos (2019) and Ouchi et al. (2020) developed NCA-based neural models for sequence labeling. We discuss the differences between their models and ours later in more detail (Section 4.5).
3 Dependency Parsing Framework
We adopt a two-stage approach (McDonald et al., 2006; Zhang et al., 2017): we first identify dependency edges (unlabeled dependency parsing) and then classify the identified edges (labeled dependency parsing). More specifically, we solve edge identification as head selection and solve edge classification as multi-class classification.3
3.1 Edge Identification
To identify unlabeled edges, we adopt the head selection approach (Zhang et al., 2017), in which a model learns to select the correct head of each token in a sentence. This simple approach enables us to train accurate parsing models in a GPU-friendly way. We learn the representation for each edge to be discriminative for identifying correct heads.
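As a concrete illustration of head selection (a minimal sketch, not the authors' released implementation), the snippet below scores every candidate head for each token and trains with a per-token softmax over heads; the tensor shapes and function names are assumptions.

```python
import torch
import torch.nn.functional as F

def head_selection_loss(edge_scores: torch.Tensor, gold_heads: torch.Tensor) -> torch.Tensor:
    """Head selection as per-token classification over candidate heads.

    edge_scores: (T, T+1) tensor; edge_scores[t, h] scores candidate head h
                 (index 0 = ROOT) for the t-th dependent token.
    gold_heads:  (T,) tensor of gold head indices.
    """
    # The softmax over candidate heads is applied inside cross_entropy.
    return F.cross_entropy(edge_scores, gold_heads)

# Toy usage: a 5-token sentence has 5 dependents and 6 candidate heads each.
scores = torch.randn(5, 6)
gold = torch.tensor([0, 2, 1, 5, 3])
loss = head_selection_loss(scores, gold)
```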
3.2 Label Classification
4 Instance-Based Scoring Methods
4.1 Edge Scoring
We would like to assign a higher score to the correct edge than to other candidates (Eq. 1). To do so, we compute similarities between each candidate edge and the ground-truth edges in a training set (hereafter, training edges). By summing the similarities, we obtain a score that indicates how likely the candidate edge is to be the correct one.
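A minimal sketch of this instance-based scoring, under the assumption that edges are already encoded as fixed-size vectors; the function and argument names are ours, and the rescaling factor τ used with the cosine similarity (Table 2) is omitted.

```python
import torch
import torch.nn.functional as F

def instance_based_edge_score(candidate: torch.Tensor,
                              support: torch.Tensor,
                              similarity: str = "dot"):
    """Score a candidate edge by summing its similarities to training edges.

    candidate: (d,) representation of the candidate edge.
    support:   (S, d) representations of the gold edges in the training set.
    Returns the scalar score and the per-edge similarities, which can be
    shown to users as instance-based explanations.
    """
    if similarity == "cos":
        candidate = F.normalize(candidate, dim=-1)
        support = F.normalize(support, dim=-1)
    sims = support @ candidate      # (S,) similarity to each training edge
    return sims.sum(), sims
```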
4.2 Label Scoring
4.3 Edge Representation
4.4 Fast Mode
Do users want rationales for all the predictions? Maybe not. In many cases, all they want to do is parse sentences as fast as possible; only when they find a suspicious prediction will they check the rationale for it. To meet this demand, our parser provides two modes: (i) explainable mode and (ii) fast mode. The explainable mode, as described in the previous subsections, exhibits similar training instances as rationales, but its time complexity depends on the size of the training set. By contrast, the fast mode provides no rationales but parses faster than the explainable mode while outputting exactly the same predictions. At test time, users can therefore switch freely between the modes: for example, they first use the fast mode, and if they find a suspicious prediction, they use the explainable mode to obtain the rationale for it.
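The equivalence of the two modes rests on the linearity of the dot-product score: the sum of similarities to all support edges equals a single dot product with the pre-summed support representation. A small numerical sketch follows (sizes and names are assumptions; for the cosine similarity, the support vectors would be L2-normalized before summing):

```python
import torch

torch.manual_seed(0)
d = 128
candidate = torch.randn(d)            # one candidate edge representation
support = torch.randn(10_000, d)      # all training (support) edge representations

# Explainable mode: one similarity per training edge, O(|training set|) per candidate.
explainable_score = (support @ candidate).sum()

# Fast mode: precompute the summed support representation once, then a single
# dot product per candidate edge, independent of the training-set size.
support_sum = support.sum(dim=0)
fast_score = candidate @ support_sum

assert torch.allclose(explainable_score, fast_score, rtol=1e-3)
```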
4.5 Relations to Existing Models
The Closest Models to Ours.
Standard Models Using Weights.
5 Experimental Setup
5.1 Data
We use the English Penn Treebank (PTB) (Marcus et al., 1993) and Universal Dependencies (UD) (McDonald et al., 2013). Following previous studies (Kulmizev et al., 2019; Smith et al., 2018; de Lhoneux et al., 2017), we choose a variety of 13 languages8 from UD v2.7. Table 1 shows information about each dataset. We follow the standard training-development-test splits.
Language | Treebank | Family | Order | Train
---|---|---|---|---
Arabic | PADT | non-IE | VSO | 6.1k |
Basque | BDT | non-IE | SOV | 5.4k |
Chinese | GSD | non-IE | SVO | 4.0k |
English | EWT | IE | SVO | 12.5k |
Finnish | TDT | non-IE | SVO | 12.2k |
Hebrew | HTB | non-IE | SVO | 5.2k |
Hindi | HDTB | IE | SOV | 13.3k |
Italian | ISDT | IE | SVO | 13.1k |
Japanese | GSD | non-IE | SOV | 7.1k |
Korean | GSD | non-IE | SOV | 4.4k |
Russian | SynTagRus | IE | SVO | 48.8k |
Swedish | Talbanken | IE | SVO | 4.3k |
Turkish | IMST | non-IE | SOV | 3.7k |
5.2 Neural Encoder Architecture
To compute $\mathbf{h}^{\mathrm{dep}}$ and $\mathbf{h}^{\mathrm{head}}$ (in Eq. 11), we adopt the encoder architecture proposed by Dozat and Manning (2017). First, we map the input sequence $x = (x_0, \ldots, x_T)$9 to a sequence of token representations $\mathbf{h}^{\mathrm{token}}_{0:T} = (\mathbf{h}^{\mathrm{token}}_0, \ldots, \mathbf{h}^{\mathrm{token}}_T)$, where each $\mathbf{h}^{\mathrm{token}}_t = [\mathbf{e}_t; \mathbf{c}_t; \mathbf{b}_t]$, and $\mathbf{e}_t$, $\mathbf{c}_t$, and $\mathbf{b}_t$ are computed by word embeddings,10 a character-level CNN, and BERT (Devlin et al., 2019),11 respectively. Second, the sequence $\mathbf{h}^{\mathrm{token}}_{0:T}$ is fed to a bidirectional LSTM (BiLSTM) (Graves et al., 2013) to compute contextual representations: $\mathbf{h}^{\mathrm{lstm}}_{0:T} = (\mathbf{h}^{\mathrm{lstm}}_0, \ldots, \mathbf{h}^{\mathrm{lstm}}_T) = \mathrm{BiLSTM}(\mathbf{h}^{\mathrm{token}}_{0:T})$. Finally, each $\mathbf{h}^{\mathrm{lstm}}_t \in \mathbb{R}^{2d}$ is transformed as $\mathbf{h}^{\mathrm{dep}}_t = W^{\mathrm{dep}} \mathbf{h}^{\mathrm{lstm}}_t$ and $\mathbf{h}^{\mathrm{head}}_t = W^{\mathrm{head}} \mathbf{h}^{\mathrm{lstm}}_t$, where $W^{\mathrm{dep}}, W^{\mathrm{head}} \in \mathbb{R}^{d \times 2d}$ are parameter matrices.
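A minimal PyTorch sketch of this encoder, taking precomputed word, character-CNN, and BERT vectors as inputs; the class name, argument names, and default dimensions are assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class EdgeEncoder(nn.Module):
    """BiLSTM encoder with separate dependent/head projections."""

    def __init__(self, in_dim: int, lstm_dim: int = 300, out_dim: int = 300,
                 num_layers: int = 2):
        super().__init__()
        self.bilstm = nn.LSTM(in_dim, lstm_dim, num_layers=num_layers,
                              batch_first=True, bidirectional=True)
        self.w_dep = nn.Linear(2 * lstm_dim, out_dim, bias=False)
        self.w_head = nn.Linear(2 * lstm_dim, out_dim, bias=False)

    def forward(self, word_emb, char_emb, bert_emb):
        # h_t^token = [e_t; c_t; b_t]
        token_repr = torch.cat([word_emb, char_emb, bert_emb], dim=-1)
        lstm_out, _ = self.bilstm(token_repr)          # (B, T, 2 * lstm_dim)
        return self.w_dep(lstm_out), self.w_head(lstm_out)
```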
5.3 Mini-Batching
We train models with the mini-batch stochastic gradient descent method. To make the current mini-batch at each time step, we follow a standard technique for training instance-based models (Hadsell et al., 2006; Oord et al., 2018).
At training time, the mini-batch at each time step consists of query and support sentences. A model encodes the sentences and computes edge representations, which are used for computing similarities between each candidate edge in the query sentences and each gold edge in the support sentences. Owing to the memory limitation of GPUs, we randomly sample a subset of the training set at each time step. In edge identification, we randomly sample N query sentences and M support sentences from the training set, and use the gold edges in the sampled support sentences as the support set in Eq. 7. In label classification, we would like to guarantee that the support set in every mini-batch always contains at least one edge for each label. To do so, we randomly sample U support edges from the training set for each label r and use them as the support set in Eq. 9. Because each sampled support edge belongs to some training sentence, we put that sentence into the mini-batch to compute the edge's representation. In practice, we use N = 32 query sentences in both edge identification and label classification, M = 10 support sentences in edge identification,12 and U = 1 support edge (sentence) for each label in label classification.13
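A sketch of the mini-batch sampling described above; the data structures (a list of sentences with their gold edges, and an index of gold edges grouped by label) and the function names are assumptions.

```python
import random

def sample_edge_id_batch(train_sentences, n_query=32, m_support=10):
    """Edge identification: N query sentences and M support sentences."""
    queries = random.sample(train_sentences, n_query)
    supports = random.sample(train_sentences, m_support)
    return queries, supports

def sample_label_cls_batch(train_sentences, edges_by_label, n_query=32, u=1):
    """Label classification: N query sentences plus, for each label, U support
    edges together with the sentences that contain them."""
    queries = random.sample(train_sentences, n_query)
    supports = {label: random.sample(edges, u)
                for label, edges in edges_by_label.items()}
    return queries, supports
```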
At test time, we encode each test (query) sentence and compute the representation for each candidate edge on the fly. The representation is then compared to the precomputed support edge representations (Eq. 12). To precompute them, we first encode all the training sentences and obtain the gold edge representations. In edge identification, we sum all of them to obtain a single support edge representation. In label classification, we similarly sum only the edge representations with label r to obtain one support representation per label.14
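A sketch of this test-time precomputation for label classification (edge identification is the same with a single summed vector); the normalization flag corresponds to the cosine-similarity variant, and all names are assumptions.

```python
import torch
import torch.nn.functional as F

def precompute_label_supports(edge_reprs: torch.Tensor,
                              edge_labels: torch.Tensor,   # (S,) int64 label ids
                              num_labels: int,
                              normalize: bool = False) -> torch.Tensor:
    """Sum the training-edge representations per label: returns (num_labels, d)."""
    if normalize:                      # cosine-similarity variant
        edge_reprs = F.normalize(edge_reprs, dim=-1)
    supports = edge_reprs.new_zeros(num_labels, edge_reprs.size(1))
    supports.index_add_(0, edge_labels, edge_reprs)
    return supports

def label_scores(candidate: torch.Tensor, supports: torch.Tensor) -> torch.Tensor:
    """Fast-mode scores: one dot product per label; argmax is the prediction."""
    return supports @ candidate
```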
5.4 Training Configuration
Table 2 lists the hyperparameters. To optimize the parameters, we use Adam (Kingma and Ba, 2014) with β1 = 0.9 and β2 = 0.999. The initial learning rate is η0 = 0.001 and is updated at each epoch as ηt = η0/(1 + ρt), where ρ = 0.05 and t is the number of completed epochs. The gradient clipping value is 5.0 (Pascanu et al., 2013), and the number of training epochs is 100. We save the parameters that achieve the best score on each development set and evaluate them on each test set. Training takes less than one day on a single NVIDIA Tesla V100 GPU (on a DGX-1 server).
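The learning-rate schedule above can be expressed compactly; a short sketch under the stated hyperparameters:

```python
def learning_rate(epoch: int, eta0: float = 0.001, rho: float = 0.05) -> float:
    """eta_t = eta_0 / (1 + rho * t), where t is the number of completed epochs."""
    return eta0 / (1.0 + rho * epoch)

# e.g., epoch 0 -> 0.001, epoch 10 -> ~0.000667, epoch 99 -> ~0.000168
# In PyTorch, the same schedule corresponds to
# torch.optim.lr_scheduler.LambdaLR(optimizer, lambda t: 1.0 / (1.0 + 0.05 * t)).
```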
Name | Value
---|---
Word Embedding | GloVe (PTB) / fastText (UD) |
BERT | BERT-Base |
CNN window size | 3 |
CNN filters | 30 |
BiLSTM layers | 2 |
BiLSTM units | 300 dimensions |
Optimization | Adam |
Learning rate | 0.001 |
Rescaling factor τ | 64 |
Dropout ratio | {0.1, 0.2, 0.3} |
6 Results and Discussion
6.1 Prediction Accuracy on Benchmark Tests
We report unlabeled attachment scores (UAS) and labeled attachment scores (LAS) averaged across three training runs with different random seeds. We compare six systems, each consisting of an edge identification model and a label classification model. For reference, we list the results of the graph-based parser with BERT in Kulmizev et al. (2019), whose architecture is the most similar to ours.
Table 3 shows UAS and LAS for these systems. The systems WWd and WWc are the standard ones that consistently use the weight-based scores (Eqs. 13 and 14) during learning and inference. Between these systems, the choice of similarity function makes no difference in accuracy: the dot product and the cosine similarity are on par. The systems WId and WIc use the weight-based scores during learning and the instance-based ones during inference. While WId (dot product) achieved UAS and LAS competitive with the standard weight-based system WWd, WIc (cosine similarity) achieved lower accuracies than WWc. The systems IId and IIc consistently use the instance-based scores during learning and inference, and both keep accuracies competitive with the standard weight-based systems WWd and WWc.
System ID | Kulmizev+’19 | WWd | WWc | WId | WIc | IId | IIc
---|---|---|---|---|---|---|---
Learning | weight-based | weight-based | weight-based | weight-based | weight-based | instance-based | instance-based
Inference | weight-based | weight-based | weight-based | instance-based | instance-based | instance-based | instance-based
Similarity | dot | dot | cos | dot | cos | dot | cos
PTB-English | – | 96.4/95.3 | 96.4/95.3 | 96.4/94.4 | 93.0/91.8 | 96.4/95.3 | 96.4/95.3 |
UD-Average | – /84.9 | 89.0/85.6 | 89.0/85.6 | 89.0/85.2 | 83.0/79.5 | 89.3/85.7 | 89.0/85.5 |
UD-Arabic | – /81.8 | 87.8/82.1 | 87.8/82.1 | 87.8/81.6 | 84.9/79.0 | 88.0/82.1 | 87.6/81.9 |
UD-Basque | – /79.8 | 84.9/81.1 | 84.9/80.9 | 84.9/80.6 | 82.0/77.9 | 85.1/80.9 | 85.0/80.8 |
UD-Chinese | – /83.4 | 85.6/82.3 | 85.8/82.4 | 85.7/81.6 | 80.9/77.3 | 86.3/82.8 | 85.9/82.5 |
UD-English | – /87.6 | 90.9/88.1 | 90.7/88.0 | 90.9/87.8 | 88.1/85.3 | 91.1/88.3 | 91.0/88.2 |
UD-Finnish | – /83.9 | 89.4/86.6 | 89.1/86.3 | 89.3/86.1 | 84.1/81.2 | 89.6/86.6 | 89.4/86.4 |
UD-Hebrew | – /85.9 | 89.4/86.4 | 89.5/86.5 | 89.4/85.9 | 82.7/79.7 | 89.8/86.7 | 89.6/86.6 |
UD-Hindi | – /90.8 | 94.8/91.7 | 94.8/91.7 | 94.8/91.4 | 91.4/88.0 | 94.9/91.8 | 94.9/91.6 |
UD-Italian | – /91.7 | 94.1/92.0 | 94.2/92.1 | 94.1/91.9 | 91.5/89.4 | 94.3/92.2 | 94.1/92.0 |
UD-Japanese | – /92.1 | 94.3/92.8 | 94.5/93.0 | 94.3/92.7 | 92.5/90.9 | 94.6/93.1 | 94.4/92.8 |
UD-Korean | – /84.2 | 88.0/84.4 | 87.9/84.3 | 88.0/84.2 | 84.3/80.4 | 88.1/84.4 | 88.2/84.5 |
UD-Russian | – /91.0 | 94.2/92.7 | 94.1/92.7 | 94.2/92.4 | 57.7/56.5 | 94.3/92.8 | 94.1/92.6 |
UD-Swedish | – /86.9 | 90.3/87.6 | 90.3/87.5 | 90.4/87.1 | 88.6/85.8 | 90.5/87.5 | 90.4/87.5 |
UD-Turkish | – /64.9 | 73.0/65.3 | 73.2/65.4 | 73.1/64.5 | 69.9/61.9 | 73.7/65.5 | 72.9/64.7 |
Out-of-Domain Robustness.
We evaluate the robustness of our instance-based models in out-of-domain settings by using the five domains of UD-English: we train each model on the training set of the source domain “Yahoo! Answers” and test it on the test sets of the target domains Emails, Newsgroups, Reviews, and Weblogs. As Table 4 shows, the out-of-domain robustness of our instance-based models is comparable to that of the weight-based models. The same tendency is observed when using the other source domains.
Target domain | WWd (weight-based, dot) | WWc (weight-based, cos) | IId (instance-based, dot) | IIc (instance-based, cos)
---|---|---|---|---
Emails | 81.7 | 81.7 | 81.6 | 81.4 |
Newsgroups | 83.1 | 83.3 | 83.1 | 82.9 |
Reviews | 88.5 | 88.7 | 88.7 | 88.8 |
Weblogs | 81.9 | 80.9 | 80.9 | 81.9 |
Average | 83.8 | 83.7 | 83.6 | 83.8 |
Sensitivity of M for Inference.
In the experiments above, we used all the training sentences as support sentences at test time. What if we reduce the number of support sentences? Here, in the same out-of-domain setting as above, we evaluate the instance-based system with the cosine similarity, IIc, using M support sentences randomly sampled at each time step. Intuitively, with a small number of randomly sampled support sentences (e.g., M = 1), the prediction accuracy would drop. Surprisingly, however, Table 5 shows that the accuracy does not drop even when M is reduced. The same tendency is observed for the other three systems WId, WIc, and IId. One possible reason is that the feature space is learned appropriately: positive edges lie close to each other and far from negative edges, so accuracy does not drop even when a single randomly sampled support sentence is used.
Target domain | M = 1 | M = 10 | M = 100 | ALL
---|---|---|---|---
Emails | 81.5 | 81.4 | 81.5 | 81.5 |
Newsgroups | 82.8 | 83.0 | 82.9 | 82.9 |
Reviews | 88.7 | 88.7 | 88.8 | 88.8 |
Weblogs | 81.8 | 82.1 | 82.0 | 81.9 |
Average | 83.7 | 83.8 | 83.8 | 83.8 |
6.2 Sanity Check for Plausible Explanations
It is an open question how to evaluate the “plausibility” of explanations, that is, whether the instances retrieved as explanations are convincing for humans. As a reasonable compromise, Hanawa et al. (2021) designed the identical subclass test for evaluating plausibility. This test is based on a minimal requirement that interpretable models should at least satisfy: training instances presented as explanations should belong to the same latent (sub)class as the test instance. Consider the examples in Figure 1. The predicted unlabeled edge “wrote novels” in the test sentence has the (unobserved) latent label obj. Two training instances are given as explanations for this edge: the first seems more convincing than the second because “published books” has the same latent label, obj, as “wrote novels,” whereas “novels the” has a different one, det. As these examples show, agreement between the latent classes is likely to correlate with plausibility. Note that this test is not a complete assessment of plausibility, but it works as a sanity check for verifying whether models make obvious violations.
This test can be used for assessing unlabeled parsing models because the (unobserved) relation labels can be regarded as latent subclasses of positive unlabeled edges. We follow three steps: (i) identify unlabeled edges in a development set; (ii) retrieve the nearest training edge for each identified edge; and (iii) calculate LAS, that is, regard a query as correct if its label and that of the retrieved edge are identical.15
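A sketch of this evaluation, assuming edge representations and labels are already available as tensors; the function name and the boolean mask marking correctly identified edges are ours.

```python
import torch
import torch.nn.functional as F

def identical_subclass_las(query_reprs, query_labels,
                           support_reprs, support_labels,
                           query_edge_correct, use_cosine=True):
    """LAS-style score: a query edge counts as correct only if it was
    identified correctly and its nearest support edge has the same label.

    query_edge_correct: (Q,) bool tensor; False where the unlabeled edge
    itself was parsed incorrectly (such queries count as incorrect).
    """
    if use_cosine:
        query_reprs = F.normalize(query_reprs, dim=-1)
        support_reprs = F.normalize(support_reprs, dim=-1)
    nearest = (query_reprs @ support_reprs.T).argmax(dim=1)   # (Q,)
    label_match = support_labels[nearest] == query_labels
    return (label_match & query_edge_correct).float().mean().item()
```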
Table 6 shows LAS on PTB and UD-English. The systems using instance-based inference with the cosine similarity, WIc and IIc, succeeded in retrieving the support training edges with the same label as the queries. Surprisingly, the system IIc achieved over 70% LAS on PTB without label supervision. The results suggest that systems using instance-based inference with the cosine similarity meet the minimal requirement, and the retrieved edges are promising as plausible explanations.
Dataset | WId (weight-based learning, dot) | WIc (weight-based learning, cos) | IId (instance-based learning, dot) | IIc (instance-based learning, cos)
---|---|---|---|---
PTB-English | 1.8 | 67.5 | 7.0 | 71.6 |
UD-English | 16.4 | 51.5 | 3.9 | 54.0 |
To facilitate the intuitive understanding of model behaviors, we show actual examples of the retrieved support edges in Table 7. As the first query-support pair shows, for query edges whose head or dependent is a function word (e.g., if), the training edges with the same (unobserved) label tend to be retrieved. On the other hand, as the second pair shows, for queries whose head is a noun (e.g., appeal), the edges whose head is also a noun (e.g., food) tend to be retrieved regardless of different latent labels.
6.3 Geometric Analysis on Feature Spaces
The identical subclass test suggests a big difference between the feature spaces learned by using the dot product and the cosine similarity. Here we look into them in more detail.
6.3.1 Observation of Nearest Neighbors
First, we look into the training edges retrieved as nearest support edges. Specifically, we use the edges in the UD-English development set as queries and retrieve the top-k similar support edges from the UD-English training set. Table 8 shows the examples retrieved by the WId system. Here, the same support edge, 〈ROOT, find〉, was retrieved for the different queries 〈Jennifer, Anderson〉 and 〈all, after〉. As this indicates, when using the dot product as the similarity function, a small number of specific edges are selected as support edges extremely often, regardless of the query. Such edges are called hubs (Radovanovic et al., 2010). This phenomenon is undesirable for users in terms of the plausible interpretation of predictions: if a system always exhibits the same training instance(s) as rationales for its predictions, users are likely to doubt the system’s validity.
6.3.2 Quantitative Measurement of Hubness
Second, we quantitatively measure the hubness of each system. Specifically, we measure the k-occurrences of an instance x, N_k(x) (Radovanovic et al., 2010; Schnitzer et al., 2012). In our dependency parsing experiments, N_k(x) indicates the number of times a support training edge x occurs among the k nearest neighbors of all the query edges. Support training edges with an extremely high N_k value can be regarded as hubs. In this study, we set k = 10 and measure N_10(x) of unlabeled support training edges. For query edges, we use the UD-English development set, which contains 25,148 edges. For support edges, we use the UD-English training set, which contains 204,585 edges.
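A sketch of this measurement; for the full UD-English support set, the similarity matrix would in practice be computed in chunks, which is omitted here. Names are assumptions.

```python
import torch
import torch.nn.functional as F

def k_occurrences(query_reprs, support_reprs, k=10, use_cosine=True):
    """N_k(x): how many times each support edge appears among the k nearest
    neighbors of the query edges (Radovanovic et al., 2010)."""
    if use_cosine:
        query_reprs = F.normalize(query_reprs, dim=-1)
        support_reprs = F.normalize(support_reprs, dim=-1)
    sims = query_reprs @ support_reprs.T              # (Q, S)
    topk = sims.topk(k, dim=1).indices                # (Q, k)
    n_k = torch.bincount(topk.reshape(-1), minlength=support_reprs.size(0))
    return n_k   # hubs are the support edges with extremely large N_k values
```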
Table 9 shows the support training edges with the highest N_10. In the case of the system WId, the unlabeled support edge 〈ROOT, find〉 appeared 19,407 times in the 10 nearest neighbors of the 25,148 query edges. A similar tendency was observed for the instance-based system using the dot product, IId. By contrast, for the systems using the cosine similarity, WIc and IIc, no specific support edges were retrieved so often. In Figure 2, we plot the N_10 values of the top 100 support training edges. The N_10 distributions of the systems using the dot product, WId and IId, are highly skewed; that is, hubs emerge. This indicates that when using the dot product, a small number of specific support training edges appear in the nearest neighbors very often, regardless of the query edges.
System ID | Similarity | N_10 | Instance
---|---|---|---
WId | dot | 19,407 | 〈ROOT, find〉 |
WIc | cos | 82 | 〈help, time〉 |
IId | dot | 22,493 | 〈said, airlifted〉 |
IIc | cos | 34 | 〈force, Israel〉 |
To sum up, systems that use instance-based inference with the dot product often suffer from hubs and have difficulty retrieving plausible support edges for their predictions. The occurrence of hubs is likely related to the norms of the edge representations, since the L2-normalization applied to the edges in the cosine similarity tends to suppress the occurrence of hubs. We leave a more detailed analysis of the cause of hubs for future work.
7 Conclusion
We have developed instance-based neural dependency parsing systems, each of which consists of our edge identification model and our label classification model (Section 4). We have analyzed them from the perspectives of prediction accuracy and explanation plausibility. The first analysis shows that our instance-based systems achieve accuracy competitive with weight-based neural ones (Section 6.1). The second indicates that our instance-based systems using the cosine similarity (L2-normalization of edge representations) meet the minimal requirement for plausible explanations (Section 6.2). The additional analysis reveals that when using the dot product, hubs emerge and degrade the plausibility (Section 6.3). One interesting future direction is to investigate the cause of hubs in more detail. Another is to use the learned edge representations in downstream tasks, such as semantic textual similarity.
Acknowledgments
The authors are grateful to the anonymous reviewers and the Action Editor, who provided many insightful comments that improved the paper. Special thanks also go to the members of the Tohoku NLP Laboratory for their interesting comments and energetic discussions. The work of H. Ouchi was supported by JSPS KAKENHI grant number 19K20351. The work of J. Suzuki was supported by JST Moonshot R&D grant number JPMJMS2011 (fundamental research) and JSPS KAKENHI grant number 19H04162. The work of S. Yokoi was supported by JST ACT-X grant number JPMJAX200S, Japan. The work of T. Kuribayashi was supported by JSPS KAKENHI grant number 20J22697. The work of M. Yoshikawa was supported by JSPS KAKENHI grant number 20K23314. The work of K. Inui was supported by JST CREST grant number JPMJCR20D2, Japan.
Notes
Our code is publicly available at https://github.com/hiroki13/instance-based-dependency-parsing.
If you would like to know the history of metric learning in more detail, please read Bellet et al. (2013).
Although some previous studies adopt multi-task learning methods for edge identification and classification tasks (Dozat and Manning, 2017; Hashimoto et al., 2017), we independently train a model for each task because the interaction effects produced by multi-task learning make it challenging to analyze models’ behaviors.
While this greedy formulation has no guarantee to produce well-formed trees, we can produce well-formed ones by using the Chu-Liu-Edmonds algorithm in the same way as Zhang et al. (2017). In this work, we would like to focus on the representation for each edge and evaluate the goodness of the learned edge representation one by one. With such a motivation, we adopt the greedy formulation.
In our preliminary experiments, we set τ by selecting a value from {16,32,64,128}. As a result, whichever we chose, the prediction accuracy was stably better than τ = 1.
In our preliminary experiments, we also tried an additive composition and the concatenation of the two vectors. The accuracies by these techniques for unlabeled dependency parsing, however, were both about 20%, which is much inferior to that by the multiplicative composition.
These languages have been selected by considering the perspectives of different language families, different morphological complexity, different training sizes, and domains.
We use the gold tokenized sequences in PTB and UD.
We first conduct subword segmentation for each token of the input sequence. Then, the BERT encoder takes as input the subword-segmented sequences and computes the representation for each subword. Here, we use the (last layer) representation of the first subword within each token as its token representation. For PTB, we use “BERT-Base, Cased.” For UD, we use “BERT-Base, Multilingual Cased.”
As a result, the whole mini-batch size is 32 + 10 = 42.
When U = 1, the whole mini-batch size is 32 query sentences plus one support sentence per label.
The total number of the support edge representations is equal to the size of the label set.
If the identified unlabeled edge itself is incorrect, we regard the query as incorrect.