Abstract
We propose a new, more actionable view of neural network interpretability and data analysis by leveraging the remarkable matching effectiveness of representations derived from deep networks, guided by an approach for class-conditional feature detection. The decomposition of the filter-n-gram interactions of a convolutional neural network (CNN) and a linear layer over a pre-trained deep network yields a strong binary sequence labeler, with flexibility in producing predictions at—and defining loss functions for—varying label granularities, from the fully supervised sequence labeling setting to the challenging zero-shot sequence labeling setting, in which we seek token-level predictions but only have document-level labels for training. From this sequence-labeling layer we derive dense representations of the input that can then be matched to instances from training, or a support set with known labels. Such introspection with inference-time decision rules provides a means, in some settings, of making local updates to the model by altering the labels or instances in the support set without re-training the full model. Finally, we construct a particular K-nearest neighbors (K-NN) model from matched exemplar representations that approximates the original model’s predictions and is at least as effective a predictor with respect to the ground-truth labels. This additionally yields interpretable heuristics at the token level for determining when predictions are less likely to be reliable, and for screening input dissimilar to the support set. In effect, we show that we can transform the deep network into a simple weighting over exemplars and associated labels, yielding an introspectable—and modestly updatable—version of the original model.
1. Introduction
The promise and peril of deep learning in computational linguistics, and AI in general, would seem, on the surface, to be that the strong effectiveness of the large neural networks is unavoidably accompanied by inscrutable model predictions. The models are often right, but when they are wrong, it is difficult to ascertain why, and furthermore, it is typically not obvious how to course-correct a model when errors are discovered, beyond altogether abandoning the model. The non-identifiable (cf., Hwang and Ding 1997; Jain and Wallace 2019) and extraordinarily large number of parameters suggest a lost cause, in general, for tracing model predictions back to particular parameters, and it would seem then that deep networks are of limited use in settings where interpretability is paramount. However, interestingly, and surprisingly, we show that there is nonetheless a sense in which the deep networks can be leveraged to create a notion of actionable interpretability against the data that is not necessarily possible with simpler, less expressive models alone, and may yield precisely the characteristics desired in certain real-world applications. By leveraging the strong pattern matching behavior and the dense representations of the deep networks, we can form a mapping between test instances and training instances with known labels, which enables introspection of the model with respect to the data. In some settings, we can then update the model by updating the data and labels in these mappings. Interestingly, in this way, the application of deep neural networks begins to resemble some of the classic instance-based and metric learning methods from machine learning, as well as the exemplar systems (Clark 1990) from an earlier era of AI, but with less dependence on human-mediated feature engineering, which may prove critical for applications with high-dimensional input, at the very least as tools for data analysis.
A model for analyzing a natural language data set ideally needs some facility for class-conditional feature detection at the word level. However, the compositional, high-dimensional nature of language makes feature detection a challenging endeavor, with further empirical complications arising from the need to label at a granularity that is typically more fine-grained than many existing human-annotated data sets. We propose and demonstrate that a single-layer, one-dimensional, kernel-width-one max-pooled convolutional neural network (CNN) and a linear layer, as the final layer of a network, can be trained for document-level classification, and then decomposed in a straightforward way to produce token-level labels. This particular set of operations over a CNN and a linear layer yields flexibility in learning and predicting at disparate label resolutions and is efficient and simple to calculate and train. It can readily replace the standard final linear layer often used for classification in Transformer models (Vaswani et al. 2017), adding the properties described here. We empirically show across tasks, using data sets that have token-level labels for verification, that it yields surprisingly sharp token-level binary detections even when trained at the document level, when the input to the layer is a large, masked-language-model-trained BERT model (Devlin et al. 2019).
Feature detection in this way is a useful tool for analyzing data sets, detecting rather subtle distributional differences within documents that can be otherwise challenging to find at scale. Further, we show that the CNN filter applications corresponding to the token-level predictions are effective dense representations of the model predictions, with which we can form a mapping between test predictions and instances with known labels. We find qualitatively and quantitatively that the matches correspond to similar features in similar contexts, at least when the distances between representations are low. Finally, without loss of predictive effectiveness, we can altogether replace the model’s output with a simple weighting over exemplar representations, converting the deep network into a K-nearest neighbor (K-NN) model, with concomitant benefits for interpretability, and straightforward heuristics for detecting domain-shifted and out-of-domain data.
In summary, this work contributes the following new approaches:
We present a new, effective model for supervised and zero-shot binary sequence labeling. We evaluate on token-level annotations for grammatical error detection and diff annotations on a sentiment data set, detecting both sentiment features and surprisingly, also subtle re-annotation artifacts.
We propose a method for data and model analysis via dense representation matching, exemplar auditing, enabled by our binary sequence labeling method, creating inference-time decision rules linking feature-level exemplar representations and associated predictions from test with representations from a support set with known labels. We show that in some settings we can make local updates to the model by updating the data and labels in the support set without re-training the full model.
We approximate the model’s token-level output with a K-NN over the support set that is at least as effective as the original model, and can be used as an interpretable substitute for the original model. Incorrect model predictions tend to also be more difficult to approximate; our proposed approach yields simple, understandable heuristics at the token level for determining when predictions are less likely to be reliable, and for screening input unlike that seen in the support set.
We proceed by first introducing the notation for the tasks across label resolutions (Section 2) and the core methods (Section 3) used across all experiments, and then we apply these ideas to three tasks. First, we demonstrate effectiveness on the challenging, well-defined error detection task (Section 4), which enables careful examination of the behavior using available token-level labels. Next, we use sentiment data that has been usefully re-annotated via local changes (Section 5) to further examine updating the support set over domain-shifted data, and to motivate and analyze our approach for constraining out-of-domain data in the context of an existing approach for robust classification. Finally, we also use these sentiment data sets to examine the model’s ability to detect subtle distributional changes across re-annotated and original data (Section 6), discovering features that are not readily detectable at scale without model-based assistance.
2. Tasks
Given a document, which may consist of a single sentence, we seek binary labels over the words in the document. For learning such a model, we may be given training examples with associated labels for each of the “words,”1 which is the standard fully supervised binary sequence labeling setting, or we may only be given document-level labels, which is the zero-shot binary sequence labeling setting. This latter setting corresponds to notions of feature detection for document-level classification models, enabling quantitative evaluation when given token-level labeled held-out data.
Supervised Binary Sequence Labeling.
Specifically, in the standard fully supervised sequence labeling setting, we are given a training data set 𝔻* = {(xd, yd)|1 ≤ d ≤ |𝔻*|} of |𝔻*| documents paired with their corresponding token-level ground-truth labels. Each of N tokens in a document, x = x1, …, xn, …, xN, has a known token-level label, yn ∈ {−1, 1}. We seek a learned mapping, x ↦ ŷ, for predicting the labels for a given document: At inference, we are given a new, previously unseen document instance, x|𝔻*|+1, over which we predict ŷ|𝔻*|+1 = ŷ1, …, ŷn, …, ŷN, the token-level labels for each token in the document. We will subsequently drop the subscript label, “|𝔻*| + 1”, on test-time instances when the distinction from training is otherwise unambiguous. We aim to minimize the distance between the predicted ŷ and the ground-truth y.
Throughout we use * to indicate a data set includes, or a model otherwise has access to, token-level labels. Otherwise, the label signal is limited to the document level, with the exception of clearly indicated reference experiments simply tuning the decision boundary of document-level models with a limited number of token-level labels.
Document-level Binary Classification.
In the standard document-level classification setting, we are given a training data set 𝔻 = {(xd, Yd)|1 ≤ d ≤ |𝔻|} of |𝔻| documents paired with their corresponding document-level ground-truth labels. Token-level labels are not present in 𝔻. At inference, we seek to predict Ŷ given a new, unseen document x, via the learned mapping F : x ↦ Ŷ. We aim for Ŷ to be close to the true document-level label, Y ∈ {−1, 1}.
Zero-shot Binary Sequence Labeling.
The zero-shot binary sequence labeling models have access to the same training data set 𝔻 as in the standard document-level classification task. However, at inference, we then seek to predict the token-level labels, ŷ, for each token in the new document instance x, via a mapping x ↦ ŷ, even though we can only query the document-level labels of 𝔻 during training. In other words, the learning signal is the same for document-level classification and zero-shot sequence labeling, but the inference-time task is the same in the zero-shot sequence labeling and fully supervised sequence labeling settings.
We will be primarily concerned with analyzing the sequence labeling settings. We also report document-level classification results for a subset of the zero-shot sequence labeling models, illustrating how the proposed token-level predictions can be used to analyze and constrain typical text data sets that only have labels at the document level, rather than at finer-grained resolutions, at least at scale.
3. Methods
We propose a new method for class-conditional feature detection from a large, expressive deep network that enables the interlinked view of interpretability, constrained inference, and updatability via an external database introduced in this work. We demonstrate that a particular max-pool attention-style mechanism from a CNN and a linear layer over a deep network enables the following:
We show that we can derive token-level predictions across the full document, f(x1), …, f(xn), …, f(xN), from the document-level prediction, F(x). This decomposition provides flexibility in learning and analyzing at varying label resolutions.
We further show that the token-level predictions can themselves be approximately decomposed via f(xn) ≈ f(xn)KNN, where f(xn)KNN is an explicit weighting over a set of nearest exemplar representations and their associated labels and predictions.
We proceed by first introducing the base document-level classifier (Section 3.1). We then introduce the approach for deriving token-level predictions from the document-level classifier (Section 3.2). We show how this can be used for supervised labeling (Section 3.3); yields flexibility in adding task-specific priors (Section 3.4); and provides a means of aggregate feature extraction for analyzing data sets (Section 3.5). Next, we introduce the approach for mapping a test-time prediction to a database of exemplars by leveraging dense representations coupled with the class-conditional feature detection (Section 3.6), before introducing the K-NN approximations (Section 3.7).2 Figure 1 provides a high-level overview of the approaches further detailed below.
3.1 CNN Binary Classifier Over a Deep Network: Document-Level Predictions
We use a CNN architecture similar to that of Kim (2014) over a pre-trained Transformer model (Devlin et al. 2019) and fine-tuned word embeddings as our document-level classifier, F. Each token xn ∈ x in the document, including padding symbols as necessary, is represented by a D-dimensional vector, tn = (eBERT, eword), the concatenation of the top hidden layer(s) of a Transformer and a vector of word embeddings, D = |eBERT| + |eword|. The convolutional layer is then applied to this ℝD×N matrix, using a filter of width Q, sliding across the dense vectors corresponding to the Q-sized n-grams of the input. The convolution results in a feature map hm ∈ ℝN−Q+1 for each of M total filters.
The base model is trained for document classification with a standard cross-entropy loss. We primarily use a filter width of 1, Q = 1. In experiments with multiple filter widths, we concatenate the output of the max-pooling prior to the fully connected layer.
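To make the architecture concrete, the following is a minimal PyTorch-style sketch of the document-level classifier for the Q = 1 case. The class and argument names are ours, the ReLU nonlinearity is an implementation assumption, and dropout and padding handling are omitted.

```python
import torch
import torch.nn as nn

class UniCNNClassifier(nn.Module):
    def __init__(self, input_dim: int = 4396, num_filters: int = 1000):
        super().__init__()
        # Kernel-width-one CNN (Q = 1): one M-dimensional feature vector per token.
        self.conv = nn.Conv1d(input_dim, num_filters, kernel_size=1)
        # Final linear layer producing two class logits (Y = -1 vs. Y = 1).
        self.fc = nn.Linear(num_filters, 2)

    def forward(self, token_vectors: torch.Tensor) -> torch.Tensor:
        # token_vectors: (batch, N, D), the concatenation of frozen BERT hidden
        # layers and fine-tuned word embeddings for each of the N tokens.
        h = torch.relu(self.conv(token_vectors.transpose(1, 2)))  # (batch, M, N)
        g, _ = h.max(dim=2)                                       # max-pool over token positions
        return self.fc(g)                                         # document-level logits, F(x)
```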
3.2 Zero-Shot Sequence Labeling with a CNN Binary Classifier: From Document-Level Labels to Token-Level Labels
This decomposition then affords considerable flexibility in defining loss constraints to bias the filter weights according to the granularity of the available labels, and/or according to other priors we may have regarding our data.
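As one concrete reading of the decomposition (continuing the classifier sketch in Section 3.1): with Q = 1, each max-pooled feature originates at exactly one token position, so the document-level prediction can be re-attributed to tokens by routing each filter's contribution through the linear layer back to the position that produced its max. The sign convention (positive-class logit minus negative-class logit) and the omission of the bias terms are our assumptions for illustration, not a verbatim reproduction of the authors' code.

```python
import torch

def token_scores(model: UniCNNClassifier, token_vectors: torch.Tensor) -> torch.Tensor:
    # Re-attribute the document-level prediction to tokens: each filter's
    # max-pooled value is sent back to the token position that produced it.
    h = torch.relu(model.conv(token_vectors.transpose(1, 2)))  # (batch, M, N)
    g, argmax_pos = h.max(dim=2)                               # winning position per filter
    # Per-filter contribution to the positive-minus-negative class margin.
    w_diff = model.fc.weight[1] - model.fc.weight[0]           # (M,)
    contrib = w_diff.unsqueeze(0) * g                          # (batch, M)
    scores = contrib.new_zeros(token_vectors.size(0), token_vectors.size(1))
    scores.scatter_add_(1, argmax_pos, contrib)                # f(x_1), ..., f(x_N)
    return scores  # positive values indicate evidence for the positive class at that token
```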
3.3 Supervised Sequence Labeling
For inference, token-level detection labels are determined in the same manner as in the zero-shot setting.
3.4 Task-Specific Zero-Shot Loss Constraints: Min-Max
The intuition is to encourage correct sentences to have aggregated token contributions less than zero (i.e., no detected errors), and to encourage sentences with errors to have at least one token contribution less than zero and at least one greater than zero (i.e., to encourage even incorrect sentences to have one or more correct tokens, since errors are, in general, relatively rare).
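One hinge-style formalization of this intuition, per sentence with token scores f(x_1), …, f(x_N), is sketched below; this is our illustrative reading of the description, and the exact functional form of the loss may differ.

```python
import torch

def min_max_loss(token_scores: torch.Tensor, doc_label: int) -> torch.Tensor:
    # token_scores: (N,) token-level contributions f(x_n) for one sentence.
    max_score, min_score = token_scores.max(), token_scores.min()
    if doc_label == -1:
        # Correct sentence: no token should receive a positive (error) score.
        return torch.relu(max_score)
    # Sentence with errors: at least one positive and at least one negative token score.
    return torch.relu(-max_score) + torch.relu(min_score)
```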
3.5 Aggregate, Comparative Feature Extraction
With the true document-level labels, we can then identify the n-grams and documents most salient for each class under this metric, and just as importantly for many applications, the n-grams and documents that the model misclassifies.
3.6 Exemplar Auditing: Inference-Time Decision Rules and Data/Model Introspection via Dense Representation Matching
We can view each token-level prediction, f(xn) = u(v(en)), as the composition f = u ∘ v, where v : en ∈ ℝD ↦ rn ∈ ℝM and u : rn ∈ ℝM ↦ f(xn) ∈ ℝ. The mapping v takes as input the word embeddings and hidden layers of the deep network corresponding to the particular token and produces a dense representation, a distilled summarization of the expressive deep network at the local level which we refer to as an exemplar representation, derived from the CNN filter applications corresponding to the token.4
This connection enables inference-time decision rules with which we can inspect and constrain predictions, which we refer to as exemplar auditing. We will use the label ExAG for the rule in which positive token-level predictions are only admitted when the token-level prediction of the corresponding exemplar token from the support set matches that of the test token, and the exemplar’s document has a positive ground-truth label: f(xn) > 0 ∧ f(xñ) > 0 ∧ Y(ñ) = 1. Similarly, we use the label ExAT when token-level ground-truth labels are available in the support set: f(xn) > 0 ∧ f(xñ) > 0 ∧ yñ = 1. In this way, updates to the support set can be a means of making local updates to the model without modifying the parameters of the original model, including in some cases for domain-shifted data over which the original model is otherwise a weak predictor, provided the dense representations yield adequate matching effectiveness across the new domain. The distances to the matches can also be used for constraining predictions, which we consider in the context of the K-NN approximations described next.
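A minimal sketch of the ExAG rule follows, taking the exemplar representation r_n to be the vector of CNN filter applications at token n and the match to be the Euclidean nearest neighbor over a precomputed support-set matrix; the function and argument names are ours.

```python
import torch

def exag_admit(test_score: float,
               r_test: torch.Tensor,              # (M,) exemplar vector of the test token
               support_r: torch.Tensor,           # (S, M) exemplar vectors of the support set
               support_scores: torch.Tensor,      # (S,) token-level predictions over the support set
               support_doc_labels: torch.Tensor   # (S,) document-level labels Y ∈ {-1, 1}
               ) -> bool:
    # Nearest exemplar ñ in the support set by L2 distance.
    n_tilde = torch.cdist(r_test.unsqueeze(0), support_r).argmin().item()
    # ExAG: f(x_n) > 0 ∧ f(x_ñ) > 0 ∧ Y(ñ) = 1
    return bool(test_score > 0 and support_scores[n_tilde] > 0 and support_doc_labels[n_tilde] == 1)
```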
3.7 K-NN Model Over Exemplar Representations
The inference-time decision rules are appealing, as once a dense search infrastructure is in place, they are easy to implement and easy for end-users and auditors to understand: If a prediction does not resemble that of its nearest matched exemplar, as indicated by a large distance and/or by discrepancies between labels and predictions, reject the prediction and send the decision to a human for adjudication. Additionally, because the original model’s output is used for non-rejected predictions, the prediction effectiveness on non-rejected predictions is guaranteed to be the same as that of the original model. However, in some settings where explainability is paramount, we may require the stronger sense of fully describing a prediction as a weighting over exemplars from the support set. Interestingly, we show that we can construct a K-NN from a simple transformation of the predictions and class labels of the nearest K exemplars that closely matches the sign directions of the original prediction and is at least as strong a predictor on the metrics over the ground truth.
We consider one primary formulation and two additional variations for further analysis. We aim to keep the number of parameters to a minimum: to avoid over-fitting; because our goal is simply to reproduce the sign of the original prediction, rather than to construct a significantly larger or more expressive model; and because we seek a weighting that is easily inspectable by an end-user.
The three considered variations differ in their particular formulation of wk, detailed below, but in all cases ∑ wk = 1, wk ∈ [0, 1]. We take f(x(k)) to mean the token-level prediction of the kth nearest exemplar in the support set, and Y(k) ∈ {−1, 1} as the document-level label associated with the document to which the kth exemplar belongs in the support set. When token-level labels are available, as with the fully supervised setting, we replace Y(k) with yk ∈ {−1, 1}, the ground-truth token-level label associated with the kth exemplar. The γ ⋅ Y(k) term is in effect a class-specific bias offset given the matched document, and the γ ⋅ yk variation directly balances the signal from the true token-level label and the prediction. The predictions and exemplar matchings are at the token level, but importantly r is a representation of the token that encodes contextual dependencies over the full input, as a result of the deep network.
3.7.1 Distance-Weighted K-NN (KNNDIST.).
3.7.2 Constraint-Weighted K-NN (KNNCONST.).
3.7.3 Equally Weighted K-NN (KNNEQUAL).
Finally, we consider wk = 1/K. An advantage of this approach is that it requires learning and interpreting only two parameters, γ and β; it is just a simple transformation of the nearest exemplar predictions and associated labels. A disadvantage is that even relatively far exemplars will play an equal role in the final K-NN prediction. In this way, an interpretation of the model is obligated to equally consider even the farthest exemplars, which requires an end-user to examine the full set of size K, some members of which may have near-zero weights in the above alternatives that explicitly enforce a ranking. For comparison purposes, we train this version via gradient descent with 𝓛KNN, as with KNNDIST. above.
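As a sketch of the equally weighted variant, assume the K-NN output combines the matched predictions and labels as f(x_n)_KNN = Σ_k w_k (β · f(x_(k)) + γ · Y_(k)) with w_k = 1/K; this particular placement of γ and β is our assumption, consistent with, but not spelled out verbatim in, the description above.

```python
import torch

def knn_equal(exemplar_scores: torch.Tensor,  # (K,) predictions f(x_(k)) of the K nearest exemplars
              exemplar_labels: torch.Tensor,  # (K,) labels Y_(k) ∈ {-1, 1} (or y_k when available)
              gamma: torch.Tensor,            # learned class-offset scale
              beta: torch.Tensor              # learned prediction scale
              ) -> torch.Tensor:
    K = exemplar_scores.numel()
    w = torch.full((K,), 1.0 / K)             # equal weights: sum to 1, each in [0, 1]
    return torch.sum(w * (beta * exemplar_scores + gamma * exemplar_labels))
```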
4. Grammatical Error Detection
The task of grammatical error detection is to detect the presence or absence of grammatical errors in a sentence6 at the token level.
4.1 Grammatical Error Detection: Experiments
We evaluate detection in both the zero-shot and fully supervised sequence labeling settings, comparing the behavior of the proposed sequence labeling layer to previous models, as well as investigating the behavior of the inference-time decision rules and the K-NN approximations.
4.1.1 Data: FCE.
We follow past work on error detection and use the standard training, dev, and test splits of the publicly released subset of the First Certificate in English (FCE) data set (Yannakoudakis, Briscoe, and Medlock 2011; Rei and Yannakoudakis 2016),7 consisting of 28.7k, 2.2k, and 2.7k labeled sentences, respectively.
4.1.2 Data: Domain-Shifted News Data.
In a real deployment, we might reasonably expect an error detection model to encounter well-formed, correct documents from another domain, over which we would want the model to be robust to false positives. To emulate this scenario, we also consider a series of experiments in which we augment the FCE data set with sentences from the news-oriented One Billion Word Benchmark data set (Chelba et al. 2014), which are assigned negative class (Y = −1) sentence-level labels. We augment the FCE training set with a sample of 50,000 sentences (FCE+news50k) and add a disjoint sample of 2,000 sentences to the FCE test set for evaluation (FCE+news2k).
4.1.3 Models.
uniCNN+BERT Model.
Our primary model uses a filter width of 1 with 1,000 filter maps, Q = 1, M = 1,000. The CNN layer takes as input, for each token, the top four hidden layers of the large, pre-trained Bidirectional Encoder Representations from Transformers (BERTLARGE) model of Devlin et al. (2019), a multilayer bidirectional Transformer (Vaswani et al. 2017), concatenated with the pre-trained Word2Vec word embeddings of Mikolov et al. (2013), D = 4,396. The BERT model is pre-trained with masked-language modeling and next-sentence prediction objectives on approximately 3.3 billion words of unlabeled data. BERT’s contextualized embeddings are capable of modeling dependencies between words and position information. The CNN can be viewed as summarizing the signal from this deep network for the fine-tuned task. We use the pre-trained, 340-million-parameter BERTLARGE model with case-preserving WordPiece (Wu et al. 2016) tokenization.8 In our experiments, we fine-tune the 300-dimensional word embeddings jointly with the CNN parameters, while the parameters of the BERTLARGE model remain fixed. The BERT model takes as input WordPiece tokens, using its full vocabulary, and we limit the vocabulary size to 7,500 only for the fine-tuned word embeddings. Prior to evaluation, to maintain alignment with the original tokenization and labels, the WordPiece tokenization is reversed (i.e., de-tokenized), with positive/negative token contribution scores averaged over fragments for original tokens split into separate WordPieces. We also consider fine-tuning the trained uniCNN+BERT model with the min-max loss, which we label uniCNN+BERT+mm.
Our model only adds approximately 2% more parameters than BERTLARGE alone. With Q = 1, the CNN consists of the kernel weights and biases, M ⋅ D + M, and the linear layer consists of 2 ⋅ M + 2 parameters, which includes the 2 bias terms. The word embeddings contribute 300 ⋅ (7,500 + 2) parameters, which includes 2 additional placeholder symbols we use in practice for padding and out-of-vocabulary input tokens. For uniCNN+BERT, this results in around 6.6 million parameters added to the 340 million parameters of BERTLARGE.
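The arithmetic behind this count, using the values stated above, can be verified directly:

```python
M, D, V, E = 1000, 4396, 7500, 300          # filters, input dim, fine-tuned vocab size, embedding dim
cnn_params = M * D + M                      # kernel weights + biases:          4,397,000
linear_params = 2 * M + 2                   # two-class linear layer:               2,002
embedding_params = E * (V + 2)              # vocab + padding/OOV symbols:      2,250,600
print(cnn_params + linear_params + embedding_params)  # 6,649,602 ≈ 6.6 million added parameters
```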
Reference Models.
We also include a reference base model, cnn, with filter widths of 3, 4, and 5, with 100 filter maps each, fine-tuning 300 dimensional GloVe embeddings (Pennington, Socher, and Manning 2014), with a vocabulary of size 7,500, comparable to early work on zero-shot detection with lower parameter models. We additionally consider a model, CNN+BERT, similar to the primary uniCNN+BERT model, which uses Word2Vec word embeddings for consistency with the past supervised detection work of Rei and Yannakoudakis (2016), but with Q and M identical to cnn.
Optimization and Tuning.
For our zero-shot detection models, cnn, CNN+BERT, and uniCNN+BERT, we optimize for sentence-level classification, choosing the training epoch with the highest sentence-level F1 score on the dev set, without regard to token-level labels. These models do not have access to token-level labels for training or tuning.
We set aside 1k token-labeled sentences from the dev set to tune the token-level F0.5 score for comparison purposes for the experiments labeled CNN+BERT+1k and uniCNN+BERT+1k.
uniCNN+BERT+S* Model.
We also fine-tune a model with token-level labels, uniCNN+BERT+S*, with weights initialized with those of the uniCNN+BERT model trained for binary sentence-level classification. For calculating the loss at training, we assign each WordPiece to have the detection label of its original corresponding token, with the loss of a mini-batch averaged across all of the WordPieces. Inference is performed as in the zero-shot setting.
All models use dropout, with a probability of 0.5, applied on the output of the max-pooling operation, and we train with Adadelta (Zeiler 2012) with a batch size of 50.
4.1.4 Exemplar Auditing Decision Rules.
In-Domain Data.
For each of the uniCNN models, we also evaluate using the inference-time decision rules of Section 3.6, which we indicate with +ExAG and +ExAT appended to the model labels. The Euclidean distances are calculated at the word level of the original sentences, where we average the exemplar vectors when a word is split across multiple WordPiece tokens.
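A minimal sketch of this alignment step is shown below; word_ids, mapping each WordPiece to its original word index, is an assumed input.

```python
import torch

def average_over_wordpieces(piece_vectors: torch.Tensor, word_ids: list) -> torch.Tensor:
    # piece_vectors: (num_wordpieces, M) exemplar vectors at the WordPiece level.
    # word_ids[i]: index of the original word to which WordPiece i belongs.
    num_words = max(word_ids) + 1
    out = torch.zeros(num_words, piece_vectors.size(1))
    counts = torch.zeros(num_words, 1)
    for i, w in enumerate(word_ids):
        out[w] += piece_vectors[i]
        counts[w] += 1
    return out / counts  # word-level exemplar vectors for Euclidean matching
```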
Expanded Database with Domain-shifted Data.
We also consider adding the FCE+news50k data to the support set, and evaluating on the augmented FCE+news2k test set. For reference, we also train the primary zero-shot models using the FCE+news50k data, for which we use the labels uniCNN+BERT+news50k and uniCNN+BERT+mm+news50k.
4.1.5 K-NN Approximations.
We train each of the 3 proposed K-NN approximations on the held-out KNN dev set to minimize δKNN, for up to 40 epochs, only using the predictions from the original models, rather than ground-truth labels. Only for the fully supervised model, uniCNN+BERT+S*, do we then subsequently use token-level labels to tune the decision boundary, as with that original model. We add the labels of Section 3.7 as suffixes to the original models to indicate the type of K-NN used, +K8NNDIST., +K8NNCONST., +K8NNEQUAL, with the subscript indicating K = 8. We chose K = 8 on the held-out dev set based on minimizing δKNN with the uniCNN+BERT+mm model with K ∈ {1, 3, 5, 8, 25}. The approximations are only marginally better with K = 25 for some of the models, so we hold K = 8 constant for comparison purposes, and since smaller values of K are preferable for interpretability, ceteris paribus. For reference, we also include results with K1NNEQUAL, which only considers the nearest match.
Constraints for Domain-shifted Data.
We also demonstrate constraining the output based on the maximum allowed distance to the nearest match in the support set, among matches for which the K-NN prediction equals that of the sentence-level label of the nearest match, and/or limited to minimum output magnitudes of the K-NN. We determine these constraints on the KNN dev set, based on δKNN, determined without access to token-level labels; for simplicity, we use the mean values among correct approximations. We examine this with weak models over the FCE+news2k domain-shifted test set that only have the FCE training set in the support set, investigating whether we can nonetheless identify subsets with strong effectiveness. This is a challenging but very practical setting, as in real deployments, the input data will often diverge from what we have seen in training. Such constraints serve as heuristics, tied to the model itself, for determining when to refrain from predicting, as is critical in higher-risk settings.9
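The following sketch illustrates one way such constraints could be derived and applied, using the class-wise mean distance and mean K-NN output among correct approximations on the KNN dev set; the exact aggregation and admission logic are our reading of the description above.

```python
import numpy as np

def class_thresholds(dists, knn_outputs, model_preds, knn_preds):
    # Among dev tokens where the K-NN prediction matches the original model
    # ("correct approximations"), take the per-class mean distance and mean output.
    dists, knn_outputs = np.asarray(dists), np.asarray(knn_outputs)
    model_preds, knn_preds = np.asarray(model_preds), np.asarray(knn_preds)
    agree = knn_preds == model_preds
    return {cls: (dists[agree & (knn_preds == cls)].mean(),
                  knn_outputs[agree & (knn_preds == cls)].mean())
            for cls in (-1, 1)}

def admit(dist, knn_output, pred_class, thresholds):
    max_dist, output_threshold = thresholds[pred_class]
    # Require a sufficiently close match and a sufficiently strong (signed) K-NN output.
    strong_enough = knn_output <= output_threshold if pred_class == -1 else knn_output >= output_threshold
    return dist <= max_dist and strong_enough
```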
4.1.6 Previous Approaches and Baselines.
Previous Zero-shot Sequence Models.
Recent work has approached zero-shot error detection by modifying and analyzing bidirectional LSTM taggers, which have been shown to work comparatively well on the task in the supervised setting. Rei and Søgaard (2018) adds a soft-attention mechanism to a bidirectional LSTM tagger, training with additional loss functions to encourage the attention weights to yield more accurate token-level labels (LSTM-ATTN-SW). Previous work also considered a gradient-based approach to analyze this same model (LSTM-ATTN-BP) and the model without the attention mechanism (LSTM-LAST-BP), by fitting a parametric Gaussian model to the distribution of magnitudes of the gradients of the word representations.
Previous Supervised Sequence Models.
For comparison, we include recent fully supervised sequence models. Rei and Yannakoudakis (2016) compares various word-based neural sequence models, finding that a word-based bidirectional LSTM model was the most effective (LSTM-BASE+S*). Rei and Søgaard (2018) compares against a bidirectional LSTM tagger with character representations concatenated with word embeddings (LSTM+S*). The model of Rei (2017) extends this with an auxiliary language modeling objective (LSTM+LM+S*). This model is further enhanced with a character-level language modeling objective and supervised attention mechanisms in Rei and Søgaard (2019) (LSTM+JOINT+S*). Bell, Yannakoudakis, and Rei (2019) consider BERT embeddings with the LSTM+LM+S* model, establishing a new state-of-the-art for the supervised setting, using a frozen BERTBASE model (LSTM+LM+BERTBASE+S*), and also providing results with a BERTLARGE model (LSTM+LM+BERT+S*).
Additional Baselines.
For reference, we also provide a Random baseline, which classifies based on a fair coin flip, and a MajorityClass baseline, which in this case always chooses the positive (“error detected”) class.
4.2 Grammatical Error Detection: Results
4.2.1 Zero-shot Results.
Table 1 contains the main results with the models only given access to sentence-level labels, as well as LSTM+S* for reference, using F1, as in previous zero-shot work. The task is very challenging, in general, with some baselines falling below random at the token level. The cnn model has an F1 score similar to that of LSTM-ATTN-SW, and is stronger than the back-propagation-based approaches of LSTM-ATTN-BP and LSTM-LAST-BP. This is important, as it suggests the decomposition used with the basic cnn model, which amounts to a very lightweight attention mechanism, has an inductive bias suitable for such local detections, while being trivial to break apart into representative dense vectors of the input, enabling our analysis and interpretability methods. This is further confirmed when adding the pre-trained contextualized embeddings from BERT; remarkably, as a point of reference, these models exceed basic supervised LSTM models that use pre-trained word embeddings. In Table 2, evaluated with F0.5, the typical metric for supervised grammatical error detection, used under the assumption that end users prefer higher-precision systems, the uniCNN+BERT model exceeds the fully supervised LSTM-BASE+S* model, which was the state-of-the-art model on the task as recently as 2016.
| Model | Sent. F1 | Token-level P | Token-level R | Token-level F1 |
|---|---|---|---|---|
| LSTM+S* | – | 49.15 | 26.96 | 34.76 |
| Random | 58.30 | 15.30 | 50.07 | 23.44 |
| MajorityClass | 80.88 | 15.20 | 100.00 | 26.39 |
| LSTM-LAST-BP | 85.10 | 29.49 | 16.07 | 20.80 |
| LSTM-ATTN-BP | 85.14 | 27.62 | 17.81 | 21.65 |
| LSTM-ATTN-SW | 85.14 | 28.04 | 29.91 | 28.27 |
| cnn | 84.24 | 20.43 | 50.75 | 29.13 |
| CNN+BERT | 86.35 | 26.76 | 61.82 | 37.36 |
| uniCNN+BERT | 86.28 | 47.67 | 36.70 | 41.47 |
| Model | Token-level P | Token-level R | Token-level F0.5 |
|---|---|---|---|
| LSTM+JOINT+S* | 65.53 | 28.61 | 52.07 |
| LSTM+LM+S* | 58.88 | 28.92 | 48.48 |
| LSTM-BASE+S* | 46.1 | 28.5 | 41.1 |
| LSTM+LM+BERTBASE+S* | 64.96 | 38.89 | 57.28 |
| LSTM+LM+BERT+S* | 64.51 | 38.79 | 56.96 |
| uniCNN+BERT+S* | 75.00 | 31.40 | 58.70 |
| CNN+BERT+1k | 47.11 | 28.83 | 41.81 |
| uniCNN+BERT+1k | 63.89 | 23.27 | 47.36 |
| uniCNN+BERT | 47.67 | 36.70 | 44.98 |
| uniCNN+BERT+mm | 54.87 | 29.10 | 46.62 |
Fine-tuning the zero-shot model uniCNN+BERT with the min-max loss constraint (uniCNN+BERT+mm) has the effect of increasing precision and decreasing recall, as seen in Table 2. This results in a modest increase in F0.5, but also a decrease in F1 to 38.04. Whether or not this is a desirable tradeoff depends on the particular use case, but illustrates biasing the detections via task-specific constraints in the absence of token-level labels.
The inductive bias of the architecture is important for token-level detections: Models with similar sentence-level classification results can have significantly different token-level results. For example, CNN+BERT and uniCNN+BERT have similar sentence-level F1 scores of around 86, despite differing token-level effectiveness, and the LSTM baselines all exhibit similar sentence-level F1 scores yet have significantly different token-level scores. As such, attention-style approaches are useful, but not sufficient, for analyzing model predictions over the non-identifiable parameters of deep models, further justifying the need for the proposed methods establishing auditable mappings to the support set.
4.2.2 Supervised and Dev-set-tuned Results.
Table 2 also compares dev-set-tuned and fully supervised models. For illustrative purposes, CNN+BERT+1k and uniCNN+BERT+1k are given access to 1,000 token-labeled sentences to tune a single parameter, an offset on the decision boundary, for each model. This yields modest gains for both models, but interestingly, the uniCNN+BERT, in particular, already has a strong F0.5 score without modification of the decision boundary in the true zero-shot setting.
The uniCNN+BERT+S* model is a strong supervised sequence labeler. As seen in Table 2, it is nominally stronger than the current state-of-the-art models recently presented in Bell, Yannakoudakis, and Rei (2019). This is critical, as it suggests we can forgo more complicated, expressive final layers, and instead use our proposed CNN and linear decomposition to, in effect, summarize the signal from the deep network, from which it is then straightforward to yield representations for matching, as analyzed next.
4.2.3 Inference-time Decision Rules and K-NN Approximations.
In-domain Data.
Table 3 shows the proposed exemplar auditing decision rules and the K-NN approximations on in-domain data across models. Compared with the results in Table 2, the ExAG rule increases precision. In practice, matches tend to correspond to similar contexts, at least when the distance to the nearest exemplar in the support set is low, as shown in the examples in Appendix B. Further, the F0.5 scores suggest that with K = 8, the distance-weighted K-NNs (KNNDIST.) are sufficient for replacing the original models’ predictions: The zero-shot K-NNs are nominally stronger than the corresponding original models, and the supervised version has the same effectiveness as the original for all practical purposes (± 1 point). Note, too, that the precision vs. recall patterns for uniCNN+BERT+mm+K8NNDIST. vs. uniCNN+BERT+K8NNDIST. parallel those of uniCNN+BERT+mm vs. uniCNN+BERT, reflecting that the approximations are reasonably similar to the original models’ predictions, especially over the subset of data for which the original models’ predictions are correct, as discussed below.
| Model | Token-level P | Token-level R | Token-level F0.5 |
|---|---|---|---|
| uniCNN+BERT+S*+ExAG | 85.17 | 21.86 | 53.93 |
| uniCNN+BERT+S*+K1NNEQUAL | 72.64 | 25.52 | 53.05 |
| uniCNN+BERT+S*+K8NNDIST. | 71.91 | 32.24 | 57.71 |
| uniCNN+BERT+ExAG | 56.79 | 26.74 | 46.37 |
| uniCNN+BERT+K1NNEQUAL | 47.23 | 32.01 | 43.13 |
| uniCNN+BERT+K8NNDIST. | 51.19 | 35.53 | 47.04 |
| uniCNN+BERT+mm+ExAG | 63.88 | 20.03 | 44.43 |
| uniCNN+BERT+mm+K1NNEQUAL | 60.76 | 21.17 | 44.23 |
| uniCNN+BERT+mm+K8NNDIST. | 62.06 | 25.38 | 48.14 |
We further examine the K-NN behavior on the held-out dev set in Table 4. We find that with K = 8, across models, each of the proposed K-NN formulations can be trained to be roughly similar in approximation effectiveness, and when we reveal the true labels, there is not a clear winner. In this way, the modeling choice shifts to other aspects of the model: The relative distances within the top-K appear not to be critical on this data set and can be replaced with constant learned weights with KNNCONST.; however, that comes at the expense of additional parameters and is harder to train due to sensitivity to parameter initialization. The simplicity of KNNEQUAL is appealing, but KNNDIST. provides an explicit ranking over the exemplars with the addition of just a single learned parameter, so we take it as our primary model.
| Model | True Labels (ŷKNN = y): F0.5 | Model Approx. (ŷKNN = ŷ): Accuracy | Model Approx. (ŷKNN = ŷ): F0.5 |
|---|---|---|---|
| uniCNN+BERT+S*+K1NNEQUAL | 56.5 | 96.5 | 72.5 |
| uniCNN+BERT+S*+K8NNEQUAL | 58.1 | 96.9 | 75.9 |
| uniCNN+BERT+S*+K8NNCONST. | 60.0 | 97.0 | 75.8 |
| uniCNN+BERT+S*+K8NNDIST. | 59.4 | 97.0 | 75.9 |
| uniCNN+BERT+K1NNEQUAL | 45.1 | 92.8 | 69.1 |
| uniCNN+BERT+K8NNEQUAL | 50.5 | 94.2 | 78.0 |
| uniCNN+BERT+K8NNCONST. | 47.5 | 94.2 | 75.5 |
| uniCNN+BERT+K8NNDIST. | 48.1 | 94.3 | 76.4 |
| uniCNN+BERT+mm+K1NNEQUAL | 47.7 | 95.8 | 72.4 |
| uniCNN+BERT+mm+K8NNEQUAL | 52.3 | 96.4 | 76.9 |
| uniCNN+BERT+mm+K8NNCONST. | 53.1 | 96.4 | 75.9 |
| uniCNN+BERT+mm+K8NNDIST. | 52.9 | 96.5 | 76.9 |
| uniCNN+BERT+S* (original model, ŷ = y) | 59.5 | – | – |
| uniCNN+BERT (original model, ŷ = y) | 44.9 | – | – |
| uniCNN+BERT+mm (original model, ŷ = y) | 49.6 | – | – |
As shown in Figure 2, across both classes and all models, the approximation effectiveness and the K-NN’s prediction effectiveness increase as the magnitude of the K-NN’s output increases. This reflects a more general pattern: When the original model and/or K-NN produce incorrect predictions, the original model and the K-NN are more likely to produce different predictions. Put another way, difficult instances to predict also tend to be difficult instances over which to approximate the model, which we can exploit as a heuristic to abstain from predicting, discussed below.
Domain-shifted Data.
Table 5 considers the more challenging setting in which the FCE test set has been augmented with 2,000 already correct sentences in the news domain. Simply applying the uniCNN+BERT+mm model to this modified test set produces a large number of false positives on the already correct data, yielding an F0.5 of 25.76 (cf. the F0.5 score of 46.62 on the original test set, as shown in Table 2), and similarly for the other models, including the fully supervised one. Simply training with the domain-shifted data, as with uniCNN+BERT+news50k, still results in low effectiveness for the zero-shot models, presumably owing to the class imbalance. Furthermore, the F0.5 score of the uniCNN+BERT+news50k model on the original FCE test set (a result not shown in the tables) is 39.57, which is lower than the 44.98 of uniCNN+BERT, the equivalent model trained only with the original FCE set (Table 2).
| Model | Training | 𝕊 (support set) | P | R | F0.5 |
|---|---|---|---|---|---|
| uniCNN+BERT+S* | F* | – | 43.44 | 31.42 | 40.35 |
| uniCNN+BERT+S*+ExAT | F* | F* | 59.23 | 21.02 | 43.43 |
| uniCNN+BERT+S*+ExAT | F* | F*+50k* | 83.31 | 18.92 | 49.57 |
| uniCNN+BERT+S*+K8NNDIST. | F* | F* | 43.98 | 32.23 | 40.99 |
| uniCNN+BERT+S*+K8NNDIST. | F* | F*+50k* | 65.39 | 29.58 | 52.64 |
| uniCNN+BERT+news50k | F+50k | – | 26.64 | 40.13 | 28.56 |
| uniCNN+BERT+news50k+ExAG | F+50k | F+50k | 47.10 | 26.55 | 40.79 |
| uniCNN+BERT+mm+news50k | F+50k | – | 61.80 | 11.67 | 33.25 |
| uniCNN+BERT+mm+news50k+ExAG | F+50k | F+50k | 68.89 | 6.39 | 23.31 |
| uniCNN+BERT | F | – | 21.84 | 36.65 | 23.76 |
| uniCNN+BERT+ExAG | F | F | 29.19 | 26.74 | 28.66 |
| uniCNN+BERT+ExAG | F | F+50k | 56.39 | 23.52 | 44.07 |
| uniCNN+BERT+ExAT | F | F*+50k* | 75.98 | 18.51 | 46.87 |
| uniCNN+BERT+K8NNDIST. | F | F | 24.65 | 35.54 | 26.26 |
| uniCNN+BERT+K8NNDIST. | F | F+50k | 43.64 | 30.91 | 40.32 |
| uniCNN+BERT+mm | F | – | 25.04 | 29.10 | 25.76 |
| uniCNN+BERT+mm+ExAG | F | F | 31.29 | 20.03 | 28.13 |
| uniCNN+BERT+mm+ExAG | F | F+50k | 65.08 | 17.62 | 42.30 |
| uniCNN+BERT+mm+ExAT | F | F*+50k* | 78.16 | 14.53 | 41.66 |
| uniCNN+BERT+mm+K8NNDIST. | F | F | 27.41 | 25.38 | 26.98 |
| uniCNN+BERT+mm+K8NNDIST. | F | F+50k | 64.48 | 21.71 | 46.26 |
However, when we update the support set with the domain-shifted data, in conjunction with the decision rules or the K-NN approximations, the F0.5 scores jump significantly across models. The models are generally weak predictors over the domain-shifted data, but the improved scores reflect the capacity of the representations to match to the new data, and by extension, the associated labels. This mechanism opens the potential to update the model locally without a full re-training.
Matching to the support set in this way can improve effectiveness over domain-shifted data, but of course, it also requires such data to be in the support set prior to inference. In practice, it may be advisable to include as much data in the support set as computationally feasible, refraining from predicting for matches to unlabeled data, as applicable. In higher-risk settings, we can also constrain predictions based on the L2 distance to the nearest match and the magnitude of the K-NN output, as demonstrated in Table 6 on the FCE+news2k test set. These constraints limit predictions to reliable subsets, even for these models that are weak predictors over the full set. These heuristics are interpretable in that the matched distance can be compared to that of other instances, and the K-NN output is a bounded value that is an explicit weighting over instances with known labels, tracking prediction reliability at least as well as the magnitude of the token-level output of the original model (Figure 3).
| Model | F0.5 | L2 distance max constraint (Class −1, Class 1) | Output min threshold (Class −1, Class 1) | Admitted n | n/N |
|---|---|---|---|---|---|
| uniCNN+BERT+S*+K8NNDIST. | 42.5 | – | – | 92,597 | 1.0 |
| uniCNN+BERT+S*+K8NNDIST. | 62.5 | – | (−1.6, 1.3) | 53,396 | 0.58 |
| uniCNN+BERT+S*+K8NNDIST. | 67.5 | (25.3, 38.9) | – | 7,896 | 0.09 |
| uniCNN+BERT+S*+K8NNDIST. | 86.9 | (25.3, 38.9) | (−1.6, 1.3) | 4,219 | 0.05 |
| uniCNN+BERT+K8NNDIST. | 26.3 | – | – | 92,597 | 1.0 |
| uniCNN+BERT+K8NNDIST. | 46.5 | – | (−0.8, 0.7) | 40,691 | 0.44 |
| uniCNN+BERT+K8NNDIST. | 42.6 | (31.0, 47.6) | – | 8,779 | 0.09 |
| uniCNN+BERT+K8NNDIST. | 67.4 | (31.0, 47.6) | (−0.8, 0.7) | 4,388 | 0.05 |
| uniCNN+BERT+mm+K8NNDIST. | 27.0 | – | – | 92,597 | 1.0 |
| uniCNN+BERT+mm+K8NNDIST. | 45.9 | – | (−1.2, 0.8) | 38,110 | 0.41 |
| uniCNN+BERT+mm+K8NNDIST. | 53.5 | (34.2, 53.3) | – | 7,879 | 0.09 |
| uniCNN+BERT+mm+K8NNDIST. | 75.8 | (34.2, 53.3) | (−1.2, 0.8) | 4,180 | 0.05 |
4.3 Grammatical Error Detection: Discussion
The baseline expectations for zero-shot grammatical error detection models are low given the difficulty of the supervised case. It is therefore relatively surprising that a model such as uniCNN+BERT, when given only sentence-level labels, can yield a reasonably decent sequence model that is in the ballpark of some recent—even if lower parameter—fully supervised models. The inductive bias of the proposed method over a strong deep network is effective for such class-conditional detection, as well as supervised labeling. The approach additionally enables dense representation matching against a support set with known labels, with both inference-time decision rules and particular K-NN approximations. In this way, we gain the ability to make updates to a model without re-training; to constrain predictions based on interpretable heuristics; and more generally, to recast the otherwise black-box predictions of the network as an explicit weighting over instances with known labels.
5. Sentiment Data: Binary Prediction of Polarity
We further analyze the behavior of updating the support set over domain-shifted data for the task of predicting sentiment features in IMDb movie reviews. We consider recent work that re-annotates document-level classification data with minimal, local revisions that change the class labels (Kaushik, Hovy, and Lipton 2020; Gardner et al. 2020), from which we back out token-level labels for evaluation. We use this existing data-oriented approach for robust classification for controlled tests of the internal validity of our approach. We observe an ability to adapt the models via matching as with the grammar experiments. Additionally, in this context, we find that robust prediction over new, unseen domains remains challenging, but simple token-level heuristics tied to the K-NN approximation are nonetheless at least reasonably effective at constraining predictions to reliable subsets, and for screening data unlike that seen in training. This provides further justification for methods, such as proposed here, with which we can analyze and curate the data under the current generation of deep networks.
5.1 Sentiment Data: Experiments
We consider the task of predicting binary document-level sentiment in IMDb movie reviews. We analyze detection of sentiment features at the token level, treating it as a zero-shot sequence labeling task, and additionally provide document-level classification results when constraining the predictions based on the token-level heuristics.
5.1.1 Data: IMDb Sentiment (Negative vs. Positive) with Local Re-edits.
We use the IMDb data of Kaushik, Hovy, and Lipton (2020).10 This consists of movie reviews with negative sentiment (Y = −1) and positive sentiment (Y = 1), including reviews from the original review site (original, or Orig.) and “counterfactually augmented” revisions (Rev.), the latter of which were created by crowd-workers who annotated the original reviews with local, minimal changes that change the document-level label. For document/review-level sentiment, we follow the main splits of the original work and train on a sample of 3.4k original reviews, Orig. (3.4k), and the original reviews combined with their corresponding revisions, Orig.+Rev. (1.7k+1.7k). For experiments modifying the support set, we will also consider each of these halves separately, Orig. (1.7k) and Rev. (1.7k). For reference, we additionally train with the full set of original reviews, Orig. (19k), and the full set combined with the revisions, Orig.+Rev. (19k+1.7k). For evaluation, we consider the Orig. and Rev. test sets from previous work.
To control for the language distribution of the revisions, we also create a new set of disjoint source-target pairs for training by removing the corresponding original reviews and leaving the revisions. We then add in disjoint samples from the remaining full set of original reviews to fill out the remaining sample size. For the smaller set this results in a set of 3.4k reviews, Orig.DISJOINT+Rev. (1.7k+1.7k), the same size as the comparable parallel set. For the larger set, we simply remove any original reviews that match the original reviews paired with revised reviews, creating Orig.DISJOINT+Rev. (19k-1.7k+1.7k).
Sentiment Diffs for Token-Level Detection.
We use the parallel original and revision data to create token-level feature labels. Treating positive reviews as the source, we deterministically generate source-target transduction diffs in the same manner as Schmaltz et al. (2017). We then assign the positive class (yn = 1) to tokens associated with diffs that transduce to documents for which Y = 1, assigning all other tokens to the negative class (yn = −1). We use a similar convention as the FCE data set in Section 4 with respect to insertions, deletions, and replacements. Table C1 provides an example.
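As an illustration of how token-level labels can be backed out of an (original, revised) pair, the sketch below uses Python's difflib as a stand-in for the transduction-diff convention of Schmaltz et al. (2017); the exact handling of insertions, deletions, and replacements in our data follows the FCE convention and may differ in detail.

```python
import difflib

def token_diff_labels(source_tokens: list, target_tokens: list) -> list:
    # Label source tokens that participate in a diff as 1 and all others as -1.
    labels = [-1] * len(source_tokens)
    matcher = difflib.SequenceMatcher(a=source_tokens, b=target_tokens)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            for i in range(i1, i2):
                labels[i] = 1
    return labels
```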
5.1.2 Data: IMDb Sentiment (Negative vs. Positive) with Contrast Sets.
We additionally evaluate on the IMDb reviews of Gardner et al. (2020),11 which are revised with local re-edits by professional researchers familiar with the task instead of by crowd-sourced workers. This test set (Contrast) corresponds to the same set of reviews in the test set of Kaushik, Hovy, and Lipton (2020). We do not have a corresponding training set, nor do we use the corresponding dev set for tuning, so we consider all evaluation on this set to be a domain-shifted setting.
5.1.3 Data: Out-of-domain Twitter Document-Level Sentiment (Negative vs. Positive).
Finally, we also evaluate on the test set of SemEval-2017 Task 4a (Rosenthal, Farra, and Nakov 2017).12 This consists of Twitter messages, which are significantly different from the IMDb movie reviews in terms of the topics covered, the language distribution, and the length of the documents, so we consider this to be an out-of-domain setting. We follow the previous work of Kaushik, Hovy, and Lipton (2020) in evaluating the binary classification results with accuracy. We balance the test set, using equal numbers of negative and positive Tweets, and drop the third class (neutral) for consistency with the earlier work, resulting in 4,750 Twitter messages for evaluation.
5.1.4 Models
Our core model is the uniCNN+BERT model from Section 4, with which we vary the training set and the data in the support set. The only differences from the uniCNN+BERT configuration used in the grammar detection experiments are that we set the maximum length, in WordPieces, to 350, as in previous work, and that we choose the training epoch (up to a maximum of 60 epochs) by the highest accuracy on the dev set.
We evaluate token-level predictions of sentiment diffs using the F0.5 metric, as with grammatical error detection above. We vary whether the support set includes data from the Orig. and/or Rev. training sets, using the labels +ExAG and +ExAT from Section 3.6 to identify the particular rules used. We also present results where we allow the models a small amount of data to tune the decision boundary for the token-level predictions. For consistency, we always use the dev set of the Orig. reviews subset, using the subscript +ORIG_DEV to indicate that the models have access to 245 sentences with token-level labels. This provides a point of comparison to the exemplar auditing decision rules.
K-NN.
We train the distance-weighted K-NN approximation, uniCNN+BERT+K8NNDIST., on the held-out KNN dev set to minimize δKNN as in Section 4, but for up to 60 epochs. The original model is trained on the Orig. (3.4k) data. For comparisons with the experiments using the inference-time decision rules, the K-NN is trained with Orig. (1.7k) as the support set, using half of +ORIG_DEV for setting the K-NN parameters and the other half as the held-out KNN dev set. This is a relatively limited amount of data, but it is sufficient for training the 3 parameters of the K-NN to at least match the accuracy of the original model.
K-NN Token-Level Constraints for Document-Level Classification.
The K-NN enables interpretable heuristics for constraining predictions to the most reliable subsets of the data. In Section 4, we demonstrated this for token-level detection; here, we show how this idea can be applied toward document-level classification, as well. As with detection in Table 6, token-level predictions are constrained by a maximum allowed distance to the nearest match in the support set and K-NN output magnitude limits derived from correct approximations on the KNN dev set, determined without access to token-level labels. For both distances and magnitudes, we use the mean for each class among correct approximations. Using the full +ORIG_DEV set we then set limits on the proportion and/or range of admitted tokens per document required to admit the overall document-level classification from the original uniCNN+BERT model. To emulate a high-risk setting, we set the minimum threshold such that all admitted document-level predictions are correct on the dev set. We also optionally further require the total number of tokens admitted to be within ± 1 standard deviation from the mean of correct predictions to control for unexpected lengths.13
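A sketch of the resulting document-level screening rule follows, with the ≈10% proportion threshold and the 5–15 admitted-token range taken from the settings reported in Section 5.2; the per-token admit decisions are assumed to come from a K-NN distance/magnitude check such as the one sketched in Section 4.1.5.

```python
def admit_document(token_admitted, min_proportion=0.10, count_range=(5, 15)):
    # token_admitted: list of booleans, one per token, from the K-NN distance/magnitude checks.
    n_admitted = sum(token_admitted)
    if n_admitted / max(len(token_admitted), 1) < min_proportion:
        return False  # too few reliable tokens to trust the document-level prediction
    if count_range is not None and not (count_range[0] <= n_admitted <= count_range[1]):
        return False  # admitted-token count outside the allowed range (±1 std. dev. on dev)
    return True
```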
Previous Approaches.
Our primary focus in this section is holding the model architecture from Section 4 constant while changing the data subsets. For reference, we include the results of Kaushik, Hovy, and Lipton (2020), which fine-tunes the BERTBASE uncased model with the standard final linear layer for classification, BERTBASEUNCASED+FT. For comparison, we then also train a model using this same Transformer as frozen input with uncased GloVe embeddings, uniCNN+BERTBASEUNCASED, and also an analogous cased model with Word2Vec embeddings, uniCNN+BERTBASE.
5.2 Sentiment Data: Results
Document-Level Classification.
For context, Table 7 shows the document-level accuracy of uniCNN+BERT when varying the training data, tested on the original (Orig.) and revised (Rev.) test sets. Training with Orig. vs. Orig.+Rev. reflects the same patterns seen in the experiments of Kaushik, Hovy, and Lipton (2020); however, if we control for the language of the revised reviews by training with disjoint source-target pairs (Orig.DISJOINT+Rev.), the difference across test sets is more modest. For reference, we find that uniCNN+BERT is at least as effective as fine-tuning all parameters of the BERTBASE model, with the uniCNN+BERTBASEUNCASED variant within 2–3 points (Table C2).
Model Train. Data (Num. Reviews) | Review-level Sentiment (Accuracy): Orig. | Review-level Sentiment (Accuracy): Rev.
---|---|---
Random | 50.2 | 49.8 |
Orig. (3.4k) | 92.8 | 88.7 |
Orig.+Rev. (1.7k+1.7k) | 90.6 | 96.5 |
Orig.DISJOINT+Rev. (1.7k+1.7k) | 89.5 | 95.7 |
Orig. (19k) | 93.0 | 87.9 |
Orig.+Rev. (19k+1.7k) | 93.0 | 94.3 |
Orig.DISJOINT+Rev. (19k-1.7k+1.7k) | 93.0 | 90.2 |
Document-Level Classification with Token-Level Constraints.
Table 8 shows review-level test accuracy with uniCNN+BERT trained on the Orig. (3.4k) data using uniCNN+BERT+K8NNDIST. to determine constraints. Token-level predictions are constrained by a maximum allowed distance to the nearest match in the support set and K-NN output magnitude limits derived from correct approximations on the K-NN dev set, determined without access to token-level labels (as in Table 6). The document-level predictions are then constrained by a minimum threshold (≈ 10%) on the proportion of admitted tokens among all tokens in the document and optionally, an additional constraint on the allowed range of admitted tokens (between 5 and 15, which is ± 1 standard deviation from the mean), both determined from sentence-level labels on the dev set.
Test Data | Review-level Sentiment (Accuracy) | Admitted token % min | Admitted token min, max | Admitted n | n/N
---|---|---|---|---|---
Orig. | 92.8 | | | 488 | 1.0
Orig. | 96.2 | • | | 78 | 0.16
Orig. | 93.0 | • | • | 43 | 0.09
Rev. | 88.7 | | | 488 | 1.0
Rev. | 98.1 | • | | 52 | 0.11
Rev. | 97.0 | • | • | 33 | 0.07
SemEval-2017 | 77.8 | | | 4,750 | 1.0
SemEval-2017 | 81.4 | • | | 576 | 0.12
SemEval-2017 | 100.0 | • | • | 1 | 0.0002
These simple, understandable constraints derived from the token-level predictions are effective at restricting the model to the most reliable document-level predictions, including on dramatically different out-of-domain input (SemEval-2017). For the constraints with the original (Orig.) and revised (Rev.) test sets, the same 3 and 1 reviews, respectively, are missed with both constraint variants, which accounts for the nominally lower accuracy as a result of a smaller denominator; notably, 1 review in each of these sets is incorrectly or ambiguously annotated in the ground-truth data. On average, only around 1 token is admitted per Tweet in the SemEval-2017 data with the distance and magnitude constraints, so the hard token count constraints readily filter most such data for document-level predictions, which is desirable given the mismatch with the training data. In contrast, the orthogonal approach of seeking more robust predictions by including source-target pairs was not consistently beneficial, as shown in Table C3.
Token-Level Feature Detection.
The token-level feature detections follow a similar pattern across the training data sets as the document-level predictions, with gains observed with the locally re-edited data and, to a lesser extent, the disjoint sets, as shown in Table 9 and, for the true zero-shot setting, in Table 10. The predictions from the K-NN are at least as effective as those of the original model. As with the error detection experiments, the inference-time decision rules can be used to update the model without retraining (Table 10), which in some cases results in F0.5 scores approaching those of training on that same data.
Model Train. Data (Num. Reviews) | Token-level Sentiment Diffs (F0.5): Orig. | Token-level Sentiment Diffs (F0.5): Rev.
---|---|---
Random | 6.0 | 7.6 |
Orig. (3.4k)+ORIG_DEV (K8NNDIST.) | 29.5 | 23.5 |
Orig. (3.4k)+ORIG_DEV | 26.2 | 22.5 |
Orig.+Rev. (1.7k+1.7k)+ORIG_DEV | 32.4 | 33.1 |
Orig.DISJOINT+Rev. (1.7k+1.7k)+ORIG_DEV | 32.4 | 31.5 |
Orig. (19k)+ORIG_DEV | 24.8 | 21.7 |
Orig.+Rev. (19k+1.7k)+ORIG_DEV | 28.8 | 27.9 |
Orig.DISJOINT+Rev. (19k-1.7k+1.7k)+ORIG_DEV | 28.2 | 26.8 |
The observed patterns are analogous on the professionally annotated Contrast test set, as shown in Tables 11 and 12. A relatively modest amount of labeled data in the support set is sufficient for improving effectiveness in detecting the token-level sentiment features as seen in the rightmost column of Table 12.
Contrast Sets

Model Train. Data (Num. Reviews) | Review-level Sentiment (Accuracy): Contrast | Token-level Sentiment Diffs (F0.5): Contrast
---|---|---
Random | 49.8 | 8.4 |
Orig. (3.4k) | 82.4 | 17.1 (+ORIG_DEV) |
Orig.+Rev. (1.7k+1.7k) | 93.0 | 28.4 (+ORIG_DEV) |
Orig.DISJOINT+Rev. (1.7k+1.7k) | 91.2 | 26.9 (+ORIG_DEV) |
Orig. (19k) | 81.4 | 18.0 (+ORIG_DEV) |
Orig.+Rev. (19k+1.7k) | 90.0 | 23.5 (+ORIG_DEV) |
Orig.DISJOINT+Rev. (19k-1.7k+1.7k) | 88.1 | 23.4 (+ORIG_DEV) |
5.3 Sentiment Classification and Feature Detection: Discussion
As with error detection, on the sentiment data sets we demonstrate that we can leverage dense representation matching to update a model and to improve token-level feature detection. Remarkably, with a strong neural model and an inductive bias conducive to matching, we can begin to close the gap with models trained on domain-shifted data simply by updating the support set, which points to new flexibility in adapting models. However, this still requires at least some data from the distribution of the new domain to be available. When we carefully control for data distributions, robust prediction over data from unseen domain-shifted and out-of-domain distributions remains challenging, ceteris paribus, even with recently proposed data perturbation approaches, which is consistent with the broad patterns observed in the contemporaneous works of Taori et al. (2020) and Gulrajani and Lopez-Paz (2021) for image data. This is a point of concern for higher-risk settings, as some amount of domain shift or subpopulation shift will invariably occur in many real-world settings.
Faced with these challenges, we can instead constrain document-level predictions based on an interpretable token-level K-NN derived from the deep model. This combination of feature-level detection derived from document-level labels, dense matching, and heuristics that can be traced back to individual token-level predictions across the support set offers an alternative, practical approach for deploying deep models in higher-risk settings, in which we refrain from predicting over domain-shifted data and out-of-domain data over which reliable predictions and bounds remain elusive. In this way, we can refrain from predicting when necessary and then re-label, update, and as needed, re-train models in a continual loop based on these methods. For instructive purposes, we contrast such a framework with local re-edits in Figure 4.
6. Sentiment Data: Binary Prediction of Local Annotation Edits
In Section 5.1, we found locally re-edited data to be useful in analyzing and evaluating feature detection for a classification task typically only labeled at the document-level. In this section, we use the same data sets to demonstrate that our proposed methods can be used to uncover subtle distributional differences across annotations, which can be used, for example, for filtering and performing quality control on data sets for training and evaluation.
6.1 Binary Prediction of Local Annotation Edits: Experiments
Kaushik, Hovy, and Lipton (2020) report that the BERTBASEUNCASED+FT model is able to distinguish original vs. revised reviews (hereafter, “annotator domain”) with an accuracy of about 77 percent. We investigate this further, illustrating how the proposed approach for token-level detections can be used for fine-grained text analysis.
6.1.1 Data: Predicting Annotator Domain (Original vs. Revised).
We assign Y = −1 to the original reviews and Y = 1 to the revised reviews. We report results at the review level on varying subsets of the data, including splits by sentiment. We refer to the subset of original and revised reviews restricted to reviews with negative sentiment with the label (Orig.+Rev.) ∧ Neg., and similarly for other subsets. We derive token-level labels analogously to those created for sentiment in Section 5.1, except the diffs here represent the transduction from revised reviews (source) to original reviews (target). Applicable tokens in revised reviews receive a class 1 (yn = 1) label, whereas tokens in original reviews are all assigned a yn = −1 label. We similarly analyze the professionally annotated Contrast test set of Gardner et al. (2020), predicting the original reviews vs. the professionally annotated alternatives.
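For illustration, a minimal sketch of this label derivation is given below, using Python's difflib as a stand-in for whatever diff procedure produced the released data; the paper's exact tokenization and diff conventions may differ. Tokens on the revised (source) side that are not matched in the original (target) receive yn = 1; everything else, and all tokens of original reviews, receive yn = −1.

```python
# A minimal sketch of deriving token-level labels from a revised/original review pair.
# difflib is used purely for illustration; the data set's own diffs may be constructed differently.
import difflib

def revised_token_labels(revised_tokens, original_tokens):
    labels = [-1] * len(revised_tokens)
    matcher = difflib.SequenceMatcher(a=revised_tokens, b=original_tokens)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):        # revised-side tokens with no match in the original
            for i in range(i1, i2):
                labels[i] = 1
    return labels

revised = "One of the bad classic comedies .".split()
original = "One of the great classic comedies .".split()
print(revised_token_labels(revised, original))  # 'bad' receives label 1; all other tokens -1
```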
6.1.2 Models.
We train the uniCNN+BERT model on the 3414 parallel original and counterfactually augmented revised reviews, using the 490 paired reviews of the dev set to choose the epoch with highest accuracy.
6.2 Binary Prediction of Local Annotation Edits: Results
Predicting Annotator Domain (Original vs. Revised).
With the uniCNN+BERT model, original reviews are distinguishable from counterfactually revised reviews with an accuracy of around 80%, as shown in Table 13. The revised reviews are slightly easier to distinguish in general (accuracy of 80.5 vs. 78.7). The negative reviews are particularly distinct in relative terms: accuracy on the negative reviews in the combined set is nearly 9 points higher than on the positive reviews (84.0 vs. 75.2).
Test (Sub-)Set | Review-level (Not Sentiment): Accuracy | Review-level (Not Sentiment): Num. Reviews
---|---|---
Orig.+Rev. | 79.6 | 976 |
Orig. | 78.7 | 488 |
Rev. | 80.5 | 488 |
(Orig.+Rev.) ∧ Neg. | 84.0 | 488 |
(Orig.+Rev.) ∧ Pos. | 75.2 | 488 |
Orig. ∧ Neg. | 84.0 | 243 |
Orig. ∧ Pos. | 73.5 | 245 |
Rev. ∧ Neg. | 84.1 | 245
Rev. ∧ Pos. | 77.0 | 243
We further examine the particularly distinctive language used in the negative reviews using the aggregate feature extraction of Section 3.5. We split the dev set14 according to the true document-level labels. Table 14 presents the top and lowest scoring negative class (i.e., original reviews) unigrams and positive class (i.e., revised reviews) unigrams, by total score (totaln-gram− and totaln-gram+) for the dev set reviews for each class,15 as well as the corresponding unigram frequency. We see a sharp distinction between the words most discriminative for each class. Certain unigrams, such as not and bad, occur with similar frequency in the original and revised reviews, but have diametrically opposed weightings for the respective classes. It seems that words that tend to be sentiment-laden, especially those with negative sentiment, are particularly discriminative features for distinguishing revised reviews. In Table 15, we show the 5-grams normalized by occurrence.16 The most discriminating phrases across classes are distinct, with the contextual use of words such as “bad,” “not,” and “waste” recognized by the model as being distinctive of original vs. revised reviews.
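A minimal sketch of this aggregation is given below, under the assumption that each token carries a signed token-level score from the model for the class split in question; the per-class scoring conventions of Section 3.5 may differ, and the function and variable names are illustrative.

```python
# A minimal sketch of aggregate n-gram feature scoring: accumulate the total token-level
# score and the frequency of each n-gram over one class's documents; the mean-normalized
# variant divides the total by the n-gram's frequency.
from collections import defaultdict

def aggregate_ngram_scores(docs, n=1):
    """docs: iterable of (tokens, token_scores) pairs for one document-level class."""
    totals, counts = defaultdict(float), defaultdict(int)
    for tokens, scores in docs:
        for i in range(len(tokens) - n + 1):
            ngram = tuple(tokens[i:i + n])
            totals[ngram] += sum(scores[i:i + n])    # total_{n-gram} score
            counts[ngram] += 1
    means = {g: totals[g] / counts[g] for g in totals}   # mean_{n-gram} score
    return totals, means, counts
```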
Review-level (Not Sentiment)

Orig.: unigram | Orig.: totaln-gram− score | Orig.: Total Frequency | Rev.: unigram | Rev.: totaln-gram+ score | Rev.: Total Frequency
---|---|---|---|---|---
but | 41.5 | 249 | not | 61.4 | 229 |
waste | 18.5 | 22 | terrible | 54.3 | 20 |
any | 11.9 | 56 | least | 44.1 | 26 |
just | 11.0 | 112 | bad | 43.1 | 61 |
still | 8.6 | 40 | worst | 32.6 | 22 |
that | 7.7 | 394 | poor | 31.9 | 21 |
only | 7.6 | 70 | awful | 22.4 | 13 |
But | 7.6 | 42 | dislike | 20.2 | 9 |
moving | 5.8 | 7 | great | 18.5 | 69 |
completely | 5.3 | 18 | boring | 18.1 | 25 |
… SKIPPED … | |||||
hated | −1.2 | 3 | missed | −1.8 | 4 |
excited | −1.3 | 3 | without | −1.8 | 21 |
horrible | −1.4 | 5 | just | −2.1 | 97 |
worst | −1.4 | 19 | lacks | −2.2 | 3 |
usual | −1.6 | 4 | lost | −2.3 | 9 |
disliked | −1.6 | 1 | I | −2.6 | 561 |
worse | −1.9 | 8 | that | −3.1 | 395 |
hate | −1.9 | 3 | any | −4.3 | 40 |
bad | −5.5 | 64 | waste | −4.5 | 9 |
not | −7.6 | 217 | but | −10.0 | 203 |
Review-level (Not Sentiment)

Orig.: 5-gram | Orig.: meann-gram− score | Rev.: 5-gram | Rev.: meann-gram+ score
---|---|---|---
little bit, but it still | 3.9 | his awful performance did not | 11.3
bit, but it still managed | 3.7 | dominated this film, his awful | 10.4
movie, but many elements ruined | 3.4 | Come is indeed a terrible | 10.3
killer down. A serious waste | 3.4 | a terrible work of speculative | 10.3
this slow paced, boring waste | 3.4 | This was a very bad | 10.0
movie is just a waste | 3.1 | /><br />A terrible look at | 8.8
waste of time. The most | 2.9 | dream home. <br /><br />A terrible | 8.8
to be nice people, but | 2.8 | movie is not a lot | 8.2
nice people, but can’t carry | 2.8 | This movie is not a | 8.2
people, but can’t carry a | 2.8 | remains one of my least | 7.9
…SKIPPED …
film. The usual superb acting | −1.6 | either been reduced to stereo | −1.7
disliked it and looking at | −1.6 | around have either been reduced | −1.7
the reasons that I disliked | −1.6 | would simply be a waste | −1.7
film or an even worse | −2.0 | don’t waste your time and | −1.7
this is such a bad | −2.5 | about lovey-dovey romance, don’t waste | −1.9
In Table 16 we display the top two revised reviews, ranked by the document-level score for the revised class, normalized by length. We have further highlighted both the ground-truth token-level domain diffs and the zero-shot sequence labeling predictions by the model (i.e., token-level scores > 0). The token-level domain diff predictions are typically subsets of the true diffs, with a focus on particularly sentiment-laden words, along the lines of what was shown in Tables 14 and 15. More generally, and rather remarkably, the zero-shot sequence labeling is sufficiently effective that the approach can be used as a tool for quickly scanning through a data set for distinctive words and phrases conditional on the document-level label, as demonstrated with additional examples in Table D1. Interestingly, just reading the documents in isolation, it is not always obvious that many of the detected diffs are from revisions, yet the model is nonetheless often able to detect such subtle distributional differences.
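A minimal sketch of this ranking and token marking follows, assuming per-token scores for the revised class are available from the model; the names and data layout are illustrative.

```python
# A minimal sketch of ranking documents by a length-normalized score and marking the
# zero-shot token-level detections (score > 0).
def rank_and_mark(docs):
    """docs: list of (doc_id, tokens, token_scores) tuples."""
    def normalized_score(doc):
        _doc_id, tokens, scores = doc
        return sum(scores) / max(len(tokens), 1)      # length-normalized document score
    ranked = sorted(docs, key=normalized_score, reverse=True)
    return [
        (doc_id, normalized_score((doc_id, tokens, scores)),
         [(tok, score > 0.0) for tok, score in zip(tokens, scores)])  # (token, predicted diff?)
        for doc_id, tokens, scores in ranked
    ]
```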
Review-level (Not Sentiment) |
---|---
Dev. Set Document 244/245 | |
Original | This is actually one of my favorite films, I would recommend that EVERYONE watches it. There is some great acting in it and it shows that not all ‘‘good’’ films are American.... |
True (Rev.) | This is actually one of my least favorite films, I would not recommend that ANYONE watches it. There is some bad acting in it and it shows that all ‘‘bad’’ films are American.... |
uniCNN+BERT (Rev.) Len. Norm. Score: 0.164 | This is actually one of my favorite films, I would recommend that ANYONE watches it. There is some bad acting in it and it shows that all ‘‘bad’’ films are American.... |
Dev. Set Document 266/267 | |
Original | One of the great classic comedies. Not a slapstick comedy, not a heavy drama. A fun, satirical film, a buyers beware guide to a new home. /> />Filled with great characters all of whom, Cary Grant is convinced, are out to fleece him in the building of a dream home. /> />A great look at life in the late 40’s. /> /> |
True (Rev.) | One of the bad classic comedies. Not a slapstick comedy, not a heavy drama. A boring, unfunny film, a buyers beware guide to a new home. /> />Filled with terrible characters all of whom, Cary Grant is falsely convinced, are out to fleece him in the building of a dream home. /> />A terrible look at life in the late 40’s. /> /> |
uniCNN+BERT (Rev.) Len. Norm. Score: 0.133 | One of the bad classic comedies. Not a slapstick comedy, not a heavy drama. A unfunny film, a buyers beware guide to a new home. /> />Filled with terrible characters all of whom, Cary Grant is falsely convinced, are out to fleece him in the building of a dream home. /> />A look at life in the late 40’s. /> /> |
Predicting Annotator Domain (Original vs. Professional Revisions).
The model is nearly as effective at distinguishing the professionally annotated reviews as the crowd-sourced revised reviews, with the overall accuracy only a couple of points lower, as shown in Table 17, even though the model only sees crowd-sourced revisions in training and development. The negative reviews are again easier to distinguish overall, but in this case we see that this is driven by accuracy on the original reviews, which are more readily distinguished. This might be attributable to the effects of the domain shift, with the original reviews being seen in training, while the professionally annotated counterparts are not. As with the counterfactually augmented edits, without such model assistance, it is often not obvious that a review has been revised, especially given the otherwise informal language of movie reviews. However, the class-conditional feature detection is strong enough that the token-level predictions can be visualized and some of the discriminative words and phrases participating in the diffs identified, as shown in Table D2.
Contrast Sets

Test (Sub-)Set | Review-level (Not Sentiment): Accuracy | Review-level (Not Sentiment): Num. Reviews
---|---|---
Orig.+Contrast | 77.8 | 976 |
Orig. | 78.7 | 488 |
Contrast | 76.8 | 488 |
(Orig.+Contrast) ∧ Neg. | 78.7 | 488 |
(Orig.+Contrast) ∧ Pos. | 76.8 | 488 |
Orig. ∧ Neg. | 84.0 | 243 |
Orig. ∧ Pos. | 73.5 | 245 |
Contrast ∧ Neg. | 73.5 | 245 |
Contrast ∧ Pos. | 80.2 | 243 |
6.3 Prediction of Local Annotation Edits: Discussion
With effective zero-shot sequence labeling, we gain a straightforward means of aggregating features from a deep network when only given document-level labels. As we have shown, this can be used to analyze text data sets, detecting rather subtle distributional differences that are not readily perceptible without such model assistance, at least at scale. Deep networks are typically viewed as strong predictors at the unit of analysis of the training set’s labels; with the mechanism proposed here, we gain a means of leveraging that discriminative ability at lower resolutions to analyze the input data.
7. Discussion
This new facility for dense representation matching at resolutions of the input more fine-grained than the available labels is a substantive departure from existing approaches in computational linguistics, providing new flexibility for locally updating a model and analyzing data sets under the model. It draws a connection between attention-style mechanisms and the older distance metric learning literature (Weinberger and Saul 2009, inter alia), relying on the inductive bias of the CNN to learn summarized representations of the expressive deep network for subsequent matching via simple Euclidean distances. Fortunately, from a compute-efficiency perspective, this works well when training with standard cross-entropy losses against the available labels, without resorting to expensive supervised contrastive losses that search over representations during initial training. When a stronger sense of interpretability is needed, we can then subsequently train an effective K-NN approximation with just 3 learnable parameters from the frozen representations.
Prototypical networks (Snell, Swersky, and Zemel 2017) and matching networks (Vinyals et al. 2016) can also be updated by modifying a support set, but the means of doing so are markedly different from what we have proposed, motivated by different intended use cases. Critically for NLP settings, we are concerned with fine-grained feature detection, which necessitates both the proposed indirect approach for deriving predictions and representations from an imputation-trained deep network and a different approach to training. Additionally, unlike prototypical networks, we perform matching against every instance (in fact, every token) in the support set, rather than class means, which is a strength rather than a weakness for the intended interpretability and data set analysis applications. Finally, matching networks can also be viewed as a particular weighted K-NN. In contrast, our K-NN approximation of an already trained model is proposed as a parsimonious, interpretable replacement of the original model, and is trained accordingly.17
8. Conclusion
Deep networks are typically viewed as strong predictors that are otherwise immutable and inscrutable black boxes, with the non-identifiable parameters running into the millions and higher. In this context, we have demonstrated a series of approaches toward a more actionable understanding of a deep network over its input data. We have shown that a kernel-width-one CNN and a linear layer over a deep network is effective for deriving token-level predictions when only given document-level labels for training. This approach for class-conditional feature detection enables dense representation matching against a support set with known labels, which can be used with inference-time decision rules to constrain predictions. Additionally, we have shown that we can altogether replace a model’s output with an interpretable weighting over instances with known labels without loss of predictive effectiveness. In this way, we gain sequence labeling at varying label resolutions; local updatability of a model without re-training; interpretable token-level constraints over domain-shifted and out-of-domain data; and more generally, a model-assisted means for uncovering patterns in large data sets that may not be readily detectable at scale without the expressive, deep networks.
Appendix A. Contents
In Appendices B, C, and D, we provide additional results and output for the experiments on the grammatical error detection task, the sentiment data sets, and for the experiments predicting annotator re-edits, respectively.
Appendix B. Grammatical Error Detection Analysis and Examples
Table B1 shows five random examples of original sentences from the FCE test set and the corresponding labeled outputs from the cnn, uniCNN+BERT, uniCNN+BERT+mm, and uniCNN+BERT+S* models.
Sentence 174 | |
True | There are some informations you have asked me about . |
cnn | There are some informations you have asked me about . |
uniCNN+BERT | There are some informations you have asked me about . |
uniCNN+BERT+mm | There are some informations you have asked me about . |
uniCNN+BERT+S* | There are some informations you have asked me about . |
Sentence 223 | |
True | There is space for about five hundred people . |
cnn | There is space for about five hundred people . |
uniCNN+BERT | There is space for about five hundred people . |
uniCNN+BERT+mm | There is space for about five hundred people . |
uniCNN+BERT+S* | There is space for about five hundred people . |
Sentence 250 | |
True | It is n’t easy giving an answer at this question . |
cnn | It is n’t easy giving an answer at this question . |
uniCNN+BERT | It is n’t easy giving an answer at this question . |
uniCNN+BERT+mm | It is n’t easy giving an answer at this question . |
uniCNN+BERT+S* | It is n’t easy giving an answer at this question . |
Sentence 1302 | |
True | Your group has been booked in Palace Hotel which is one of the most comfortable hotels in London . |
cnn | Your group has been booked in Palace Hotel which is one of the most comfortable hotels in London . |
uniCNN+BERT | Your group has been booked in Palace Hotel which is one of the most comfortable hotels in London . |
uniCNN+BERT+mm | Your group has been booked in Palace Hotel which is one of the most comfortable hotels in London . |
uniCNN+BERT+S* | Your group has been booked in Palace Hotel which is one of the most comfortable hotels in London . |
Sentence 1551 | |
True | By the way you can visit the |
cnn | By the way you can visit the |
uniCNN+BERT | By the way you can visit the |
uniCNN+BERT+mm | By the way you can visit the |
uniCNN+BERT+S* | By the way you can visit the |
Tables B2 and B3 show the nearest matches used for the proposed inference-time decision rules for the first three sentences with ground-truth grammatical errors from Table B1 for the uniCNN+BERT+mm and uniCNN+BERT+S* models, respectively. We have provided the exemplar tokens and associated sentences from the support set (here, consisting of the FCE training set) wherever the model makes a positive prediction. For reference, we have also provided the sentence corresponding to the exemplar representation for any tokens marked in the ground-truth labels but missed by the model. The qualitative analysis is consistent with the quantitative results in the main text: When the test prediction is in the same direction as the prediction of the exemplar from the support set, the corresponding contexts, and the exemplar word itself—which is not always a verbatim lexical match—are often similar, particularly when the L2 distances are low.
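A minimal sketch of this exemplar lookup is given below, assuming the token-level dense representations have already been computed for both the test sentences and the support set; exact L2 search is used, as in the experiments, though the data structures and names here are illustrative.

```python
# A minimal sketch of retrieving the nearest support-set exemplar for each test token
# via exact Euclidean search (GPU-friendly at this scale).
import torch

def nearest_exemplars(query_vecs, support_vecs, support_tokens, support_labels):
    """query_vecs: [Q, D]; support_vecs: [S, D]; support_tokens/support_labels: length-S lists."""
    dists = torch.cdist(query_vecs, support_vecs)     # [Q, S] pairwise L2 distances
    min_d, min_i = dists.min(dim=-1)                  # nearest support-set token per query token
    return [
        {"distance": float(d), "token": support_tokens[int(i)], "label": support_labels[int(i)]}
        for d, i in zip(min_d, min_i)
    ]
```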
Sentence 174 | |
True | There are[1] some informations[3] you have asked me about . |
uniCNN+BERT+mm | There are some informations[3] you have asked me about . |
Exemplar [1] Dist. | 41.1 |
Exemplar [1] True | But, there are[1] three things which I would like to tell you . |
Exemplar [1] Pred. | But, there are[1] three things which I would like to tell you . |
Exemplar [3] Dist. | 34.0 |
Exemplar [3] True | I am very glad to hear that and would like to tell you all the informations[3] you need to know from me . |
Exemplar [3] Pred. | I am very glad to hear that and would like to tell you all the informations[3] you need to know from me . |
Sentence 250 | |
True | It is n’t easy giving[4] an answer at[7] this question . |
uniCNN+BERT+mm | It is n’t easy giving an answer at[7] this question . |
Exemplar [4] Dist. | 71.8 |
Exemplar [4] True | It ’s very difficult describing[4] all emotion I felt . |
Exemplar [4] Pred. | It ’s very difficult describing[4] all emotion I felt . |
Exemplar [7] Dist. | 63.3 |
Exemplar [7] True | I ’m going to reply at[7] your question . |
Exemplar [7] Pred. | I ’m going to reply at[7] your question . |
Sentence 1302 | |
True | Your group has been booked in[5] Palace[6] Hotel which is one of the most comfortable hotels in London . |
uniCNN+BERT+mm | Your group has been booked in[5] Palace Hotel which is one of the most comfortable hotels in London . |
Exemplar [5] Dist. | 57.0 |
Exemplar [5] True | Secondly I would prefer to be accommodate in[5] log cabins . |
Exemplar [5] Pred. | Secondly I would prefer to be accommodate in[5] log cabins . |
Exemplar [6] Dist. | 59.6 |
Exemplar [6] True | I insisted on going to your theatre, to the Circle[6] Theatre, because I have heard that it is one of the best theatres in London . |
Exemplar [6] Pred. | I insisted on going to your theatre, to the Circle[6] Theatre, because I have heard that it is one of the best theatres in London . |
Sentence 174 | |
True | There are[1] some informations[3] you have asked me about . |
uniCNN+BERT+S* | There are some informations[3] you have asked me about . |
Exemplar [1] Dist. | 32.7 |
Exemplar [1] True | But, there are[1] three things which I would like to tell you . |
Exemplar [1] Pred. | But, there are[1] three things which I would like to tell you . |
Exemplar [3] Dist. | 24.0 |
Exemplar [3] True | I am very glad to hear that and would like to tell you all the informations[3] you need to know from me . |
Exemplar [3] Pred. | I am very glad to hear that and would like to tell you all the informations[3] you need to know from me . |
Sentence 250 | |
True | It is n’t easy giving[4] an answer at[7] this question . |
uniCNN+BERT+S* | It is n’t easy giving an answer at[7] this question . |
Exemplar [4] Dist. | 54.6 |
Exemplar [4] True | To say nothing about his or her giving[4] advice ! |
Exemplar [4] Pred. | To say nothing about his or her giving[4] advice ! |
Exemplar [7] Dist. | 44.0 |
Exemplar [7] True | I ’m going to reply at[7] your question . |
Exemplar [7] Pred. | I ’m going to reply at[7] your question . |
Sentence 1302 | |
True | Your group has been booked in[5] Palace[6] Hotel which is one of the most comfortable hotels in London . |
uniCNN+BERT+S* | Your group has been booked in Palace[6] Hotel which is one of the most comfortable hotels in London . |
Exemplar [5] Dist. | 43.6 |
Exemplar [5] True | Accommodation in[5] log cabins would be better for me, because they are more comfortable . |
Exemplar [5] Pred. | Accommodation in[5] log cabins would be better for me, because they are more comfortable . |
Exemplar [6] Dist. | 46.8 |
Exemplar [6] True | There will be The London Fashion and Leisure Show in Central[6] Exhibition Hall on the 14th of March . |
Exemplar [6] Pred. | There will be The London Fashion and Leisure Show in Central[6] Exhibition Hall on the 14th of March . |
Table B4 contains the unigram positive class n-grams normalized by occurrence (meann-gram+) for the training sentences for which Y = 1. The top scoring such unigrams constitute a relatively sharp list of misspellings. We also include the lowest scoring such unigrams at the bottom of the table, as a check on our feature scoring method. The ranked features are as we would expect, with the lowest scoring unigrams being names and other words that are otherwise correctly spelled.
unigram | meann-gram+ score | Total Frequency
---|---|---
wating | 22.5 | 1 |
noize | 21.9 | 1 |
exitation | 21.5 | 1 |
exitement | 21.2 | 1 |
toe | 20.1 | 1 |
fite | 20.0 | 1 |
ofer | 20.0 | 2 |
n | 19.7 | 5 |
intents | 18.6 | 1 |
wit | 17.7 | 2 |
defences | 17.5 | 1 |
meannes | 17.5 | 1 |
baying | 17.3 | 1 |
saing | 17.1 | 2 |
dipends | 17.0 | 1 |
lair | 16.7 | 2 |
torne | 16.7 | 1 |
farther | 16.2 | 1 |
andy | 16.0 | 1 |
seasonaly | 15.9 | 1 |
remainds | 15.6 | 1 |
sould | 15.5 | 4 |
availble | 15.5 | 3 |
…SKIPPED… | ||
sixteen | −1.7 | 3 |
Uruguay | −1.7 | 1 |
Jose | −1.7 | 1 |
leg | −1.7 | 3 |
Joseph | −2.0 | 1 |
deny | −2.1 | 1 |
Sandre | −2.2 | 1 |
leather | −2.4 | 2 |
shoulder | −2.6 | 1 |
apartheid | −2.8 | 1 |
tablets | −2.8 | 1 |
Martial | −3.0 | 1 |
Lorca | −3.1 | 1 |
Table B5 compares the K-NN output with that of the original model, uniCNN+BERT+mm, on the domain-shifted test set, as with Figure 3 in the main text.
Appendix C. Sentiment Analysis
Sentiment Diffs for Token-Level Detection.
An example of the process used to create the token-level detection labels for the sentiment data sets is shown in Table C1. Note that the in-line diffs of the first row are used for data creation, but are not subsequently directly used in training or inference. The diffs are guaranteed to transduce to the source and target, and the resulting positive class labels often correspond to positive sentiment. Occasionally there are edge cases created by the diff process and/or the underlying data for which an independent annotator tasked with labeling positive words might conceivably label differently. For example, in this review, “not” is assigned to the positive class, which is consistent with the diff of the original and revised reviews.
Model | Review-level Sentiment (Accuracy): Orig. | Review-level Sentiment (Accuracy): Rev.
---|---|---
BERTBASEUNCASED+FT | 93.2 | 93.9 |
uniCNN+BERTBASEUNCASED | 91.8 | 91.4 |
uniCNN+BERTBASE | 92.2 | 93.4 |
uniCNN+BERT | 93.0 | 94.3 |
Model Train. Data (Num. Reviews) | Review-level Sentiment (Accuracy): SemEval-2017
---|---
Random | 50. |
Orig. (3.4k) | 77.8 |
Orig.+Rev. (1.7k+1.7k) | 64.2 |
Orig.DISJOINT+Rev. (1.7k+1.7k) | 75.1 |
Orig. (19k) | 72.0 |
Orig.+Rev. (19k+1.7k) | 66.9 |
Orig.DISJOINT+Rev. (19k-1.7k+1.7k) | 76.5 |
Orig. (3.4k) BASEuncased | 75.7 |
Orig.+Rev. (1.7k+1.7k) BASEuncased | 73.5
Orig.DISJOINT+Rev. (1.7k+1.7k) BASEuncased | 75.2
Orig. (19k) BASEuncased | 68.5 |
Orig.+Rev. (19k+1.7k) BASEuncased | 72.6 |
Orig.DISJOINT+Rev. (19k-1.7k+1.7k) BASEuncased | 76.9 |
Appendix D. Sentiment Data: Binary Prediction of Local Annotation Edits
Tables D1 and D2 illustrate how the zero-shot sequence labeling predictions from the uniCNN+BERT model can be used as an assistant for analyzing text data sets, uncovering subtle patterns that are not easily discoverable in large data sets.
Counterfactually Augmented Data: Review-level (Not Sentiment) |
---|---
Dev. Set Document 40/41 | |
Original | [...] It shocks me that something exceptional like Firefly lasts one season, while garbage like the Battlestar Galactica remake spawns a spin off. [...] |
uniCNN+BERT (Rev.) | [...] It shocks me that something exceptional like Firefly lasts one season, while even shows like the Battlestar Galactica remake spawns a spin off. [...] |
Dev. Set Document 254/255 | |
Original | [...] A well made movie, one which I will always remember, and watch again. |
uniCNN+BERT (Rev.) | [...] A feeble movie, one which I will always remember and never watch again. |
Dev. Set Document 258/259 | |
Original | [...] We need that time again, now more than ever. [...] |
uniCNN+BERT (Rev.) | [...] We do need that time again, now than ever. [...] |
Dev. Set Document 276/277 | |
Original | [...] Highly, hugely recommended! |
uniCNN+BERT (Rev.) | [...] Highly, hugely recommended! |
Dev. Set Document 278/279 | |
Original | almost every review of this movie I’d seen was pretty bad. It’s not pretty bad, it’s actually pretty good, though not great. [...] |
uniCNN+BERT (Rev.) | almost every review of this movie I’d seen was pretty bad. And the reviews are correct, it’s actually pretty horrible, though [...] |
Contrast Sets: Review-level (Not Sentiment) |
---|---
Dev. Set Document 38/39 | |
Original | [...] The content of the film was very very moving. [...] |
uniCNN+BERT (Contrast) | [...] The content of the film was very very [...] |
Dev. Set Document 58/59 | |
Original | [...] Anyone who has the slightest interest in Gaelic, folk history, folk music, oral culture, Scotland, British history, multi-culturalism or social justice should go and see this film. |
uniCNN+BERT (Contrast) | [...] Anyone who has the slightest interest in Gaelic, folk history, folk music, oral culture, Scotland, British history, multi-culturalism or social justice should go and this film. |
Dev. Set Document 146/147 | |
Original | [...] It is hard to describe the incredible subject matter the Maysles discovered but everything in it works wonderfully. [...] |
uniCNN+BERT (Contrast) | [...] It is hard to describe the flawed subject matter the Maysles discovered but everything in it hopelessly. [...] |
Dev. Set Document 164/165 | |
Original | [...] The characters are cardboard clichs of everything that has ever been in a bad Sci-Fi series. [...] |
uniCNN+BERT (Contrast) | [...] The characters are imaginations everything that has ever been in a good Sci-Fi series. [...] |
Dev. Set Document 176/177 | |
Original | [...] There was also a forgettable sequel several years later, but this instant classic is not to be missed. |
uniCNN+BERT (Contrast) | [...] There was also a sequel several years later, which made this film even more missable. |
Dev. Set Document 182/183 | |
Original | [...] It has very little plot,mostly partying,beer drinking and fighting. [...] |
uniCNN+BERT (Contrast) | [...] It has very plot,mostly partying,beer drinking and fighting. [...] |
Dev. Set Document 184/185 | |
Original | [...] Whatever originality exists in this film - unusual domestic setting for a musical, lots of fantasy, some animation - is more than offset by a script that has not an ounce of wit or thought-provoking plot development. [...] |
uniCNN+BERT (Contrast) | [...] Whatever originality exists in this film - unusual domestic setting for a musical, lots of fantasy, some animation - is more than offset by a script that has wit thought-provoking plot development. [...] |
Acknowledgments
We thank the reviewers for their feedback and suggestions.
Notes
Hereafter, we will tend to use “token” instead of “word,” as the lowest resolution of the input will be determined by the tokenization scheme of the particular data set.
Our replication code is publicly available at https://github.com/allenschmaltz/exa.
We drop the constant bias term because we are ranking negative and positive class n-grams separately.
We use the term “exemplar” rather than “prototype,” as we use these representations directly, unique to each feature, rather than as class-based centroids.
We restrict our experiments to exact search, which is nonetheless reasonably fast using GPUs at this scale, to avoid introducing another source of variation, but approximate search could be used in practice for larger support sets.
For the FCE data set, each “document” consists of a single sentence.
We use the PyTorch (https://pytorch.org/) reimplementation of the original code base available at https://github.com/huggingface/pytorch-pretrained-BERT (Wolf et al. 2020).
Within the set of admitted predictions, we might then consider approaches for quantifying uncertainty, which we leave for future work. Here we focus on examining and establishing the K-NN behavior relative to the original model to justify its use as an interpretable substitute, as well as the types of interpretable heuristics this enables for avoiding domain-shifted and out-of-domain data.
Available at https://github.com/acmi-lab/counterfactually-augmented-data.
Available at https://github.com/allenai/contrast-sets/tree/master/IMDb.
This differs from a simple hard constraint on token input lengths. In principle, most Twitter messages could still be admitted by this model-dependent constraint, as the lower bound is around 5 tokens.
The overall accuracy for annotator domain prediction is 79.4 on the dev set, which is similar to that of the test set (79.6).
The analogous totaln-gram− and totaln-gram+ scores for Rev. and Orig., respectively, which are not shown, exhibit patterns in the expected, corresponding directions.
For display purposes, we have dropped subsequent n-grams with the same score, which typically just differ by a single non-discriminating word as the prefix or suffix token.