Measuring and Improving Consistency in Pretrained Language Models

Consistency of a model, that is, the invariance of its behavior under meaning-preserving alternations of its input, is a highly desirable property in natural language processing. In this paper we study the question: Are Pretrained Language Models (PLMs) consistent with respect to factual knowledge? To this end, we create ParaRel, a high-quality resource of English cloze-style query paraphrases. It contains a total of 328 paraphrases for 38 relations. Using ParaRel, we show that the consistency of all PLMs we experiment with is poor, though with high variance between relations. Our analysis of the representational spaces of PLMs suggests that they have a poor structure and are currently not suitable for representing knowledge robustly. Finally, we propose a method for improving model consistency and experimentally demonstrate its effectiveness.


Introduction
Pretrained Language Models (PLMs) are large neural networks that are used in a wide variety of NLP tasks. They operate under a pretrain-finetune paradigm: models are first pretrained over a large text corpus and then finetuned on a downstream task. PLMs are thought of as good language encoders, supplying basic language understanding capabilities that can be used with ease for many downstream tasks.
A desirable property of a good language understanding model is consistency: the ability to make consistent decisions in semantically equivalent contexts, reflecting a systematic ability to generalize in the face of language variability.
Examples of consistency include: predicting the same answer in question answering and reading comprehension tasks regardless of paraphrase (Asai and Hajishirzi, 2020); making consistent assignments in coreference resolution (Denis and Baldridge, 2009; Chang et al., 2011); or making summaries factually consistent with the original document (Kryscinski et al., 2020). While consistency is important in many tasks, nothing in the training process explicitly targets it. One could hope that the unsupervised training signal from large corpora made available to PLMs such as BERT (Devlin et al., 2019) or RoBERTa (Liu et al., 2019) is sufficient to induce consistency and transfer it to downstream tasks. In this paper, we show that this is not the case.

The code and resource are available at: https://github.com/yanaiela/pararel

Figure 1: Overview of our approach. We expect that a consistent model would predict the same answer for two paraphrases. In this example, the model is inconsistent on the Homeland and consistent on the Seinfeld paraphrases.
The recent rise of PLMs has sparked a discussion about whether these models can be used as Knowledge Bases (KBs) (Petroni et al., 2019, 2020; Davison et al., 2019; Peters et al., 2019; Jiang et al., 2020; Roberts et al., 2020). Consistency is a key property of KBs and is particularly important for automatically constructed KBs. One of the biggest appeals of using a PLM as a KB is that we can query it in natural language, instead of relying on a specific KB schema. The expectation is that PLMs abstract away from language and map queries in natural language into meaningful representations such that queries with identical intent but different language forms yield the same answer. For example, the query "Homeland premiered on [MASK]" should produce the same answer as "Homeland originally aired on [MASK]". Studying inconsistencies of PLM-KBs can also teach us about the organization of knowledge in the model, or the lack thereof. Finally, failure to behave consistently may point to other representational issues, such as the similarity between antonyms and synonyms (Nguyen et al., 2016) and overestimating events and actions (reporting bias) (Shwartz and Choi, 2020).
In this work, we study the consistency of factual knowledge in PLMs, specifically in Masked Language Models (MLMs): PLMs trained with the MLM objective (Devlin et al., 2019; Liu et al., 2019), as opposed to other strategies such as standard language modeling (Radford et al., 2019) or text-to-text training (Raffel et al., 2020). We ask: Is the factual information we extract from PLMs invariant to paraphrasing? We use zero-shot evaluation, since we want to inspect models directly, without adding biases through finetuning. This allows us to assess how much consistency was acquired during pretraining and to compare the consistency of different models. Overall, we find that the consistency of the PLMs we consider is poor, although there is high variance between relations.
We introduce PARAREL, a new benchmark that enables us to measure consistency in PLMs (§3), by using factual knowledge that was found to be partially encoded in them (Petroni et al., 2019; Jiang et al., 2020). PARAREL is a manually curated resource that provides patterns (short textual prompts) that are paraphrases of one another, with 328 paraphrases describing 38 binary relations such as X born-in Y, X works-for Y (§4). We then test multiple PLMs for knowledge consistency, i.e., whether a model predicts the same answer for all patterns of a relation. Figure 1 shows an overview of our approach. Using PARAREL, we probe for consistency in four PLM types: BERT, BERT-whole-word-masking, RoBERTa and ALBERT (§5). Our experiments with PARAREL show that current models have poor consistency, although with high variance between relations (§6).
Finally, we propose a method that improves model consistency by introducing a novel consistency loss ( §8). We demonstrate that trained with this loss, BERT achieves better consistency performance on unseen relations. However, more work is required to achieve fully consistent models.

Background
There has been significant interest in analyzing how well PLMs (Rogers et al., 2020) perform on linguistic tasks (Goldberg, 2019; Hewitt and Manning, 2019; Tenney et al., 2019; Elazar et al., 2021), commonsense (Forbes et al., 2019; Da and Kasai, 2019) and reasoning (Talmor et al., 2020), usually assessed by measures of accuracy. However, accuracy is just one measure of PLM performance (Linzen, 2020). It is equally important that PLMs do not make contradictory predictions (cf. Figure 1), a type of error that humans rarely make. There has been relatively little research attention devoted to this question, i.e., to analyzing whether models behave consistently. One example concerns negation: Ettinger (2020) and Kassner and Schütze (2020) show that models tend to generate facts and their negation, a type of inconsistent behavior. Ravichander et al. (2020) propose paired probes for evaluating consistency. Our work is broader in scope, examining the consistency of PLM behavior across a range of factual knowledge types and investigating how models can be made to behave more consistently.
Consistency has also been highlighted as a desirable property in automatically constructed KBs and downstream NLP tasks. We now briefly review work along these lines.
Consistency in knowledge bases (KBs) has been studied in theoretical frameworks in the context of the satisfiability problem and KB construction, and efficient algorithms for detecting inconsistencies in KBs have been proposed (Hansen and Jaumard, 2000; Andersen and Pretolani, 2001). Other work aims to quantify the degree to which KBs are inconsistent and to detect inconsistent statements (Thimm, 2009, 2013; Muiño, 2011).
Consistency in question answering was studied by Ribeiro et al. (2019) in two tasks: visual question answering (Antol et al., 2015) and reading comprehension (Rajpurkar et al., 2016). They automatically generate questions to test the consistency of QA models. Their findings suggest that most models are not consistent in their predictions. In addition, they use data augmentation to create more robust models. Alberti et al. (2019) generate new questions conditioned on context and answer from a labeled dataset, filtering out answers that are not consistent with the original answer. They show that pretraining on these synthetic data improves QA results. Asai and Hajishirzi (2020) use data augmentation that complements questions with symmetry and transitivity, as well as a regularizing loss that penalizes inconsistent predictions. Kassner et al. (2021b) propose a method to improve accuracy and consistency of QA models by augmenting a PLM with an evolving memory that records PLM answers and resolves inconsistency between answers.
Work on consistency in other domains includes Du et al. (2019), who improve the prediction of consistency in procedural text. Ribeiro et al. (2020) use consistency for more robust evaluation. Li et al. (2019) measure and mitigate inconsistency in natural language inference (NLI), and finally, Camburu et al. (2020) propose a method for measuring inconsistencies in NLI explanations (Camburu et al., 2018).

Probing PLMs for Consistency
In this section, we formally define consistency and describe our framework for probing consistency of PLMs.

Consistency
We define a model as consistent if, given two cloze-phrases such as "Seinfeld originally aired on [MASK]" and "Seinfeld premiered on [MASK]" that are quasi-paraphrases, it makes non-contradictory predictions on N-1 relations over a large set of entities. A quasi-paraphrase, a concept introduced by Bhagat and Hovy (2013), is a fuzzier version of a paraphrase: it does not rely on the strict, logical definition of paraphrase and allows us to operationalize concrete uses of paraphrases. This definition is in the spirit of the RTE definition (Dagan et al., 2005), which similarly supports a more flexible use of the notion of entailment. For instance, a model that predicts NBC and ABC on the two aforementioned patterns is not consistent, since these two facts are contradictory. We define a cloze-pattern as a cloze-phrase that expresses a relation between a subject and an object. Note that consistency does not require the answers to be factually correct. While correctness is also an important property for KBs, we view it as a separate objective and measure it independently. We use the terms paraphrase and quasi-paraphrase interchangeably.
Many-to-many (N-M) relations (e.g., shares-border-with) can be consistent even with different answers (provided they are correct). For instance, two patterns that express the shares-border-with relation and predict Albania and Bulgaria for Greece are both correct. We do not consider such relations when measuring consistency. However, another requirement of a KB is determinism, i.e., returning results in the same order (when more than a single result exists). In this work, we focus on consistency, but also measure determinism of the models we inspect.

The Framework
An illustration of the framework is presented in Figure 2. Let D_i be a set of subject-object KB tuples (e.g., <Homeland, Showtime>) from some relation r_i (e.g., originally-aired-on), accompanied by a set of quasi-paraphrase cloze-patterns P_i (e.g., X originally aired on Y). Our goal is to test whether the model consistently predicts the same object (e.g., Showtime) for a particular subject (e.g., Homeland). To this end, we substitute X with a subject from D_i and Y with [MASK] in all of the patterns P_i of that relation (e.g., Homeland originally aired on [MASK] and Homeland premiered on [MASK]). A consistent model must predict the same entity.
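The population step can be sketched as follows (a minimal illustration; the pattern strings and the tuple below are simplified stand-ins for the actual resource, not ParaRel data):

```python
# Sketch of the pattern-population step: substitute a subject into every
# quasi-paraphrase pattern of a relation, masking the object slot.
# Patterns and the subject here are toy examples.

MASK = "[MASK]"

def populate(patterns, subject):
    """Fill the subject slot (X) and mask the object slot (Y) in all patterns."""
    return [p.replace("X", subject).replace("Y", MASK) for p in patterns]

patterns = ["X originally aired on Y", "X premiered on Y"]
queries = populate(patterns, "Homeland")
print(queries)
# A consistent model should predict the same object for every query in the list.
```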
Restricted Candidate Sets Since PLMs were not trained to serve as KBs, they often predict words that are not KB entities; e.g., a PLM may predict, for the pattern "Homeland originally aired on [MASK]", the noun 'tv', which is a likely completion under the language modeling objective, but not a valid KB fact completion. Therefore, following Xiong et al. (2020), Ravichander et al. (2020) and Kassner et al. (2021a), we restrict the PLMs' output vocabulary to the set of possible gold objects for each relation from the underlying KB. For example, in the born-in relation, instead of inspecting the entire vocabulary of a model, we only keep objects from the KB, such as Paris, London, Tokyo, etc.
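Restricting the candidate set amounts to taking the argmax over the relation's gold objects only; a minimal sketch (the vocabulary, scores, and candidate set below are invented for illustration, not from a real PLM):

```python
# Sketch of candidate-set restriction: instead of taking the argmax over
# the full vocabulary, keep only the logits of the relation's gold objects.
# Vocabulary, logits, and candidates are illustrative toy values.

def restricted_prediction(logits, vocab, candidates):
    """Return the highest-scoring token among the relation's candidate objects."""
    candidate_ids = [i for i, tok in enumerate(vocab) if tok in candidates]
    best = max(candidate_ids, key=lambda i: logits[i])
    return vocab[best]

vocab = ["tv", "Paris", "London", "Tokyo", "the"]
logits = [5.0, 2.5, 3.1, 1.0, 4.2]           # 'tv' scores highest overall
candidates = {"Paris", "London", "Tokyo"}    # gold objects of, e.g., born-in
print(restricted_prediction(logits, vocab, candidates))  # 'London'
```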
Figure 2: Overview of our framework for assessing model consistency. D_i ("Data Pairs (D)" on the left) is a set of KB triplets of some relation r_i, which are coupled with a set of quasi-paraphrase cloze-patterns P_i ("Patterns (P)" on the right) that describe that relation. We then populate the subjects from D_i as well as a mask token into all patterns P_i (shown in the middle) and expect a model to predict the same object across all pattern pairs.

Note that this setup makes the task easier for the PLM, especially in the context of KBs. However, poor consistency in this setup strongly implies that consistency would be even lower without restricting candidates.

The PARAREL Resource
We now describe PARAREL, a resource designed for our framework (cf. Section 3.2). PARAREL is curated by experts, with a high level of agreement. It contains patterns for 38 relations from T-REx (Elsahar et al., 2018), a large dataset containing KB triples aligned with Wikipedia abstracts, with an average of 8.63 patterns per relation. Table 1 gives statistics. We further analyse the paraphrases used in this resource, partly based on the types defined in Bhagat and Hovy (2013), and report this analysis in Appendix B.
Construction Method PARAREL was constructed in four steps. (1) We began with the patterns provided by LAMA (Petroni et al., 2019) (one pattern per relation, referred to as base-pattern).
(2) We augmented each base-pattern with paraphrase patterns from LPAQA (Jiang et al., 2020). However, since LPAQA was created automatically (either by back-translation or by extracting patterns from sentences that contain both subject and object), some LPAQA patterns are not correct paraphrases. We therefore only include the subset of correct paraphrases. (3) Using SPIKE (Shlain et al., 2020), a search engine over Wikipedia sentences that supports syntax-based queries, we searched for additional patterns that appear in Wikipedia and added them to PARAREL. Specifically, we searched for Wikipedia sentences containing a subject-object tuple from T-REx and then manually extracted patterns from those sentences. (4) Lastly, we added further paraphrases of the base-pattern using the annotators' linguistic expertise. Two additional experts went over all the patterns and corrected them, engaging in discussion until reaching agreement and discarding patterns they could not agree on.
Human Agreement To assess the quality of PARAREL, we run a human annotation study. For each relation, we sample up to five paraphrases, comparing each of the new patterns to the base-pattern from LAMA. That is, if relation r_i contains the patterns p_1, p_2, p_3, p_4, and p_1 is the base-pattern, then we compare the pairs (p_1, p_2), (p_1, p_3), (p_1, p_4).
We populate the patterns with random subject-object pairs from T-REx (Elsahar et al., 2018) and ask annotators whether the resulting sentences are paraphrases. As a control, we also sample patterns from different relations to provide examples that are not paraphrases of each other. Each task contains five patterns that are thought to be paraphrases and two that are not. Overall, we collect annotations for 156 paraphrase candidates and 61 controls. We asked NLP graduate students to annotate the pairs and collected one answer per pair. The agreement scores for the paraphrases and the controls are 95.5% and 98.3% respectively, which is high and indicates PARAREL's high quality. We also inspected the disagreements and fixed many additional problems to further improve quality.

Models & Data
We experiment with four PLMs: BERT (Devlin et al., 2019), BERT whole-word-masking, RoBERTa (Liu et al., 2019) and ALBERT (Lan et al., 2019). For BERT, RoBERTa and ALBERT, we use a base and a large version. We also report a majority baseline that always predicts the most common object for a relation. By construction, this baseline is perfectly consistent.
We use knowledge graph data from T-REx (Elsahar et al., 2018). To make the results comparable across models, we remove objects that are not represented as a single token in all models' vocabularies; 26,813 tuples remain. We further split the data into N-M relations, for which we report determinism results (seven relations), and N-1 relations, for which we report consistency (31 relations).

Notes: We asked the annotators to re-annotate any mismatch with our initial label, to allow them to fix random mistakes. BERT whole-word-masking is a version of BERT where words that are tokenized into multiple tokens are masked together. For ALBERT, we use the smallest and largest versions. We discard three poorly defined relations from T-REx. In a few cases, we filter entities from certain relations that contain multiple fine-grained relations, to make our patterns compatible with the data. For instance, most of the instances of the genre relation describe music genres; we thus remove tuples whose objects are non-music genres such as 'satire', 'sitcom' and 'thriller'.

Evaluation
Our consistency measure for a relation r_i (Consistency) is the percentage of consistent predictions over all pattern pairs p_k, p_l ∈ P_i of that relation, for all its KB tuples d_j ∈ D_i. Thus, for each KB tuple from a relation r_i that contains n patterns, we consider predictions for n(n-1)/2 pairs.
We also report Accuracy, that is, the acc@1 of a model in predicting the correct object, using the original patterns from Petroni et al. (2019). In contrast to Petroni et al. (2019), we define it as the accuracy of the top-ranked object from the candidate set of each relation. Finally, we report Consistent-Acc, a new measure that evaluates individual objects as correct only if all patterns of the corresponding relation predict the object correctly. Consistent-Acc is much stricter and combines the requirements of both consistency (Consistency) and factual correctness (Accuracy). We report the average over relations, i.e., the macro average; the micro average produces similar results.
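The three measures can be sketched as follows for a single relation (a toy illustration; `preds` maps each subject to its per-pattern restricted predictions, and all names and data are invented for the example):

```python
# Sketch of the three evaluation measures for one relation, given a table
# preds[subject][pattern] of restricted predictions and gold[subject].
from itertools import combinations

def consistency(preds):
    """Fraction of agreeing prediction pairs, over n(n-1)/2 pattern pairs per tuple."""
    agree = total = 0
    for answers in preds.values():
        for a, b in combinations(answers.values(), 2):
            agree += a == b
            total += 1
    return agree / total

def accuracy(preds, gold, base_pattern):
    """acc@1 of the base pattern against the gold object."""
    return sum(preds[s][base_pattern] == gold[s] for s in gold) / len(gold)

def consistent_acc(preds, gold):
    """A tuple counts as correct only if every pattern predicts the gold object."""
    return sum(all(a == gold[s] for a in preds[s].values()) for s in gold) / len(gold)

preds = {
    "Homeland": {"aired on": "Showtime", "premiered on": "Showtime", "debuted on": "HBO"},
    "Seinfeld": {"aired on": "NBC", "premiered on": "NBC", "debuted on": "NBC"},
}
gold = {"Homeland": "Showtime", "Seinfeld": "NBC"}
print(consistency(preds))                 # (1 + 3) / 6 pairs agree, ~0.667
print(accuracy(preds, gold, "aired on"))  # 1.0
print(consistent_acc(preds, gold))        # 0.5
```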

Knowledge Extraction through Different Patterns
We begin by assessing our patterns and the degree to which they extract the correct entities. These results are summarized in Table 2. First, we report Succ-Patt, the percentage of patterns that successfully predicted the right object at least once. A high score suggests that the patterns are of high quality and enable the models to extract the correct answers. All PLMs achieve a perfect score. Next, we report Succ-Objs, the percentage of entities that were predicted correctly by at least one of the patterns. Succ-Objs quantifies the degree to which the models "have" the required knowledge. We observe that some tuples are not predicted correctly by any of our patterns: the scores vary between 45.8% for ALBERT-base and 65.7% for BERT-large. Since an average of 8.63 patterns per relation gives multiple ways to extract the knowledge, we interpret these results as evidence that a large part of T-REx knowledge is not stored in these models. Finally, we measure Unk-Const, a consistency measure for the subset of tuples for which no pattern predicted the correct answer, and Know-Const, consistency for the subset where at least one of the patterns for a specific relation predicted the correct answer. This split into subsets is based on Succ-Objs. Overall, the results indicate that when the factual knowledge is successfully extracted, the model is also more consistent. For instance, for BERT-large, Know-Const is 65.2% and Unk-Const is 48.1%.
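The Know-Const/Unk-Const split can be sketched like this (toy data; a tuple is assigned to the "known" subset when any of its patterns hits the gold object, following the Succ-Objs definition above):

```python
# Sketch of the Succ-Objs-based split: a tuple is "known" if at least one
# pattern predicted the gold object; pairwise consistency is then averaged
# separately over the known and unknown subsets. Data is illustrative.
from itertools import combinations

def pairwise_consistency(answers):
    pairs = list(combinations(answers, 2))
    return sum(a == b for a, b in pairs) / len(pairs)

def know_unk_consistency(preds, gold):
    known, unknown = [], []
    for subj, answers in preds.items():
        bucket = known if gold[subj] in answers else unknown
        bucket.append(pairwise_consistency(answers))
    avg = lambda xs: sum(xs) / len(xs) if xs else None
    return avg(known), avg(unknown)

preds = {
    "Homeland": ["Showtime", "Showtime", "HBO"],  # known: gold appears once
    "Friends": ["ABC", "ABC", "ABC"],             # unknown: gold never predicted
}
gold = {"Homeland": "Showtime", "Friends": "NBC"}
print(know_unk_consistency(preds, gold))  # Know-Const ~0.333, Unk-Const 1.0
```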

Consistency & Knowledge
In this section, we report the overall knowledge measure that was used in Petroni et al. (2019) (Accuracy), the consistency measure (Consistency), and Consistent-Acc, which combines knowledge and consistency (Consistent-Acc). The results are summarized in Table 3.
We begin with the Accuracy results. The results range between 29.8% (ALBERT-base) and 48.7% (BERT-large whole-word-masking). Notice that our numbers differ from Petroni et al. (2019) as we use a candidate set ( §3) and only consider KB triples whose object is a single token in all the PLMs we consider ( §5.1).
Next, we report Consistency (§5.2). The BERT models achieve the highest scores, with a consistent improvement from the base to the large version of each model. In contrast to previous work that observed quantitative and qualitative improvements of RoBERTa-based models over BERT (Liu et al., 2019; Talmor et al., 2020), in terms of consistency BERT outperforms both RoBERTa and ALBERT. Still, the overall results are low (61.1% for the best model), all the more remarkably so because the restricted candidate set makes the task easier. We note that the results vary widely between models (performance on original-language varies between 52% and 90%) and between relations (BERT-large performance is 92% on capital-of and 44% on owned-by).
Finally, we report Consistent-Acc: the results are much lower than for Accuracy, as expected, but follow similar trends: RoBERTa-base performs worst (16.4%) and BERT-large best (29.5%).
Interestingly, we find strong correlations between Accuracy and Consistency, ranging from 67.3% for RoBERTa-base to 82.1% for BERT-large (all with p-values < 0.01).
A striking result of the model comparison is the clear superiority of BERT, both in knowledge accuracy (also observed by Shin et al. (2020)) and in knowledge consistency. We hypothesize that this is caused by the different sources of training data: although Wikipedia is part of the training data for all models we consider, it is the main data source for BERT, but only a small portion for RoBERTa and ALBERT. When using additional data, some facts may be forgotten or contradicted in the other corpora; this can diminish knowledge and compromise consistency. Since Wikipedia is likely the largest unified source of factual knowledge available in unstructured data, giving it prominence in pretraining makes it more likely that the model will incorporate Wikipedia's factual knowledge well. These results may have a broader impact on models to come: training bigger models with more data (such as GPT-3; Brown et al., 2020) is not always beneficial.
Determinism We also measure determinism for N-M relations, i.e., we use the same measure as Consistency, but since different predictions may all be factually correct, disagreements do not necessarily convey consistency violations; rather, they indicate nondeterminism. For brevity, we do not present all results, but the trend is similar to the consistency results (although not comparable, as the relations are different): 52.9% and 44.6% for BERT-large and RoBERTa-base, respectively.

Effect of Pretraining Corpus Size Next, we study whether the number of tokens used during pretraining contributes to consistency. We use the pretrained RoBERTa models from Warstadt et al. (2020) and repeat the experiments on four additional models. These are RoBERTa-based models, trained on a sample of Wikipedia and the book corpus, with varying training sizes and parameters. We use one of the three published models for each configuration and report the average accuracy over the relations for each model in Table 4. Overall, Accuracy and Consistent-Acc improve with more training data. However, there is an interesting outlier to this trend: the model trained on one million tokens is more consistent than the models trained on ten and one hundred million tokens. A potentially crucial difference is that this model has many fewer parameters than the rest (to avoid overfitting). It is nonetheless interesting that a model trained on significantly less data can achieve better consistency. On the other hand, its accuracy scores are lower, arguably because the model was exposed to less factual knowledge during pretraining.

(Honnibal et al., 2020) and retain the path between the entities. Success on (1) indicates that the model's knowledge processing is robust to syntactic variation. Success on (2) indicates that the model's knowledge processing is robust to variation in word order and tense. Table 5 reports results.
While these results and the main results on the entire dataset are not comparable, as the pattern subsets differ, they are higher than the general results: 67.5% for BERT-large when only the syntax differs and 78.7% when the syntax is identical. This demonstrates that while PLMs have impressive syntactic abilities, they struggle to extract factual knowledge in the face of tense, word-order, and syntactic variation.

Do PLMs Generalize Over Syntactic Configurations?
McCoy et al. (2019) show that supervised models trained on MNLI (Williams et al., 2018), an NLI dataset (Bowman et al., 2015), use superficial syntactic heuristics rather than more generalizable properties of the data. Our results indicate that PLMs have problems along the same lines: they are not robust to surface variation.

Qualitative Analysis
To better understand the factors affecting consistent predictions, we inspect the predictions of BERT-large on the patterns shown in Table 6. We highlight several cases: The predictions in Example #1 are inconsistent, and correct for the first pattern (Amsterdam), but not for the other two (Madagascar and Luxembourg). The predictions in Example #2 also show a single pattern that predicted the right object; however, the two other patterns, which are lexically similar, predicted the same wrong answer, Renault. Next, the patterns of Example #3 produced two factually correct answers out of three (Greece, Kosovo) that simply do not correspond to the gold object in T-REx (Albania), since this is an M-N relation. Note that this relation is part of the determinism evaluation, not the consistency evaluation. The three different predictions in Example #4 are all incorrect. Finally, the last two examples demonstrate consistent predictions: Example #5 is consistent but factually incorrect (even though the correct answer is a substring of the subject), and Example #6 is consistent and factual.

Table 6: Predictions of BERT-large-cased. "Subject" and "Object" are from T-REx (Elsahar et al., 2018). "Pattern #i" / "Pred #i": three different patterns from our resource and their predictions. Predictions are colored in blue if the model predicted correctly (out of the candidate list), and in red otherwise. If there is more than a single erroneous prediction, each is colored in a different shade of red.

Representation Analysis
To provide insights into the models' representations, we inspect them after encoding the patterns. Motivated by previous work that found that words with the same syntactic structure cluster together (Chi et al., 2020; Ravfogel et al., 2020), we perform a similar experiment to test whether this behavior replicates with respect to knowledge: we encode the patterns, after filling the placeholders with subjects and masked tokens, and inspect the last-layer representations at the masked token position. When plotting the results using t-SNE (Maaten and Hinton, 2008), we mainly observe clustering based on the patterns, which suggests that encoding of knowledge of the entity is not the main component of the representations. Figure 3 demonstrates this for BERT-large encodings of the capital relation, which is highly consistent. To provide a more quantitative assessment of this phenomenon, we also cluster the representations, setting the number of centroids based on: (1) the number of patterns in each relation, which aims to capture pattern-based clusters, and (2) the number of subjects in each relation, which aims to capture entity-based clusters. This would allow for a perfect clustering in the case of perfect alignment between the representation and the inspected property. We measure the purity of these clusters using V-measure and observe that the clusters are mostly grouped by the patterns, rather than the subjects. Finally, we compute the Spearman correlation between the consistency scores and the V-measure of the representations. However, the correlation between these variables is close to zero, and therefore does not explain the models' behavior. We repeated these experiments while inspecting the objects instead of the subjects, and found similar trends.
This finding is interesting since it means that (1) these representations are not knowledge-focused, i.e., their main component does not relate to knowledge, and (2) the representation by its entirety does not explain the behavior of the model, and thus only a subset of the representation does. This finding is consistent with previous work that observed similar trends for linguistic tasks (Elazar et al., 2021). We hypothesize that this disparity between the representation and the behavior of the model may be explained by a situation where the distance between representations largely does not reflect the distance between predictions, but rather other, behaviorally irrelevant factors of a sentence.
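The purity analysis above can be sketched with a from-scratch V-measure (a minimal stdlib implementation of the standard homogeneity/completeness formulation; the label sequences below are toy stand-ins for pattern and subject labels, not our actual cluster assignments):

```python
# Minimal V-measure: harmonic mean of homogeneity and completeness, so that
# cluster assignments can be scored against either pattern labels or subject
# labels. Labels below are illustrative toy data.
from collections import Counter
from math import log

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values() if c)

def conditional_entropy(labels, given):
    """H(labels | given), averaged over the conditioning partition."""
    n = len(labels)
    h = 0.0
    for g in set(given):
        sub = [l for l, gv in zip(labels, given) if gv == g]
        h += (len(sub) / n) * entropy(sub)
    return h

def v_measure(classes, clusters):
    hc, hk = entropy(classes), entropy(clusters)
    homogeneity = 1.0 if hc == 0 else 1.0 - conditional_entropy(classes, clusters) / hc
    completeness = 1.0 if hk == 0 else 1.0 - conditional_entropy(clusters, classes) / hk
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)

# Clusters that exactly follow the pattern labels score 1.0 ...
print(v_measure(["p1", "p1", "p2", "p2"], [0, 0, 1, 1]))  # 1.0
# ... while clusters unrelated to the subject labels score 0.0.
print(v_measure(["s1", "s2", "s1", "s2"], [0, 0, 1, 1]))  # 0.0
```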

Improving Consistency in PLMs
In the previous sections, we showed that PLMs are generally not consistent in their predictions, and previous work has noted the lack of this property in a variety of downstream tasks. An ideal model would exhibit consistency after pretraining and would then transfer it to different downstream tasks. We therefore ask: Can we enhance current PLMs to make them more consistent?

Consistency Improved PLMs
We propose to improve the consistency of PLMs by continuing the pretraining step with a novel consistency loss. We make use of the T-REx tuples and the paraphrases from PARAREL.
For each relation r_i, we have a set of paraphrased patterns P_i describing that relation. We use a PLM to encode all patterns in P_i, after populating a subject that corresponds to the relation r_i and a mask token. We expect the model to make the same prediction for the masked token across all patterns.

Consistency Loss Function
As we evaluate the model using acc@1, the straightforward consistency loss would require the top-1 predictions to be identical:

argmax_i f_θ(P_n)[i] = argmax_i f_θ(P_m)[i]

where f_θ(P_n) is the output of an encoding function (e.g., BERT) parameterized by θ over input P_n, and f_θ(P_n)[i] is the score of the i-th vocabulary item. However, this objective compares the outputs of two argmax operations, making it discrete, discontinuous, and hard to optimize in a gradient-based framework. We instead relax the objective and require that the predicted distributions Q_n = softmax(f_θ(P_n)), rather than the top-1 predictions, be identical. We use two-sided KL divergence to measure the similarity between distributions:

D_KL(Q_n^{r_i} || Q_m^{r_i}) + D_KL(Q_m^{r_i} || Q_n^{r_i})

where Q_n^{r_i} is the predicted distribution for pattern P_n of relation r_i.
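The relaxed objective can be illustrated in plain Python (toy logits; in the actual setup the scores come from the PLM, restricted to the relation's candidate set):

```python
# Sketch of the relaxed consistency objective: softmax over (candidate-
# restricted) logits, then a two-sided KL divergence between the predicted
# distributions of two paraphrase patterns. Logits are illustrative.
from math import exp, log

def softmax(logits):
    m = max(logits)                       # subtract max for numerical stability
    exps = [exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl(p, q):
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def symmetric_kl(logits_n, logits_m):
    """Two-sided KL between Q_n and Q_m; zero iff the distributions match."""
    qn, qm = softmax(logits_n), softmax(logits_m)
    return kl(qn, qm) + kl(qm, qn)

identical = symmetric_kl([2.0, 1.0, 0.1], [2.0, 1.0, 0.1])
different = symmetric_kl([2.0, 1.0, 0.1], [0.1, 1.0, 2.0])
print(identical, different)  # 0.0 and a strictly positive value
```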
As most of the vocabulary is not relevant for the predictions, we filter it down to the k tokens in the candidate set of each relation (§3.2). We want to maintain the original capabilities of the model; focusing on the candidate set helps to achieve this goal, since most of the vocabulary is not affected by our new loss.
To encourage a more general solution, we make use of all the paraphrases together and enforce all predictions to be as close as possible. Thus, the consistency loss for all pattern pairs of a particular relation r_i is:

L_c = Σ_{n=1}^{k} Σ_{m=n+1}^{k} D_KL(Q_n^{r_i} || Q_m^{r_i}) + D_KL(Q_m^{r_i} || Q_n^{r_i})

where k is the number of patterns of relation r_i.

MLM Loss Since the consistency loss differs from the cross-entropy loss the PLM was trained with, we find it important to continue training with the MLM loss on text data, similar to previous work (Geva et al., 2020). We consider two alternatives for continuing the pretraining objective: (1) MLM on Wikipedia and (2) MLM on the patterns of the relations used for the consistency loss. We found that the latter works better. We denote this loss by L_MLM.

Consistency Guided MLM Continual Training
Combining our novel consistency loss with the regular MLM loss, we continue the PLM's training with the following final loss function, where the hyperparameter λ balances the two terms:

L = L_c + λ · L_MLM

This loss is computed per relation, for one KB tuple. We have many such instances, which we require to behave similarly; therefore, we batch together l = 8 tuples from the same relation and apply the consistency loss function to all of them.
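The combined objective and the batching scheme can be sketched as follows; how λ enters (here scaling the MLM term) and the averaging over the l tuples are our assumptions:

```python
def combined_loss(per_tuple_consistency_losses, mlm_loss, lam=0.5):
    """Final loss for one batch of l tuples from the same relation:
    the consistency losses averaged over the batch, plus the MLM loss
    scaled by the hyperparameter lambda."""
    l_c = sum(per_tuple_consistency_losses) / len(per_tuple_consistency_losses)
    return l_c + lam * mlm_loss

# Batch of l = 8 tuples from one relation (illustrative loss values).
losses = [0.2, 0.4, 0.1, 0.3, 0.5, 0.2, 0.3, 0.4]
total = combined_loss(losses, mlm_loss=1.2, lam=0.5)
```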

Setup
Since we evaluate our method on unseen relations, we also split train and test by relation type (e.g., location-based relations, which are very common in T-REx). Moreover, our method is aimed to be simple, effective, and to require only minimal supervision. Therefore, we opt to use only three relations: original-language, named-after, and original-network; these were chosen randomly out of the non-location-related relations. For validation, we randomly pick three of the remaining relations and use the remaining 25 for testing.

Table 7: Knowledge and consistency results for the baseline, BERT-base, and our model. The results are averaged over the 25 test relations. Underlined: best performance overall, including ablations. Bold: best performance for BERT-ft and the two baselines (BERT-base, majority).
We perform minimal tuning of the hyperparameter (λ ∈ {0.1, 0.5, 1}), train for three epochs, and select the best model based on Consistent-Acc on the validation set. For efficiency reasons, we use the base version of BERT.

Improved Consistency Results
The results are presented in Table 7. We report aggregated results over the 25 test relations: as before, the macro average (mean over relations) and standard deviation. We report the results of the majority baseline (first row), BERT-base (second row), and our new model (BERT-ft, third row). First, we note that our model significantly improves consistency: 64.0%, compared with 58.2% for BERT-base, an increase of 5.8 points. Accuracy also improves over BERT-base, from 45.6% to 47.4%. Finally, and most importantly, we see an increase of 5.9 points in Consistent-Acc, which is achieved due to the improved consistency of the model. Notably, these improvements arise from training on merely three relations, meaning that the model improved its consistency ability and generalized to new relations. We measure the statistical significance of our method compared to the BERT baseline using McNemar's test (following Dror et al., 2018, 2020) and find all results to be significant (p < 0.01).

We also perform an ablation study to quantify the utility of the different components. First, we report on the finetuned model without the consistency loss (-consistency). Interestingly, it does improve over the baseline (BERT-base), but it lags behind our finetuned model. Second, applying our loss on the candidate set rather than on the entire vocabulary is beneficial (-typed). Finally, when the MLM training on the generated patterns is omitted (-MLM), the consistency results improve significantly (80.8%); however, this also hurts Accuracy and Consistent-Acc. MLM training thus seems to serve as a regularizer that prevents catastrophic forgetting.
Our ultimate goal is to improve consistency in PLMs for better performance on downstream tasks. Therefore, we also experiment with finetuning our consistency model on SQuAD (Rajpurkar et al., 2016) and evaluating it on paraphrased questions from SQuAD (Gan and Ng, 2019). However, its performance is on par with the baseline model, both on SQuAD and on the paraphrased questions. More research is required to show that consistent PLMs can also benefit downstream tasks.

Discussion
Consistency for Downstream Tasks The rise of PLMs has improved performance on many tasks, but has also raised expectations. The standard usage of these models is pretraining on a large corpus of unstructured text and then finetuning on a task of interest. The first step is thought of as providing a good language-understanding component, whereas the second teaches the format and nuances of the downstream task.
As discussed earlier, consistency is a crucial component of many NLP systems (Du et al., 2019; Asai and Hajishirzi, 2020; Denis and Baldridge, 2009; Kryscinski et al., 2020), and obtaining this ability from the pretrained model itself would be extremely beneficial, with the potential to make specialized consistency solutions in downstream tasks redundant. Indeed, there is an ongoing discussion about the ability to acquire "meaning" from raw text alone (Bender and Koller, 2020). Our new benchmark makes it possible to track the progress of consistency in pretrained models.
Broader Sense of Consistency In this work we focus on one type of consistency, namely consistency in the face of paraphrasing; however, consistency is a broader concept. For instance, previous work has studied the effect of negation on factual statements, which can also be seen as consistency (Ettinger, 2020). A consistent model is expected to return different answers to the prompts "Birds can [MASK]" and "Birds cannot [MASK]"; the inability to do so, as shown in these works, also reveals a lack of model consistency.

Usage of PLMs as KBs Our work follows the setup of Petroni et al. (2019) and Jiang et al. (2020), where PLMs are tested as KBs. While this is an interesting setup for probing models for knowledge and consistency, it lacks important properties of standard KBs: (1) the ability to return more than a single answer and (2) the ability to return no answer. Although some heuristics can allow a PLM to do so, e.g., using a threshold on the probabilities, this is not how the model was trained, and thus may not be optimal. Newer approaches that use PLMs as a starting point for more complex systems show promising results and address these problems (Thorne et al., 2020).
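The probability-threshold heuristic mentioned above could look like the following sketch; the function name, threshold value, and top-k cutoff are illustrative assumptions, not a recipe from the paper:

```python
import numpy as np

def query_as_kb(probs, id_to_token, threshold=0.1, top_k=3):
    """KB-style interface over a PLM's masked-token distribution: return
    every high-confidence candidate (possibly several answers), or an
    empty list when no prediction clears the threshold (i.e., no answer)."""
    order = np.argsort(probs)[::-1][:top_k]  # indices of top-k probabilities
    return [id_to_token[i] for i in order if probs[i] >= threshold]
```

As the surrounding text notes, nothing in MLM training calibrates these probabilities for such use, so the heuristic may be far from optimal.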
In another approach, Shin et al. (2020) suggest using AUTOPROMPT to automatically generate prompts, or patterns, instead of creating them manually. This approach outperforms manually written patterns (Petroni et al., 2019) and aggregations of automatically collected patterns (Jiang et al., 2020).

Brittleness of Neural Models
Our work also relates to the brittleness of neural networks. One example of this brittleness is vulnerability to adversarial attacks (Szegedy et al., 2014; Jia and Liang, 2017). Another, closer to the problem we explore in this work, is poor generalization to paraphrases. For example, Gan and Ng (2019) created a paraphrased version of a subset of SQuAD (Rajpurkar et al., 2016) and showed that model performance drops significantly on it. Ribeiro et al. (2018) proposed another method for creating paraphrases and performed a similar analysis for visual question answering and sentiment analysis. Recently, Ribeiro et al. (2020) proposed CHECKLIST, a system that tests a model's vulnerability to several linguistic perturbations.
PARAREL enables us to study the brittleness of PLMs and to separate facts that are robustly encoded in the model from mere 'guesses', which may arise from heuristics or spurious correlations with certain patterns (Poerner et al., 2020). We showed that PLMs are susceptible to small perturbations; since downstream training datasets are typically not large and do not contain semantically equivalent examples, finetuning alone is not likely to yield more consistent models.
Can we Expect LMs to be Consistent? The typical training procedure of an LM does not encourage consistency. Standard training merely minimizes the log-likelihood of an unseen token, and this objective is not always aligned with consistency of knowledge. Consider, for example, Wikipedia text as opposed to Reddit: their content and styles may be very different, and they may even describe contradictory facts. An LM can exploit the style of each text to best fit the probabilities assigned to an unseen word, even if the resulting generations contradict each other.
Since the pretrain-finetune procedure currently dominates our field, a great deal of the language capabilities learned during pretraining propagates to the finetuned models. As such, we believe it is important to measure and improve consistency already in the pretrained models.
Reasons Behind the (In)Consistency Since LMs are not trained to be consistent, what are the reasons behind their predictions when they are consistent, or inconsistent?
In this work, we presented the predictions for multiple queries and the representation space of one of the inspected models. However, this does not pinpoint the origins of such behavior. In future work, we aim to inspect this question more closely.

Conclusion
In this work, we study the consistency of PLMs with regard to their ability to extract knowledge. We build PARAREL, a high-quality resource containing 328 patterns for 38 relations. Using PARAREL, we measure consistency in multiple PLMs, including BERT, RoBERTa, and ALBERT, and show that although the latter two are superior to BERT on other tasks, they fall short in terms of consistency; moreover, the consistency of all these models is generally low. We release PARAREL, along with data tuples from T-REx, as a new benchmark for tracking the knowledge consistency of NLP models. Finally, we propose a simple new method that improves model consistency by continuing the pretraining with a novel loss. We show this method to be effective, improving both the consistency of models and their ability to extract the correct facts.