Abstract
Large language models have been shown to behave inconsistently in response to meaning-preserving paraphrastic inputs. At the same time, researchers evaluate the knowledge and reasoning abilities of these models with test evaluations that do not disaggregate the effect of paraphrastic variability on performance. We propose a metric, PC, for evaluating the paraphrastic consistency of natural language reasoning models based on the probability of a model achieving the same correctness on two paraphrases of the same problem. We mathematically connect this metric to the proportion of a model’s variance in correctness attributable to paraphrasing. To estimate PC, we collect ParaNlu, a dataset of 7,782 human-written and validated paraphrased reasoning problems constructed on top of existing benchmark datasets for defeasible and abductive natural language inference.1 Using ParaNlu, we measure the paraphrastic consistency of several model classes and show that consistency dramatically increases with pretraining but not fine-tuning. All models tested exhibited room for improvement in paraphrastic consistency.
1 Introduction
The NLP community has transitioned away from “deeper” abstract semantic representations (e.g., FrameNet (Baker et al., 1998)) towards “shallower” representations (e.g., Universal Dependencies (Nivre et al., 2016)) which retain attributes of their original surface form. The culmination of this trend is to use natural language as a semantic representation itself for evaluating a model’s reasoning ability. This has enabled rapid advancement across a host of tasks including NLI (Bowman et al., 2015) and QA (Rajpurkar et al., 2016), with the latest generation of large language models saturating many benchmark natural language understanding datasets. However, natural language as a meaning representation is highly ambiguous (Schubert, 2015). While versatile and compact, it leaves open the possibility that systems are not robust to different ways of expressing the same meaning in natural language.
Benchmark evaluation datasets such as SNLI (Bowman et al., 2015) consist of a collection of reasoning problems designed to probe particular aspects of commonsense knowledge, with each example represented by a singular linguistic expression. When a system gets a particular example correct, it is only evidence that it was able to correctly reason for the particular phrasing used in the example, allowing for the possibility of systems that can correctly solve one form of a reasoning problem, but not others. Conversely, if a model gets a question wrong, how can we tell if the error was due to a failure in language understanding or a failure in reasoning?
Consider the defeasible reasoning example in Figure 1. A language model finetuned on the δ-NLI dataset (Rudinger et al., 2020) may correctly predict that the original update sentence strengthens a human’s belief in the hypothesis sentence. However, different linguistic expressions of that same update sentence may yield high variance in a model’s predictions. If models stay consistent in the face of paraphrastic variability, we may conclude that correctly reasoning about one expression is indicative of an understanding of that reasoning problem, a desirable property of teaching machines to reason entirely in natural language.
We explore the sensitivity of natural language reasoning models to paraphrasing as a way to better characterize their knowledge and reasoning ability, contextualize their performance on evaluation sets, and evaluate room for improvement on the basis of consistency. Under the assumption that world and linguistic knowledge are separable, our study attempts to disentangle the two by generating examples that hold the required world knowledge of a reasoning problem constant while modifying its surface form, a problem formulation with connections to causality (Stolfo et al., 2023) and counterfactual invariance (Veitch et al., 2021; Kaushik et al., 2020).
To study this, we build on top of two NLU datasets—Abductive NLI (Bhagavatula et al., 2019) and Defeasible NLI (Rudinger et al., 2020)—by collecting paraphrases of reasoning problems using label-preserving paraphrasing, a functional relaxation of traditional paraphrasing that preserves the semantics of the core reasoning problem. Our dataset, ParaNlu, contains 7,782 human-elicited and 7,295 model-elicited paraphrases across 1,000 reasoning problems spanning both datasets. We select diverse examples to paraphrase, ranging in difficulty (Sakaguchi et al., 2021) and model confidence. Our dataset is entirely manually validated, ensuring semantic equivalence while maximizing paraphrase diversity.
We measure paraphrastic consistency (PC), or the likelihood of a model's prediction remaining consistent under different phrasings, in order to understand the types of surface-form changes that models are sensitive to. We study the relationship between consistency and various data conditions and modeling paradigms, exploring factors such as data source, example difficulty, model complexity, and training dynamics. We find that models still have room to improve their paraphrastic consistency at their given accuracy levels. Since no model demonstrates both high accuracy and high paraphrastic consistency, we conclude that attempts to measure their reasoning abilities will be confounded by inconsistencies in their linguistic abilities.
2 Paraphrastic Consistency
In principle, natural language reasoning tasks like abductive NLI and defeasible NLI require the ability to linguistically decode the meaning of the text that expresses an underlying problem, as well as the knowledge and reasoning capabilities to solve the underlying problem.
By analogy, consider evaluating a child’s understanding of the concept of addition. Instead of simply presenting them with a mathematical expression (say, 7 + 7), we write a word problem that can be answered by (1) understanding the situation in natural language, (2) recognizing that the answer corresponds to the mathematical reasoning problem 7 + 7, and finally, (3) solving 7 + 7. If the child answers incorrectly, we must figure out whether they did not understand the goal of the word problem or were not able to perform the arithmetic in order to evaluate their mathematical reasoning ability.
For models tasked with natural language reasoning problems, teasing apart these two failure modes (namely, deficiencies in language understanding versus deficiencies in knowledge or reasoning) requires more than reporting test set accuracy. The design of natural language reasoning test sets does not facilitate this type of analysis: If a test set contains 100 different natural language reasoning problems, and a model correctly answers 80% of them, which failure mode should we attribute the 20% of errors to?
For practitioners, it is useful to characterize performance by measuring paraphrastic consistency alongside accuracy: How likely is it that a model’s prediction for a natural language reasoning problem will remain the same given a different phrasing of the problem? We collect a dataset that changes the language of NLI examples while maintaining the underlying logic of the problem to tease the two apart. For a test example, x, we collect a set of paraphrases {x1′, x2′,...}, which we call a bucket. After collecting many such buckets, we can directly estimate the probability that a model’s prediction for any two paraphrases belonging to the same bucket will be the same.
2.1 Measuring Paraphrastic Consistency
When authoring a test example for a natural language reasoning task, a crowdworker has many linguistic choices for expressing the underlying reasoning problem. If the purpose of the resulting test set is to evaluate a model’s ability to perform the underlying reasoning task, then ideally the crowdworker’s choice of phrasing would have no effect on the model’s performance. In practice, however, it is known that language models exhibit some degree of sensitivity to paraphrastic variation (Verma et al., 2023; Jiang et al., 2021). To quantify this effect, we pose the following counterfactual question: If a given test question had been written differently, what is the probability that a model would still receive the same credit for its prediction on the paraphrased question?
Quantitatively, we introduce a metric of paraphrastic consistency, PC, defined as the probability that a model’s predictions for two paraphrases of the same problem, xi′ and xj′, are either both correct or both incorrect (provided the ground truth labels of xi′ and xj′ match). To formalize PC, we define the following terms:
R: a binary random variable indicating whether a model’s prediction for a paraphrased reasoning problem, x′, is correct (1) or incorrect (0).
θ: E[R] within a bucket, i.e., the average correctness (accuracy) over the paraphrased problems (x1′, x2′,...) in that bucket.
A: the overall accuracy of model M over a set of natural language reasoning problems. This is equivalent to E[θ] taken across all buckets.
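With these definitions, PC can be written as an expectation over buckets. The formulation below is a sketch consistent with the definitions of R, θ, and A above (the paper's Equation 2 may weight finite buckets differently), and makes explicit that PC depends only on the per-bucket accuracies:

```latex
% Probability that two paraphrases drawn from the same bucket receive the
% same score (both correct or both incorrect), given bucket accuracy \theta.
\[
P_C \;=\; \mathbb{E}\big[\,\theta^2 + (1-\theta)^2\,\big]
     \;=\; 1 \,-\, 2\,\mathbb{E}\big[\,\theta\,(1-\theta)\,\big],
\]
% where the expectation is taken over buckets of paraphrased problems.
```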
Figure 2 illustrates the computation of PC for different patterns of model behavior. If PC = 1, the model is either entirely correct or entirely incorrect on all paraphrases of the same problem, as in the left-most panel of Figure 2 where each bucket contains only 1’s or only 0’s. In this case, no errors can be attributed to paraphrastic variance. The minimum PC occurs when every paraphrase bucket has the same accuracy, as shown in the middle panel; in this case all errors are likely due to paraphrastic variability. In practice, PC lies between these two extremes and some, but not all, errors are due to paraphrasing, as in the right-most panel. Modern NLP evaluation sets usually consist of a collection of independent reasoning problems, each represented by a singular natural language expression, and practitioners often make claims about reasoning capabilities or world knowledge given a model’s accuracy on such evaluation sets. However, as depicted in Figure 2, accuracy presents an incomplete picture of performance: In all three scenarios, the overall accuracy remains 80%, but only the first scenario, in which the model makes equivalent predictions given many alternate phrasings of a reasoning problem, results in a high PC.
PC in Practice.
PC can be interpreted as the probability that, given two phrasings of the same reasoning problem, the model’s two predictions are either both correct or both incorrect. This summary metric allows us to capture the reliability of a model’s prediction: How confident can we be that a model’s prediction would remain correct (or incorrect) if the problem had been phrased differently? While PC is lower-bounded by a function of accuracy, we design it as a metric complementary to accuracy in order to better characterize model performance, diagnose modeling errors, and benchmark linguistic reasoning capabilities. For example, when two models achieve similar accuracies, computing their respective paraphrastic consistencies may serve as an additional point of comparison. Just as the desired model accuracy may depend on the application in which a model is deployed, different settings may call for different PC values, and practitioners must decide on tolerable thresholds based on their use case.
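As a concrete illustration, the sketch below computes a plug-in estimate of PC from buckets of correctness indicators. The three scenarios mirror the patterns discussed around Figure 2 but use made-up data, and the unweighted averaging over buckets is a simplification of the paper's estimator:

```python
def paraphrastic_consistency(buckets):
    """Plug-in estimate of PC: within each bucket, the probability that two
    paraphrases receive the same score is theta^2 + (1 - theta)^2, where theta
    is the bucket's observed accuracy; PC averages this quantity over buckets."""
    vals = []
    for bucket in buckets:
        theta = sum(bucket) / len(bucket)
        vals.append(theta ** 2 + (1 - theta) ** 2)
    return sum(vals) / len(vals)

# Three hypothetical scenarios, each with 80% overall accuracy (cf. Figure 2):
all_or_nothing = [[1] * 5] * 4 + [[0] * 5]                            # PC = 1.0
uniform        = [[1, 1, 1, 1, 0]] * 5                                # PC = 0.68 (minimum for A = 0.8)
mixed          = [[1] * 5] * 3 + [[1, 1, 1, 0, 0], [1, 1, 0, 0, 0]]   # PC ≈ 0.81

for name, b in [("all-or-nothing", all_or_nothing), ("uniform", uniform), ("mixed", mixed)]:
    accuracy = sum(map(sum, b)) / sum(map(len, b))
    print(f"{name}: accuracy={accuracy:.2f}, PC={paraphrastic_consistency(b):.2f}")
```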
2.2 Proportion of Variance Attributable to Paraphrasing
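The link between PC and variance can be reconstructed from the law of total variance. Since R is a Bernoulli variable with mean A, its total variance A(1 − A) decomposes into a within-bucket component (variance in correctness due to paraphrasing) and a between-bucket component (variance due to the underlying problems). The identities below are a sketch consistent with the definitions in §2.1 and with the Min PC curve described in §5.2; the paper's Equations 3–7 may present them differently:

```latex
% Law of total variance over buckets (R is Bernoulli with mean A; \theta is bucket accuracy):
\[
\underbrace{A(1-A)}_{\mathrm{Var}(R)}
  \;=\; \underbrace{\mathbb{E}\big[\theta(1-\theta)\big]}_{\text{within buckets (paraphrasing)}}
  \;+\; \underbrace{\mathrm{Var}(\theta)}_{\text{between buckets}} .
\]
% Combining this with P_C = 1 - 2\,\mathbb{E}[\theta(1-\theta)] gives the proportion of
% variance attributable to paraphrasing (PVAP) and an accuracy-dependent lower bound on P_C:
\[
\mathrm{PVAP} \;=\; \frac{\mathbb{E}\big[\theta(1-\theta)\big]}{A(1-A)}
            \;=\; \frac{1 - P_C}{2\,A(1-A)},
\qquad
P_C \;\ge\; 1 - 2\,A(1-A).
\]
```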
Model Confidence and PC.
We choose to characterize paraphrastic consistency via variance in correctness instead of variance in model confidence. Confidence represents a model’s overall belief in the correct answer, and thus conflates confidence in linguistic understanding with confidence in problem solving. A low confidence in the correct label may indicate that the model understood what the problem was asking but was unable to reach the answer (or vice versa). Models trained to optimize for accuracy are not calibrated to explicitly encode confidence in linguistic decoding or problem-solving ability.
3 Reasoning Tasks
3.1 Defeasible Reasoning
Defeasible reasoning is a mode of reasoning in which inferences or conclusions may be altered or withdrawn in light of new evidence (Reiter, 1980). For example, given the context “A group of people perform on a stage”, the natural conclusion “A group performs for an audience” may be weakened upon learning that the group is at a rehearsal.
To study defeasible reasoning in language models, Rudinger et al. (2020) introduce the task of defeasible natural language inference (NLI). Traditionally, NLI involves determining whether a premise P entails, contradicts, or is neutral in relation to a hypothesis H (Giampiccolo et al., 2007). When the premise P and hypothesis H are neutral in relation to one another, defeasible NLI studies whether a third update sentence (U) strengthens or weakens H. Namely, a human may determine H more likely to be true when U is a strengthener, and less likely to be true when U is a weakener.
Rudinger et al. (2020) also introduce δ-NLI, a dataset that extends three existing natural language datasets: SNLI (Bowman et al., 2015), SOCIAL-CHEM-101 (Forbes et al., 2020), and ATOMIC (Sap et al., 2019). For each premise-hypothesis pair (or just hypothesis, in the case of δ-SOCIAL), crowdworkers write multiple strengthening and weakening updates, ensuring a balance between the two. Defeasible NLI is a binary classification task that involves predicting whether the update sentence is a strengthener or a weakener (e.g., the original update in Row 5 of Table 1 is a weakener).
Adopting the terminology from Srikanth and Rudinger (2022), we distinguish between the context and target portions of examples; our study of consistency concerns the target portions. For δ-NLI, we consider the premise P and hypothesis H as context sentences, and the update sentence U as the target (Table 1).
3.2 Abductive Reasoning
Abduction is inference to the most likely explanation (Peirce, 1974). Such inferences are hypotheses that can best fit one or more incomplete observations.
Given the example above, any human would most likely infer the second hypothesis over the first to explain the two observations. Bhagavatula et al. (2019) introduce abductive NLI (α-NLI), an abductive reasoning task formulated as binary classification. α-NLI examples consist of two observations O1 and O2 (where O2 occurs at some point in time after O1), a plausible hypothesis h+, and an implausible hypothesis h−. Given both observations, the task is to determine which hypothesis is more plausible. We treat O1 and O2 as the context (C) and h+ and h− as the target (T) portion of the example.
4 Constructing ParaNlu
We study paraphrastic consistency by paraphrasing the target portions (T) of examples from α-NLI and all three data sources in δ-NLI (δ-SNLI, δ-ATOMIC, and δ-SOCIAL). For an original example x, we collect a bucket of paraphrased examples xi such that the context portions C and gold label l remain identical, while the target portions T are rewritten as label-preserving paraphrases. This allows us to modify the surface form of the example while retaining the underlying commonsense reasoning problem.
Label-preserving Paraphrases.
We construct quasi-paraphrases (Bhagat and Hovy, 2013) of target sentences in reasoning problems by loosening the requirement of semantic equivalence. Given a natural language reasoning task L and a textual instance x of task L with label ℓL(x), we say that x′ is a label-preserving paraphrase of x if:
1. ℓL(x) = ℓL(x′);
2. x′ does not contradict any context (C) in x; and
3. x′ remains consistent with the situation evoked by x.
Label-preserving paraphrases are functionally equivalent: Target sentences (hypotheses in α-NLI and updates in δ-NLI) may introduce small bits of information as long as the same scenario is plausibly described. Consider the following δ-NLI example: P: “A man stands in front of a cashier and kiosk at a grocery store”; H: “He is smiling”; U: “The man got a discount.” While the sentence “The man saved 10% with a coupon” is not semantically equivalent to the strengthening update sentence, it is a valid, label-preserving paraphrase since it retains the logic of the problem and the label. Table 1 shows examples of admissible and inadmissible paraphrases under our definition. Label-preserving paraphrases represent alternative, but equivalent, expressions annotators could have chosen when writing the original problem that employ similar world knowledge. This is different from the label-altering edits proposed by Gardner et al. (2020), where minimal human edits that shift the target label were used to create examples for measuring linguistic robustness.
4.1 Original Example Selection
We adopt a stratified sampling strategy to obtain diverse examples for annotation that vary in difficulty. To obtain such examples, we leverage AFLite (Sakaguchi et al., 2021), an adversarial filtering (Zellers et al., 2018) algorithm designed to partition datasets based on difficulty using pre-computed dense embeddings of examples fed into an ensemble of n logistic regression classifiers. At each iteration of AFLite, members of the ensemble are trained on random partitions of m examples and evaluated on the remaining validation examples. Each validation example is assigned a score, computed as the proportion of ensemble members that predict it correctly. The top-k examples with scores above a threshold τ are subsequently added to the easy partition of the dataset and filtered out, and the process repeats until fewer than k instances are removed in a particular iteration. Including both types of examples, easy and hard, ensures that ParaNlu can support analysis that investigates whether models are inconsistent only on certain classes of examples (e.g., those filtered out due to lexical artifacts).
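The sketch below is a simplified rendering of this filtering loop using scikit-learn logistic regressions over precomputed embeddings. The hyperparameter names (n, m, tau, k) mirror the description above, but details such as how many times each example is scored follow Sakaguchi et al. (2021) only loosely:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def aflite(X, y, n=64, m=5000, tau=0.75, k=500, seed=0):
    """Simplified AFLite: score examples by how often an ensemble of linear
    probes predicts them correctly, then filter out the most predictable
    ("easy") ones. Returns (easy_indices, hard_indices) into X and y."""
    rng = np.random.default_rng(seed)
    remaining = np.arange(len(y))
    easy = []
    while True:
        hits = np.zeros(len(remaining))
        counts = np.zeros(len(remaining))
        for _ in range(n):
            # Train each ensemble member on a random partition of m examples...
            train = rng.choice(len(remaining), size=min(m, len(remaining) // 2), replace=False)
            val = np.setdiff1d(np.arange(len(remaining)), train)
            clf = LogisticRegression(max_iter=1000)
            clf.fit(X[remaining[train]], y[remaining[train]])
            # ...and score the held-out examples it did not see.
            hits[val] += clf.predict(X[remaining[val]]) == y[remaining[val]]
            counts[val] += 1
        scores = np.divide(hits, counts, out=np.zeros_like(hits), where=counts > 0)
        # Remove (at most) the top-k most predictable examples above the threshold tau.
        above = np.where(scores > tau)[0]
        to_remove = above[np.argsort(-scores[above])][:k]
        easy.extend(remaining[to_remove].tolist())
        remaining = np.delete(remaining, to_remove)
        if len(to_remove) < k:
            break
    return np.array(easy, dtype=int), remaining
```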
Defeasible NLI.
We use the train splits of δ-SNLI, δ-ATOMIC, and δ-SOCIAL to source examples for annotation, since they are large enough to meaningfully partition using AFLite. We partition each train set into 3 sections: (1) examples to finetune the RoBERTa-base models used to embed examples (RoBERTaembed), (2) training examples for models used for consistency analysis (RoBERTaanalysis), and (3) a pool from which examples are sampled for annotation. We partition at the premise-hypothesis (P-H) level to avoid leakage, since multiple examples may contain the same P-H pair, but different updates (U).
For each dataset, we pre-compute example embeddings with the RoBERTaembed model, and run AFLite with n = 64 linear classifiers, τ = 0.75, k = 500, and m = 5000. Examples in the annotation pool with scores > 0.75 are added to the easy subset, and those with scores ≤ 0.75 are labeled as difficult. We then finetune a separate RoBERTa-large model (RoBERTaanalysis) on partition (2) and use it to obtain predictions on examples in the annotation pool. Based on the RoBERTaanalysis model’s confidence in the gold label, we sample 125 examples from the easy subset, as determined by AFLite, in a round-robin fashion for each confidence decile between 0 and 1. We repeat this to collect 125 examples from the difficult subset.
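The decile-based, round-robin sampling can be sketched as follows; the `confidence` column and the tie-breaking within deciles (random shuffling) are assumptions, while the 125-example budget and decile binning mirror the description above:

```python
import pandas as pd

def sample_by_confidence_decile(pool: pd.DataFrame, n_total: int = 125, seed: int = 0) -> pd.DataFrame:
    """Round-robin sample across deciles of model confidence in the gold label.

    pool: candidate examples with a 'confidence' column in [0, 1]."""
    pool = pool.copy()
    pool["decile"] = pd.cut(pool["confidence"], bins=[i / 10 for i in range(11)],
                            labels=False, include_lowest=True)
    # Shuffle within each decile, then draw one example per decile in turn.
    queues = [g.sample(frac=1, random_state=seed) for _, g in pool.groupby("decile")]
    chosen, i = [], 0
    while len(chosen) < n_total and any(len(q) for q in queues):
        q = queues[i % len(queues)]
        if len(q):
            chosen.append(q.iloc[0])
            queues[i % len(queues)] = q.iloc[1:]
        i += 1
    return pd.DataFrame(chosen)
```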
Abductive NLI.
Since the publicly released α-NLI dataset only contains examples that survived adversarial filtering (“difficult” examples), we reached out to the authors to obtain the easy examples that were filtered out. We source our original examples from the test split of α-NLI. We train a RoBERTa-large model on examples from the train split of α-NLI and follow the same stratified sampling protocol to select 125 examples from each of the easy and difficult subsets, according to the confidence of the RoBERTaanalysis model in the correct label.
4.2 Paraphrased Example Collection
We obtain 250 examples per dataset (α-NLI, δ-SNLI, δ-ATOMIC, δ-SOCIAL), resulting in 1,000 examples for which we collect paraphrases.
Crowdsourcing.
We use Amazon Mechanical Turk to collect paraphrases of the target portions of each example. Workers are shown context sentences and must write a paraphrase of the target sentence(s) according to the definition of label-preserving paraphrasing presented earlier.
We abstract the underlying reasoning task away, presenting α-NLI examples as short stories that require paraphrasing middle sentences, and δ-NLI examples as scenarios with weakening or strengthening evidence. In order to encourage diversity of paraphrases, we display the Jaccard similarity between tokens in the original sentence and the paraphrase as workers type. Figure 7 shows our annotation interface and instructions for collecting paraphrases of α-NLI and δ-NLI problems respectively.
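The similarity shown in the widget can be computed as below; whitespace tokenization and lowercasing are assumptions, since the interface's exact tokenizer is not specified:

```python
def jaccard_similarity(original: str, paraphrase: str) -> float:
    """Jaccard similarity between the token sets of the original sentence and
    the paraphrase; lower values indicate more lexical diversity."""
    a, b = set(original.lower().split()), set(paraphrase.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 0.0

print(jaccard_similarity("The man got a discount.",
                         "The man saved 10% with a coupon."))  # ~0.33
```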
Workers provide 3 paraphrases of both plausible and implausible hypotheses for α-NLI examples and 3 paraphrases of updates for δ-NLI examples. In the case of α-NLI, we randomly pair together plausible and implausible hypotheses written by the same worker to construct paraphrased examples. Each example was annotated by 3 workers. See Appendix A for more details.
Paraphrase Example Validation.
Ensuring semantic equivalence between paraphrased examples and original reasoning problems is essential to our study. Inadvertent removal of the crux of the reasoning problem while paraphrasing may result in invalid examples. Table 3 includes an α-NLI example with paraphrases that were both accepted and rejected from crowdworkers based on our definition of label-preserving paraphrasing. The first and third accepted paraphrases both introduce new pieces of information (“mantelpiece”, “mounted it on the wall”) but do not violate the situation evoked by the original problem. In contrast, the first rejected paraphrase is incompatible with the situation and the second rejected paraphrase does not retain the plausibility of the hypothesis.
We opt to have an author validate all paraphrases, each within the context of the problem.3 A second author and two external annotators annotated a sample of 100 paraphrases, again labeling each as either valid or invalid. We obtain a Fleiss’s Kappa (Fleiss, 1971) value of κ = 0.81 between all validators—the two authors and two external validators. This measures agreement on the criterion of label-preservation, reflecting whether a paraphrase written by a crowdworker was of high enough quality to admit into our dataset, as opposed to a measurement of agreement on the correct label given paraphrased examples.
Dataset Overview.
Our resulting dataset, ParaNlu, contains 1,000 examples uniformly split across α-NLI, δ-SNLI, δ-ATOMIC, and δ-SOCIAL. Table 2 shows the total number of post-validation paraphrased examples per data split, and the statistics of sizes of buckets.
|  | # original | # paraphrases | mean # paraphrases/ex |
|---|---|---|---|
| α-NLI | 250 | 2098 | 8.4 ± 1.2 |
| δ-SNLI | 250 | 1980 | 7.9 ± 1.4 |
| δ-ATOMIC | 250 | 1869 | 7.5 ± 1.6 |
| δ-SOCIAL | 250 | 1835 | 7.3 ± 1.8 |
5 Consistency on Human Paraphrases
We first examine several different models’ behavior on ParaNlu to measure robustness to different linguistic expressions. While language models such as RoBERTa are trained on vast amounts of text that may instill some paraphrastic consistency, especially given label-preserving paraphrases that may not be semantically equivalent, other non-pretrained models without access to such knowledge may falter. We characterize the progress of models with respect to PC to understand whether factors such as training setup (training from scratch, supervised finetuning, prompting) and model complexity (ranging from bag-of-words representations to GPT-3) affect consistency.
5.1 Model Variants
We train 5 different types of models per data source. For all models, we use the same set of examples that were used to finetune the RoBERTaanalysis models introduced in §4.1.
Bag of Words.
We train bag-of-words models (BoW) using fasttext, an off-the-shelf text classification library, with a maximum of 4-grams (Joulin et al., 2017) for 5 epochs with the default learning rate of 0.1.
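A sketch of this training call using the fastText Python bindings; the input file path and the serialization of each example into fastText's `__label__<class> <text>` line format are assumptions:

```python
import fasttext

# Each line of train.txt (hypothetical path), e.g.:
# "__label__strengthener <premise> <hypothesis> <update>"
model = fasttext.train_supervised(
    input="train.txt",
    wordNgrams=4,   # up to 4-grams, as in Joulin et al. (2017)
    epoch=5,
    lr=0.1,         # fastText's default learning rate
)
labels, probs = model.predict("a man stands in front of a cashier ... the man got a discount")
```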
BiLSTM.
We train end-to-end BiLSTM models using the architecture from Conneau et al. (2017) and initialize them with GloVe embeddings (Pennington et al., 2014). We use 3 fully connected layers for classification with max pooling. After tuning on the development sets, models are trained for 10 epochs with early stopping and a batch size of 64.
RoBERTa.
We use the RoBERTaanalysis models in §4.1, and add one more setting for defeasible examples in which we finetune a RoBERTa-large model on all combined data across the 3 data sources in δ-NLI, which we refer to as a unified RoBERTa model. All RoBERTa-large models were finetuned for 2 epochs with a learning rate of 2e-5 and a batch size of 32.
DeBERTa.
We finetune DeBERTa-v3-large (He et al., 2022) for 2 epochs with a learning rate of 5e-6 and a batch size of 16.
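Both finetuning setups follow a standard Hugging Face sequence-classification recipe; the sketch below uses the stated hyperparameters, while the toy dataset, column names, and input serialization are assumptions:

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "microsoft/deberta-v3-large"   # or "roberta-large" for the RoBERTa runs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Toy stand-in for a delta-NLI split; real runs use the partitions from Section 4.1.
train = Dataset.from_dict({
    "context": ["A man stands in front of a cashier at a grocery store. He is smiling."],
    "update": ["The man got a discount."],
    "label": [1],   # 1 = strengthener, 0 = weakener (label encoding is an assumption)
})

def encode(batch):
    # Assumed serialization: context (premise + hypothesis) paired with the update.
    return tokenizer(batch["context"], batch["update"], truncation=True)

args = TrainingArguments(
    output_dir="dnli-deberta",
    num_train_epochs=2,
    learning_rate=5e-6,                # the RoBERTa-large runs used 2e-5 with batch size 32
    per_device_train_batch_size=16,
)
Trainer(model=model, args=args, tokenizer=tokenizer,
        train_dataset=train.map(encode, batched=True)).train()
```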
GPT-3.
Lastly, we experiment with prompting GPT-3 (Brown et al., 2020) using text-curie-001 (prompts in Table 5). For α-NLI, we randomly sample 36 examples from the training set and include instructions derived from those shown to the crowdworkers that annotated the α-NLI dataset. For δ-NLI, we randomly sample 12 examples per dataset (36 in-context examples total) and include the task definition from Rudinger et al. (2020) in the prompt. Since we cannot reliably extract a softmax distribution over binary classes for our tasks (GPT-3 is not a classification model), we calculate model confidence in a particular class by extracting the log probabilities associated with the tokens for both labels and normalizing them.
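The normalization step amounts to a two-way softmax over the log probabilities associated with the two label strings; a minimal sketch of this renormalization, where the label names and numeric values are purely illustrative:

```python
import math

def normalized_label_confidence(logprob_a: float, logprob_b: float):
    """Renormalize the log probabilities of two label tokens into a
    probability distribution over the two classes."""
    pa, pb = math.exp(logprob_a), math.exp(logprob_b)
    return pa / (pa + pb), pb / (pa + pb)

# e.g., log p("weakener") = -0.4 and log p("strengthener") = -1.6, read from the
# completion's logprobs field
print(normalized_label_confidence(-0.4, -1.6))  # ≈ (0.77, 0.23)
```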
5.2 Results
For all models, we compute PC (Equation 2). In addition, we undo the biasing effects of the stratified sampling according to model confidence (§4.1) and report a corrected version of paraphrastic consistency, P̂C, by weighting the expectations in Equation 2 according to the distribution of model confidences in the correct label on the corresponding test set. We compute four accuracy metrics: (1) accuracy on the original examples in ParaNlu (AO), (2) accuracy on the test set of the original dataset (AT), (3) accuracy on all paraphrases across all buckets (A), and (4) corrected accuracy on all paraphrases across all buckets (Â), which is weighted in the same manner as P̂C to undo stratification effects.
Table 4 shows these accuracy metrics along with PC and P̂C for all models across all data sources. We highlight Â and P̂C, as they are complementary metrics meant to jointly assess model performance. Some models optimize Â at the cost of P̂C, or vice versa. To capture models that balance both, we highlight (Â, P̂C) points that are Pareto-optimal for each dataset. The highest performing model according to Â, RoBERTa, earns a P̂C of around 0.9, indicating room to improve its paraphrastic consistency for its accuracy level. We observe that a GPT-3-curie model with minimal prompt engineering (we simply use the definition of defeasible inference directly from Rudinger et al. (2020)) along with a handful of in-context examples has a P̂C value competitive with a RoBERTa model finetuned on thousands of examples. A stronger GPT-3 variant may better perform defeasible reasoning while maintaining a competitive PC.
Figure 3 visualizes the relationship between accuracy and P̂C for models on δ-SNLI examples (Figure 8 shows similar plots for δ-ATOMIC and δ-SOCIAL examples). For each model, we plot Â on the x-axis and P̂C on the y-axis, along with two types of supporting curves for the δ-SNLI split of ParaNlu. The curve with the lowest minima (labeled Min PC) indicates the theoretical lower bound for PC given a particular accuracy level: if all the variance in model correctness (Equation 3) is attributable to the paraphrasing variance term in Equation 7, then the minimum possible value for PC is 1 − 2 · Acc · (1 − Acc), where Acc · (1 − Acc) is the variance of a Bernoulli random variable with probability Acc. In addition to this theoretical lower bound, we plot curves indicating the proportion of total variance attributed to paraphrasing (labeled %PVAP). As PC increases, less variance is attributed to within-bucket variance due to phrasing. Visualizing this relationship makes it clear that accuracy alone provides an incomplete picture of model performance. A perfect model would reside in the top right of Figure 3, achieving not only high accuracy but also high paraphrastic consistency. Models with similar accuracies may have largely different P̂C values, indicating to practitioners how sensitive they are to problem phrasing.
We now turn to a series of experiments to better characterize the paraphrases in ParaNlu, as well as the factors contributing to a model’s PC value, using our best performing model, RoBERTa.
6 Paraphrase Source
Human-written paraphrases in ParaNlu span all of the transformations delineated by Bhagat and Hovy (2013), sometimes involving more complex reasoning that falls between linguistic and world knowledge. Such paraphrases were elicited by providing humans the entire example, encouraging them to engage with both the reasoning problem itself and the wide scope of possible meaning-preserving transformations. To understand the utility of label-preserving paraphrases, we compare a RoBERTa model’s behavior on our human-written paraphrases with paraphrases generated automatically, as previous studies have explored (Verma et al., 2023).
Using our RoBERTa models (§5.1), we probe the relationship between paraphrastic consistency (P̂C) and the source of paraphrased examples. Are models more robust to the paraphrastic transformations produced by automatic paraphrase generation models, or do they only struggle with the more complex, example-aware transformations made by humans?
Since human paraphrases and model paraphrases are generated by different processes, and are thus drawn from different distributions, they may exhibit different properties. Paraphrase generation models are predisposed to biases arising from n-gram frequency effects.4 However, reasoning models should exhibit consistency regardless of whether correct answers are phrased as high-probability sentences under a language model.
We use two models (§6.2) to automatically paraphrase target sentences in original examples and compare model PC on automatic and human paraphrases.
6.1 Experimental Setup
For each original example in ParaNlu, we sample paraphrases of targets from generation models. As with humans, we elicit paraphrases of target sentences: update sentences for δ-NLI and both hypothesis sentences for α-NLI. In contrast to the human elicitation process, however, we do not provide any context sentences to generation models. While this limits the scope of possible paraphrases, it allows us to gauge the value of exposure to context during paraphrasing.
We adopt a generate-then-validate scheme and have an author again validate all target paraphrases to ensure their consistency with our definition of label-preserving paraphrasing. In the case of α-NLI examples, where there are multiple target sentences, we randomly pair valid paraphrases together, resampling where necessary when the numbers of valid generated hypotheses are unequal.
6.2 Paraphrase Generation Models
Quality-Controlled Paraphrase Generation.
We use a QCPG model (Bandel et al., 2022), a controllable paraphrase generation system that conditions on a 3-dimensional vector encoding semantic similarity and lexical and syntactic distances. We pool all paraphrases from a per-example sweep of these hyperparameters.
GPT-3.
In addition to a supervised, explicitly controllable paraphrase generation model, we elicit paraphrases from GPT-3 using 10 in-context examples of paraphrases randomly sampled from ParaNlu. Setting temperature to 0.7, we sample 9 paraphrases from text-davinci-002 per target sentence from original examples and similarly validate them to ensure label preservation.
6.3 Results
In total, we generate 7,295 valid paraphrased examples across 1,000 examples by pooling together all valid examples from both QCPG and GPT-3, and evaluate our RoBERTa models on these examples. Figure 4 plots Â on the x-axis and P̂C on the y-axis for human-generated and automatically generated paraphrased examples for each dataset. On all datasets with the exception of δ-SOCIAL, we observe that models have a higher P̂C value on automatically generated paraphrased examples than on human-elicited paraphrases. We hypothesize that this pattern may not hold for δ-SOCIAL because the dataset does not contain premise sentences, and hence has a smaller scope for more complex transformations involving context. This result suggests that reasoning models may be more robust to the simpler, in-distribution types of paraphrase transformations that automatic paraphrase generation models produce than to those written by human annotators, indicating that over-reliance on evaluation using synthetically generated data may be misleading.
Paraphrase Diversity.
To dissect this result further, we measure the lexical diversity, syntactic diversity, and semantic similarity (Bandel et al., 2022) of target paraphrases relative to the original target sentences. Lexical distance is measured by the normalized character-level minimal edit distance between the bags of words (Bandel et al., 2022), and syntactic distance is computed as the normalized tree edit distance between the third-level constituency parse trees of the original target and the paraphrased target (Iyyer et al., 2018). We measure semantic similarity using BLEURT (Sellam et al., 2020), as in Bandel et al. (2022). Across all four data splits in ParaNlu, human-elicited paraphrases are more lexically and syntactically diverse, as well as less semantically similar to original examples, than automatically generated paraphrases (Table 6). In addition, we find that automatically generated paraphrases are 3–4% more likely to be bidirectionally entailed than human-written paraphrases, as detected by a RoBERTa-large model finetuned on SNLI (Bowman et al., 2015), MNLI (Williams et al., 2018), ANLI (Nie et al., 2020), and FEVER (Thorne et al., 2018).
|  | source | lex | syn | sem |
|---|---|---|---|---|
| α-NLI | automatic | 25.0 | 20.3 | 74.3 |
| α-NLI | human | 35.3 | 26.8 | 64.0 |
| δ-SNLI | automatic | 24.1 | 18.5 | 76.3 |
| δ-SNLI | human | 34.4 | 24.6 | 67.4 |
| δ-ATOMIC | automatic | 30.4 | 19.2 | 66.7 |
| δ-ATOMIC | human | 36.9 | 22.4 | 58.7 |
| δ-SOCIAL | automatic | 30.8 | 22.2 | 70.1 |
| δ-SOCIAL | human | 40.0 | 24.5 | 60.1 |
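As a rough illustration of the lexical-distance metric used in Table 6, the sketch below implements one plausible reading of the Bandel et al. (2022) definition (character-level edit distance between sorted bags of words, normalized to a 0–100 scale); the exact tokenization and normalization may differ:

```python
def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def lexical_distance(original: str, paraphrase: str) -> float:
    """Normalized edit distance between the sorted bags of words, scaled to 0-100."""
    a = " ".join(sorted(original.lower().split()))
    b = " ".join(sorted(paraphrase.lower().split()))
    return 100 * edit_distance(a, b) / max(len(a), len(b))

print(lexical_distance("The man got a discount.",
                       "The man saved 10% with a coupon."))
```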
Taken together, these results underscore the benefit of our human annotation to generate ParaNlu—evaluation solely on automatically generated paraphrases, as others have done, is insufficient to fully characterize their robustness.
7 Do Artifacts Explain Inconsistency?
Many NLP datasets are constructed by crowdworkers writing most or all parts of a natural language reasoning problem. While efficient and scalable, this paradigm can give rise to annotation artifacts (Gururangan et al., 2018), or statistical biases in parts of examples that correlate with the correct label (McCoy et al., 2019). For example, Gururangan et al. (2018) found that negation words (“no”, “nothing”, etc.) in NLI examples are strong indicators of the contradiction label.
One way to detect artifacts in datasets is through a partial-input baseline (Poliak et al., 2018; Feng et al., 2019), a setting in which only the target portion of an NLI instance (i.e., the hypothesis in a traditional NLI setup) is used to train a model to predict entailment. When partial-input models achieve high accuracy, it is often indicative of annotation artifacts.
Full-input models that are trained on datasets with annotation artifacts may learn to rely on such shallow signals instead of performing true inferential reasoning. This may lead to lower paraphrastic consistency in the face of alternative phrasings of examples, since they may no longer contain the annotation artifact the model leveraged. Can we attribute a model’s issues with paraphrastic consistency on ParaNlu entirely to the presence of annotation artifacts? That is, are models only inconsistent on examples with artifacts, since they are relying on spurious correlations? Or, are they inconsistent even when no artifacts are present?
7.1 Experimental Setup
Rudinger et al. (2020) find that partial-input baselines trained on δ-NLI perform at least 10% better than random chance, indicating the presence of annotation artifacts in their dataset. As such, we focus on δ-NLI for our experiments.5 For each split of δ-NLI, we train a RoBERTa-large model using only the update sentence as input, keeping the training hyperparameters identical to those of the full-input RoBERTa models from §5.1. Then, we use these partial-input models to partition buckets in ParaNlu into two subsets: those on which the partial-input model correctly predicted the label from the update sentence of the original example, indicating that an artifact is likely present in the original example, and those on which it did not. Using the full-input RoBERTa models from §5.1, we then compute P̂C on both example subsets, along with the accuracy of partial-input and full-input models on paraphrased and original examples.
7.2 Results
Table 7 shows the accuracy metrics for both partial-input and full-input models on original and paraphrased examples in ParaNlu, as well as paraphrastic consistency metrics on examples that are likely (+) and unlikely (−) to contain artifacts. While not all original examples that a partial-input model predicts correctly necessarily contain artifacts, we expect that (1) examples with particularly strong artifacts are grouped in the likely (+) category, and (2) the unlikely (−) category contains a significantly smaller number of examples with strong artifacts. We observe a dramatic drop in the accuracy of the partial-input baseline from original examples (AO) to paraphrased examples (Â), indicating that most artifacts detectable with a partial-input model do not project through our label-preserving paraphrase process.
|  | artifacts | partial-input AO | partial-input A | partial-input Â | full-input AO | full-input A | full-input Â | PC | P̂C |
|---|---|---|---|---|---|---|---|---|---|
| δ-SNLI | likely | 100 | 81.8 | 54.6 | 54.3 | 56.6 | 85.8 | 74.7 | 91.4 |
| δ-SNLI | unlikely | 0 | 20.7 | 6.9 | 45.5 | 48.8 | 79.6 | 75.1 | 84.0 |
| δ-ATOMIC | likely | 100 | 77.6 | 53.9 | 55.6 | 58.3 | 78.0 | 76.4 | 87.4 |
| δ-ATOMIC | unlikely | 0 | 21.3 | 8.3 | 50.9 | 49.9 | 79.2 | 75.9 | 86.5 |
| δ-SOCIAL | likely | 100 | 77.2 | 58.9 | 52.9 | 55.5 | 85.8 | 73.9 | 90.9 |
| δ-SOCIAL | unlikely | 0 | 28.8 | 8.4 | 50 | 58.7 | 90.7 | 74.7 | 93.5 |
Even on examples unlikely (−) to contain artifacts, where a full-input model cannot rely on shallow signals, models do still have room to improve their paraphrastic consistency. These results indicate that issues with paraphrastic consistency are attributable to factors beyond the presence of artifacts in examples.
8 Training Dynamics and Paraphrastic Consistency
Lastly, we explore the relationship between different parts of the model training pipeline (e.g., pretraining and finetuning) and paraphrastic consistency. How does consistency change as these training processes progress, and does it change in a similar manner as accuracy? Is it the case that simply increasing the volume of pretraining or finetuning data linearly impacts paraphrastic consistency? We train a series of RoBERTa models and adjust the number of pretraining tokens (§8.1) and finetuning examples (§8.2) to explore how they impact a model’s consistency.
8.1 Pretraining and PC
Experimental Setup.
Using the MiniBERTas (Warstadt et al., 2020), a series of RoBERTa models pretrained from scratch on varying numbers of tokens, we compare models trained on 1M, 10M, 100M, and 1B tokens along with RoBERTa-base, which is pretrained on approximately 30B tokens. All models have the same number of parameters as RoBERTa-base (125M), with the exception of the model trained on 1M pretraining tokens, which was scaled down in accordance with the smaller volume of pretraining data and has 45M parameters. We finetune all models on the same data (§4.1) and keep all hyperparameters constant (batch size of 64, 2 finetuning epochs, and a learning rate of 5e-6), ensuring a direct measurement of the impact of pretraining tokens on paraphrastic consistency without other confounds.
Results.
Figure 5 plots model accuracy on paraphrases (Â) against paraphrastic consistency (P̂C), along with the same supporting curves as in Figure 3, corresponding to decreasing proportions of variance attributable to paraphrasing (%PVAP). As expected, pretraining on increasing amounts of data yields monotonically increasing accuracy and paraphrastic consistency. However, paraphrastic consistency grows more rapidly in the beginning (1M–100M tokens), as the curve climbs steeply between supporting curves, and eventually hugs a single %PVAP curve past 100M tokens, indicating a slower payoff from additional pretraining tokens.
8.2 Finetuning and PC
After pretraining, models are endowed with the ability to represent natural language inputs but do not know how to perform a particular reasoning task. As such, we expect monotonically increasing accuracy as the model is shown a larger volume of finetuning examples. However, it is unclear how paraphrastic consistency changes as the model is exposed to more task-specific examples.
Experimental Setup.
For each dataset, we finetune a series of fully pretrained RoBERTa-large models on 1%, 5%, 10%, 50%, and 100% of examples from the training split, sampled at random. δ-ATOMIC has 28.3K training examples, δ-SNLI has 75.2K training examples, and δ-SOCIAL has 65.3K examples. We sample at the premise-hypothesis level and include all examples that share the same premise and hypothesis to prevent data leakage during evaluation. We hold all training hyperparameters constant, keeping the same configuration as the finetuning in §8.1.
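The leakage-free subsampling described above can be implemented as a grouped split in which every example sharing a premise-hypothesis pair lands on the same side; a sketch using scikit-learn, with hypothetical column names:

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

def sample_training_fraction(df: pd.DataFrame, frac: float, seed: int = 0) -> pd.DataFrame:
    """Sample roughly `frac` of the training data at the premise-hypothesis level,
    keeping all updates that share a P-H pair together to avoid leakage."""
    groups = df["premise"] + " || " + df["hypothesis"]
    splitter = GroupShuffleSplit(n_splits=1, train_size=frac, random_state=seed)
    keep_idx, _ = next(splitter.split(df, groups=groups))
    return df.iloc[keep_idx]

# e.g., a 5% subset of the delta-SNLI training split:
# subset = sample_training_fraction(dsnli_train, frac=0.05)
```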
Results.
Figure 6 plots corrected paraphrase accuracy (Â) against corrected paraphrastic consistency (P̂C) for models trained on increasing numbers of finetuning examples across all three datasets in δ-NLI. We observe that as the model starts to learn the task at hand and draw increasingly complex decision boundaries, it is more likely to be inconsistent. Models trained on ≤5% of the available training examples are highly consistent since they make the same prediction for all examples (thus earning an accuracy of around 50%). As the model is shown more examples, it makes finer-grained distinctions between examples, in turn impacting its paraphrastic consistency. Though our results show this decrease, it is possible that with even more finetuning data, a model’s paraphrastic consistency would start to increase again. This relationship may also change if, during finetuning, models are shown increasing amounts of automatically generated paraphrased examples in order to learn both the task and paraphrastic consistency.
9 Related Work
Natural language understanding models may produce different predictions in the face of varying expressions of reasoning problems. A wide range of data generation frameworks have been proposed to study these behaviors in NLP systems. Iyyer et al. (2018) automatically generate paraphrases with specified syntactic templates and measure accuracy on these adversarial examples. Verma et al. (2023) introduce a test set of paraphrases generated with a finetuned T5 model (Raffel et al., 2020) and measure the accuracy of several models. Hu et al. (2019) generate paraphrases of MNLI examples using lexical constraints and evaluate an NLI model on the paraphrased inputs, finding that paraphrasing leads to degraded accuracy. Arakelyan et al. (2024) measure the semantic sensitivity of NLI models by automatically generating examples with FLAN-T5 (Chung et al., 2022) and verifying the generations with bidirectional entailment predicted by pretrained NLI models. While these approaches are scalable, our findings illustrate that it is insufficient to evaluate models on automatic paraphrases alone, as human-written paraphrases introduce more semantic and pragmatic diversity (Section 6). Moreover, we show that bidirectional entailment as a verification method for generated paraphrases is extremely stringent, precluding tests of consistency in the face of more challenging label-preserving transformations.
Another body of research studies the creation of adversarial examples to improve model robustness. Nie et al. (2020) construct an NLI benchmark, Adversarial NLI, by developing a model-in-the-loop framework to iteratively produce examples that models cannot correctly solve. Naik et al. (2018) develop a suite of adversarial examples to “stress test” common failure modes of NLI models involving phenomena such as word overlap or negation. In contrast with these studies, our goal is not to generate a test suite of difficult examples that “break” models (Glockner et al., 2018), but rather to carefully measure the role of paraphrastic variability in model performance.
Other approaches to measuring robustness also include counterfactual example generation (Srikanth and Rudinger, 2022; Kaushik et al., 2020). Kaushik et al. (2020) recruit humans to create counterfactual examples by minimally editing example text in order to flip the gold label and show that models trained on the original datasets perform poorly on counterfactually-manipulated data. Similarly, Gardner et al. (2020) argue for the creation of evaluative contrast sets, or manual minimal perturbations of dataset examples that change the gold label, in order to probe the decision boundary of models. Our work has a related, but distinct, counterfactual flavor: if an original annotator had chosen to phrase the question differently with the same target label, what is the probability that a model’s prediction would stay consistent? We aim to estimate, in expectation, the reliability of models when they are faced with different phrasings of the same problem.
Most of these studies measure accuracy on adversarial examples as the main determination of robustness. Elazar et al. (2021) instead measure the consistency of models with respect to factual knowledge, evaluating whether information extracted from masked language models is invariant to paraphrasing using an agreement-based consistency metric. Our study is similarly concerned with consistency; however, we make precise the relationship between accuracy and consistency on natural language reasoning tasks.
10 Conclusion
As more studies investigate the capabilities of LLMs, the ability to disentangle the effects of paraphrastic variability from other target attributes will be an important analytical tool.
This work introduces a new methodology and dataset for measuring paraphrastic consistency, or PC, of models on natural language reasoning tasks. PC captures the probability that a model will remain consistent in its prediction given different phrasings of the same underlying reasoning problem. We design PC as a metric complementary to accuracy, and propose practitioners use it alongside accuracy when diagnosing modeling errors, summarizing a model’s performance, or deciding when a model is ready for deployment to users for a particular application.
Our results confirm that paraphrastic sensitivity is present in all models, but decreases with pretraining volume. Because PC only requires model predictions to be labeled as correct or incorrect, our approach can generalize to any task with binary scoring (and where answers must be invariant to paraphrases). Future work may consider adapting this approach for tasks with more complex or open-ended evaluations.
Notes
We publicly release all data and code at https://github.com/nehasrikn/paraphrase-nlu.
We explored using NLI models to automatically ensure semantic equivalence, but found it too strict of a formulation to capture the spirit of label-preserving paraphrasing.
For example, a generation model is less likely to paraphrase “The camera zooms out to show the man spraying the car with soap” to “The camera zooms out to servicemen sprinkling the automobile with soap”.
We choose δ-NLI instead of α-NLI for this experiment, since the authors of α-NLI released their dataset after running adversarial filtering.
References
A Crowdsourcing ParaNlu
We collect paraphrases in ParaNlu using Amazon Mechanical Turk. The instructions and annotation interface shown to crowdworkers are shown in Figure 7.
Workers provide 3 paraphrases of both plausible and implausible hypotheses for α-NLI examples and 3 paraphrases of updates for δ-NLI examples. We include a distance widget in our interface that computes the Jaccard similarity between the entered paraphrase and the original text to encourage lexical diversity. Each example was annotated by 3 workers. Workers were paid US$12/hour on average and were required to be native English speakers with a 95% or more HIT acceptance rate on at least 100 HITs.
B Paraphrastic Consistency
Figure 8 shows model accuracy plotted against corrected paraphrastic consistency (P̂C) for all models tested on the δ-ATOMIC and δ-SOCIAL splits of ParaNlu.
Author notes
Action Editor: Jacob Eisenstein