Abstract
We introduce a neuro-symbolic natural logic framework based on reinforcement learning with introspective revision. The model samples and rewards specific reasoning paths through policy gradient, while the introspective revision algorithm modifies intermediate symbolic reasoning steps to discover reward-earning operations and leverages external knowledge to alleviate spurious reasoning and training inefficiency. The framework is supported by properly designed local relation models that avoid input entangling, which helps ensure the interpretability of the proof paths. The proposed model has built-in interpretability and shows superior capability in monotonicity inference, systematic generalization, and interpretability, compared with previous models on the existing datasets.
1 Introduction
In the past decade, deep neural networks have achieved impressive performance on modeling natural language inference (NLI) (Dagan et al., 2005; MacCartney, 2009; Bowman et al., 2015; Chen et al., 2017a, b), which aims to determine the entailment relations between a premise sentence and its corresponding hypothesis. Progress in NLI has greatly benefited from the models’ capabilities at approximating complex underlying functions, discovering and utilizing rich (true and/or spurious) patterns, and exhibiting robustness to noise and ambiguity. However, the black-box models inherently lack interpretability, and still fail to capture many aspects of human reasoning, including monotonicity inference (Yanaka et al., 2019a, b, 2020), systematic compositionality and generalization (Fodor and Pylyshyn, 1988; Aydede, 1997; Yanaka et al., 2020), and negation (Geiger et al., 2020), among others.
A recent research trend has attempted to advance the long-standing problem of bringing together the complementary strengths of neural networks and symbolic models (Garcez et al., 2015; Yang et al., 2017; Rocktäschel and Riedel, 2017; Evans and Grefenstette, 2018; Weber et al., 2019; De Raedt et al., 2019; Mao et al., 2018). Specifically for natural language, natural logic has long been studied to model reasoning in human language (Lakoff, 1970; van Benthem, 1988; Valencia, 1991; Van Benthem, 1995; Nairn et al., 2006; MacCartney, 2009; MacCartney and Manning, 2009; Icard, 2012; Angeli and Manning, 2014). However, the work of investigating the joint advantage of neural networks and natural logic is sparse (Feng et al., 2020) (see Section 2 for more details) and understudied.
In this paper, we present a neuro-symbolic framework that integrates natural logic with neural networks for natural language inference. At the local level, we explore appropriate transformer networks to model the local relations between the constituents of a premise and hypothesis, in order to prevent attention from fully entangling the input, which otherwise can seriously impair the interpretability of proof paths built on local relations. We then construct natural logic programs and use reinforcement learning to reward the aggregation of the local relations. When reinforcement learning passes the final reward signals (NLI labels) through the neural natural logic composition network, it faces the challenges of excessive spurious programs (incorrect programs that lead to correct final NLI labels) as well as training inefficiency; the former is particularly harmful to interpretability. Our framework leverages the proposed Introspective Revision method to discover better reward-earning operations and leverage external knowledge to reduce spurious proofs.
We conducted experiments on six datasets: SNLI (Bowman et al., 2015), HELP (Yanaka et al., 2019b), MED (Yanaka et al., 2019a), MoNLI (Geiger et al., 2020), NatLog-2hop (Feng et al., 2020), and a compositional generalization dataset (Yanaka et al., 2020). The results show the model’s superior capability in monotonicity inference, systematic generalization, and interpretability, compared with previous models on these existing datasets, while the model maintains competitive performance on the generic SNLI test set.
2 Related Work
Natural Logic:
Rather than performing deduction over an abstract logical form, natural logic (Lakoff, 1970; van Benthem, 1988; Valencia, 1991; Van Benthem, 1995; Nairn et al., 2006; MacCartney, 2009; MacCartney and Manning, 2009; Icard, 2012; Angeli and Manning, 2014) models logical inferences in natural language by operating directly on the structure of language. Natural logic allows for a wide range of intuitive inferences in a conceptually clean way (MacCartney, 2009; Angeli and Manning, 2014) and hence provides a good framework for developing explainable neural natural language inference models. Specifically, our work is motivated by the natural logic variant proposed by MacCartney and Manning (2009), for which we will provide more background in Section 3.
Natural Language Inference:
Natural language inference (NLI) (Dagan et al., 2005; MacCartney, 2009; Bowman et al., 2015) aims to identify the entailment relations between premise-hypothesis sentence pairs. Benefiting from pre-training on large-scale unlabeled corpora and then fine-tuning on large crowd-sourced datasets like SNLI (Bowman et al., 2015) and MultiNLI (Williams et al., 2018), pre-trained language models (Devlin et al., 2019; Radford et al., 2019, 2018) have achieved state-of-the-art performance. However, recent work has revealed several drawbacks of current deep NLI systems. The research in Gururangan et al. (2018) and Poliak et al. (2018) has shown that deep NLI models learn to utilize dataset biases and label-relevant artifacts for prediction. Yanaka et al. (2019a, b) and Geiger et al. (2020) showed that a dominant proportion of samples in SNLI and MultiNLI are in upward monotone, and models trained on these datasets have limited ability to generalize to downward monotone. More recently, systematically generated datasets have been proposed to evaluate current models’ ability in compositional generalization, showing that pretrained transformers generalize poorly to unseen combinations of the semantic fragments (Geiger et al., 2019; Richardson et al., 2020; Yanaka et al., 2020; Goodwin et al., 2020).
Neural Network with Logic Components for NLI:
Recent work (Kalouli et al., 2020; Hu et al., 2020; Chen et al., 2021; Feng et al., 2020) has started to combine neural networks with logic-based components. The work most related to ours is Feng et al. (2020), which adapts ESIM (Chen et al., 2017b) to predict relations between tokens in a premise and hypothesis, and composes them to predict final inferential labels. Rather than optimizing the likelihood of specific reasoning paths, the model maximizes the sum of the likelihood of all possible paths (i.e., marginal likelihood) that reach the correct final NLI labels. As a result, the model potentially encourages a large set of spurious reasoning paths and has to rely on external prior and strong constraints to predict meaningful intermediate local relations.
This paper, instead, proposes a reinforcement learning with introspective revision framework to sample and reward specific reasoning paths through the policy gradient method. The introspective revision leverages external commonsense knowledge to tackle spurious proof paths and training inefficiency, key issues in developing interpretable neuro-symbolic models. To support that, local relation components need to be carefully designed. We will demonstrate that the proposed model substantially outperforms that proposed in Feng et al. (2020) on five datasets.
Policy Gradient:
Policy gradient algorithms like REINFORCE (Williams, 1992) have been used in neuro-symbolic models to connect neural representation learning and symbolic reasoning (Andreas et al., 2017; Liang et al., 2017; Mascharka et al., 2018; Yi et al., 2018; Mao et al., 2018). The original REINFORCE algorithm suffers from sparse rewards and high variance in the gradient. To overcome these issues, the research presented in Popov et al. (2017), Goyal et al. (2019), and Trott et al. (2019) proposes reward shaping, which leverages domain-specific knowledge to carefully design the reward functions. Instead of learning only from the desired outcomes, some approaches also learn from failed attempts. Hindsight Experience Replay (HER) (Andrychowicz et al., 2017) and Scheduled Auxiliary Control (SAC-X) (Riedmiller et al., 2018) can replay the failed episodes and provide the agent with auxiliary learning goals to enable sample-efficient learning. Li et al. (2020) propose a back-search algorithm, which diagnoses the failed reasoning processes and corrects potential errors to facilitate model training. Based on Li et al. (2020), we propose the introspective revision method, which leverages external knowledge to effectively discover reward-earning reasoning programs and to alleviate spurious reasoning.
3 Background
Our model’s backbone logic framework is based on the MacCartney and Manning (2009) variant of the natural logic formalism. The inference system operates by mutating spans of text in a premise to obtain the corresponding hypothesis sentence, and generates proofs based on the natural logic relations of the mutations. To extend the entailment relations to consider semantic exclusion, MacCartney and Manning (2009) introduced seven set-theoretic relations for modeling entailment relations between two spans of texts (see Table 1 for some examples).
A set of seven natural logic relations proposed by MacCartney and Manning (2009).
| Relation | Relation Name | Example |
|---|---|---|
| x ≡ y | equivalence | mom ≡ mother |
| x ⊑ y | forward entailment | cat ⊑ animal |
| x ⊒ y | reverse entailment | sports ⊒ table-tennis |
| x ∧ y | negation | human ∧ nonhuman |
| x ∣ y | alternation | cat ∣ dog |
| x ⌣ y | cover | animal ⌣ nonhuman |
| x # y | independence | happy # student |
Assuming the availability of the alignment between a premise and hypothesis, the system first infers the relations between aligned pairs of words or phrases. Consider the top-left example in Figure 1: the relation between “the child” and “the kid” is equivalence (≡), same as the relation between “does not love” and “doesn’t like”, while “sports” reversely entails (⊒) “table-tennis”.
An overview of the proposed neuro-symbolic natural logic framework.
The next step is monotonicity inference. Monotonicity is a pervasive feature of natural language that explains the impact of semantic composition on entailment relations (Van Benthem, 1986; Valencia, 1991; Icard and Moss, 2014). Similar to the monotone functions in calculus, upward monotone keeps the entailment relation when the argument “increases” (e.g., cat ⊑ animal). Downward monotone keeps the entailment relation when the argument “decreases” (e.g., in all animals ⊑ all cats). The system performs monotonicity inference through a projection function ρ, which is determined by the context and projection rules. Table 2 shows some examples. Consider the last row in the table, which shows how the projection function ρ works in the negated context following the negation word not. Specifically, this row shows the seven relations that ρ(r) will output, given the corresponding input relations r. For example, if the input relation is forward entailment (⊑), the function ρ projects it to reverse entailment (⊒); that is, ρ(‘⊑’) = ‘⊒’. As a result, in the example in Figure 1, the reverse entailment relation (⊒) between “sports” and “table-tennis” will be projected to forward entailment (⊑) in the negated context.
The projection function ρ maps input relations to output relations under different contexts (here, different surrounding quantifiers).
| Quantifier & Connective | Proj. | ≡ | ⊑ | ⊒ | ∧ | ∣ | ⌣ | # |
|---|---|---|---|---|---|---|---|---|
| all | ρarg1(r) | ≡ | ⊒ | ⊑ | ∣ | # | ∣ | # |
| | ρarg2(r) | ≡ | ⊑ | ⊒ | ∣ | ∣ | # | # |
| some | ρarg1(r) | ≡ | ⊑ | ⊒ | ⌣ | # | ⌣ | # |
| | ρarg2(r) | ≡ | ⊑ | ⊒ | ⌣ | # | ⌣ | # |
| not | ρ(r) | ≡ | ⊒ | ⊑ | ∧ | ⌣ | ∣ | # |
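To make the projection step concrete, the following minimal Python sketch encodes the “not” row of Table 2 (and the identity behavior of upward-monotone contexts) and applies it to the Figure 1 example. The relation encoding and function names are ours and are only illustrative; they are not taken from the released code.

```python
# Minimal sketch of the projection step: identity in upward-monotone contexts
# and the "not" row of Table 2 in downward-monotone (negated) contexts.
# The relation encoding and function names are illustrative only.
EQ, FWD, REV, NEG, ALT, COV, IND = "≡", "⊑", "⊒", "∧", "∣", "⌣", "#"

PROJECT_NOT = {EQ: EQ, FWD: REV, REV: FWD, NEG: NEG, ALT: COV, COV: ALT, IND: IND}

def project(relation: str, downward: bool) -> str:
    """Project a local relation according to the monotonicity of its context."""
    return PROJECT_NOT[relation] if downward else relation

# Figure 1: "sports" ⊒ "table-tennis" becomes ⊑ under the negation "does not".
assert project(REV, downward=True) == FWD
```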
Built on that, the system aggregates/composes the projected local relations to obtain the inferential relation between a premise and hypothesis. Specifically, Table 3 shows the composition function when a relation (in a row) is composed with another (in a column). In practice, multiple compositions as such are performed in sequential order or from leaves to root along a constituency parse tree. MacCartney (2009) shows that different orders of composition yield consistent results except in some rare artificial cases. Therefore, many studies, including ours here, perform a sequential (left-to-right) composition. In the example in Figure 1, composing the two equivalence relations (≡) with forward entailment (⊑) yields forward entailment (⊑), resulting in the prediction that the premise entails the hypothesis.
Results (Icard, 2012) of composing one relation (row) with another relation (column).
| ⋈ | ≡ | ⊑ | ⊒ | ∧ | ∣ | ⌣ | # |
|---|---|---|---|---|---|---|---|
| ≡ | ≡ | ⊑ | ⊒ | ∧ | ∣ | ⌣ | # |
| ⊑ | ⊑ | ⊑ | # | ∣ | ∣ | # | # |
| ⊒ | ⊒ | # | ⊒ | ⌣ | # | ⌣ | # |
| ∧ | ∧ | ⌣ | ∣ | ≡ | ⊒ | ⊑ | # |
| ∣ | ∣ | # | ∣ | ⊑ | # | ⊑ | # |
| ⌣ | ⌣ | ⌣ | # | ⊒ | ⊒ | # | # |
| # | # | # | # | # | # | # | # |
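The composition step can be written down just as directly. The sketch below hard-codes the join operation of Table 3 and composes projected relations from left to right, reproducing the Figure 1 example; again, the encoding is illustrative rather than the released implementation.

```python
# Sketch of sequential left-to-right composition with the join table of
# Table 3 ('#' is absorbing). The encoding is illustrative only.
EQ, FWD, REV, NEG, ALT, COV, IND = "≡", "⊑", "⊒", "∧", "∣", "⌣", "#"
ORDER = [EQ, FWD, REV, NEG, ALT, COV, IND]

JOIN_ROWS = {  # row relation -> outputs for the columns listed in ORDER
    EQ:  [EQ,  FWD, REV, NEG, ALT, COV, IND],
    FWD: [FWD, FWD, IND, ALT, ALT, IND, IND],
    REV: [REV, IND, REV, COV, IND, COV, IND],
    NEG: [NEG, COV, ALT, EQ,  REV, FWD, IND],
    ALT: [ALT, IND, ALT, FWD, IND, FWD, IND],
    COV: [COV, COV, IND, REV, REV, IND, IND],
    IND: [IND] * 7,
}

def join(r1: str, r2: str) -> str:
    return JOIN_ROWS[r1][ORDER.index(r2)]

def compose(projected_relations, start=EQ):
    """Compose projected local relations from left to right."""
    state = start
    for r in projected_relations:
        state = join(state, r)
    return state

# Figure 1 (left): two equivalences and a projected forward entailment.
assert compose([EQ, EQ, FWD]) == FWD   # the premise entails the hypothesis
```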
4 Method
This section introduces our neural natural logic framework based on the proposed Reinforcement Learning with Introspective Revision approach. We start with local relation modeling, in which caution needs to be taken to avoid the input entangling problem, which can seriously harm the model’s interpretability. Viewing the local relation distribution as the stochastic policy, our model then samples and rewards specific reasoning paths through policy gradient, in which the Introspective Revision algorithm can modify intermediate symbolic reasoning steps to discover better reward-earning operations and leverage external knowledge to alleviate spurious reasoning and training inefficiency.
4.1 Local Relation Modeling
We use phrases/chunks instead of words as the basic reasoning units. The primary motivation for chunking is to shorten the reasoning paths and hence reduce the number of possible paths, both of which make the reasoning process more efficient. Motivated by Ouyang and McKeown (2019), we segment the premise P and the hypothesis H into several phrases/chunks. Specifically, we first extract noun phrases with spaCy (Honnibal et al., 2020) and then group the continuous spans of words between two noun phrases as chunks. As shown in Figure 1, by identifying the noun phrases “the kid” and “table tennis”, the hypothesis sentence H is segmented into three chunks. We denote the number of chunks in the hypothesis as m, and the t-th hypothesis chunk (and its vectorized representation) as st. Similarly, the t′-th premise chunk (and its representation) is denoted as ŝt′.
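As a rough illustration of this segmentation heuristic, the sketch below extracts noun phrases with spaCy and groups the remaining spans between them into chunks. It assumes the en_core_web_sm model is installed, and the paper’s exact grouping rules may differ.

```python
# Sketch of the chunking heuristic: noun phrases from spaCy, plus the spans of
# words between consecutive noun phrases. The paper's exact rules may differ.
import spacy

nlp = spacy.load("en_core_web_sm")   # assumes this model has been downloaded

def chunk(sentence: str):
    doc = nlp(sentence)
    noun_spans = [(np.start, np.end) for np in doc.noun_chunks]
    chunks, cursor = [], 0
    for start, end in noun_spans:
        if start > cursor:                       # words between two noun phrases
            chunks.append(doc[cursor:start].text)
        chunks.append(doc[start:end].text)       # the noun phrase itself
        cursor = end
    if cursor < len(doc):                        # trailing words, if any
        chunks.append(doc[cursor:].text)
    return chunks

print(chunk("The kid doesn't like table tennis."))
# e.g., ['The kid', "doesn't like", 'table tennis', '.']
```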
As the first step of the neuro-symbolic natural logic, we use a neural network to model the local natural logic relation between each hypothesis phrase st and its associated premise constituents. However, accurately finding the hard alignment between st and the corresponding phrase in the premise is itself a hard problem (MacCartney et al., 2008). Current state-of-the-art NLI systems, like BERT (Devlin et al., 2019), use bi-directional soft attention to model the cross-sentence relationship; however, we observe that it tends to fully entangle the input (DeYoung et al., 2020). Consider the top-left example in Figure 1. If we use BERT to encode the input sentences, then the bi-directional attention model can infer the final NLI label solely based on the last-layer hidden states of the first hypothesis phrase “the kid”, because the contextualized representation of this phrase entangles the information of the whole input through attention. Consequently, the hidden states of the phrase contain global information and are thus not suitable for modeling the local relations.
To alleviate the undesired entangling, we model local relations with uni-directional attention (such as GPT-2). On the one hand, the uni-directional attention prevents entangling future inputs. For example, in Figure 1, the phrase “table tennis” will not affect the relation prediction anchored on “the kid”. On the other hand, although the last hypothesis phrase attends to all previous inputs, without knowing whether the current phrase is the ending one (the future inputs are not available), the model cannot skip predicting the natural logic relation at the current phrase st and postpone all the required reasoning to the last phrase. Specifically, suppose a model always predicts equivalence (≡) at each step t and postpones its final decision to the last hypothesis phrase. Without knowing that “table tennis” is the ending phrase, the model can predict equivalence (≡) for “table tennis” and wait to make a better decision upon seeing the next input phrase, which actually does not exist. Failing to make timely local predictions that lead to the correct label before running out of the hypothesis phrases, the model will receive a negative reward in the end. In this way, the model is encouraged to be more careful in predicting the local relation for each hypothesis phrase. We also develop a model that predicts local relations by masking both the past and future hypothesis chunks. Compared with such a model, we will show later (Table 6) that the uni-directional attention model performs better, partly because it preserves the structure of the pretrained GPT-2 model.
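The sketch below illustrates one way such a uni-directional local relation model could be assembled from a pretrained GPT-2 encoder: the hidden state at the last sub-token of each hypothesis chunk feeds a linear head over the natural logic relations. The input format, pooling choice, and classifier head are our assumptions for illustration, not the paper’s released architecture.

```python
# Schematic local-relation model: GPT-2 reads "premise hypothesis" strictly
# left-to-right, and a linear head over the hidden state at the last sub-token
# of each hypothesis chunk gives a distribution over natural logic relations.
# The input format, pooling choice, and head are assumptions for illustration.
import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
encoder = GPT2Model.from_pretrained("gpt2")
relation_head = torch.nn.Linear(encoder.config.n_embd, 7)  # seven relations of Table 1

def local_relation_distributions(premise: str, hypothesis_chunks):
    # Tokenize the premise and each hypothesis chunk separately so that the
    # chunk-final token positions are known exactly, then concatenate.
    pieces = [tokenizer(premise)["input_ids"]]
    pieces += [tokenizer(" " + chunk)["input_ids"] for chunk in hypothesis_chunks]
    input_ids = [tok for piece in pieces for tok in piece]

    positions, offset = [], len(pieces[0])
    for piece in pieces[1:]:
        offset += len(piece)
        positions.append(offset - 1)            # last sub-token of each chunk

    hidden = encoder(torch.tensor([input_ids])).last_hidden_state[0]
    logits = relation_head(hidden[positions])   # (m, 7)
    return torch.softmax(logits, dim=-1)        # one distribution per chunk

probs = local_relation_distributions(
    "The child does not love sports", ["The kid", "doesn't like", "table tennis"])
print(probs.shape)  # torch.Size([3, 7])
```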
4.2 Natural Logic Program
We propose to use reinforcement learning to develop neural natural logic, which views the local relation distribution pt as the stochastic policy. At each time step t, the model samples a relation according to the policy, and we treat the sequence of sampled relations as a symbolic program, which executes to produce the final inferential relation between a premise and hypothesis. To the best of our knowledge, this is the first model that integrates reinforcement learning with natural logic.
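A minimal sketch of this view is given below: the per-chunk relation distributions act as a stochastic policy, a program is sampled with one relation per hypothesis chunk, and a discounted terminal reward is propagated back through REINFORCE. The discounting scheme is an assumption loosely guided by the hyperparameters reported in Section 5 (μ = 1.0, γ = 0.5), not a transcription of the paper’s exact objective.

```python
# Sketch of treating the per-chunk relation distributions as a stochastic
# policy: sample one relation per hypothesis chunk to form a program, then
# apply REINFORCE with a discounted final reward. The reward shaping below is
# an assumption guided by the hyperparameters in Section 5 (mu = 1.0, gamma).
import torch

def sample_program(probs):
    """probs: (m, num_relations) distributions from the local relation model."""
    dist = torch.distributions.Categorical(probs=probs)
    actions = dist.sample()              # one relation index per hypothesis chunk
    return actions, dist.log_prob(actions)

def reinforce_loss(log_probs, final_state_correct, mu=1.0, gamma=0.5):
    m = log_probs.shape[0]
    reward = mu if final_state_correct else 0.0
    # Discount the terminal reward back through the m reasoning steps.
    returns = torch.tensor([reward * gamma ** (m - 1 - t) for t in range(m)])
    return -(log_probs * returns).sum()

# Toy usage: a parameterized policy over 3 hypothesis chunks and 7 relations.
logits = torch.zeros(3, 7, requires_grad=True)
actions, log_probs = sample_program(torch.softmax(logits, dim=-1))
loss = reinforce_loss(log_probs, final_state_correct=True)
loss.backward()
print(actions.tolist(), logits.grad.shape)
```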
Rewards and Optimization:
Relation Augmentation:
It can be hard to learn the reverse entailment (⊒) relation from the existing NLI datasets, because a premise-hypothesis pair is labeled as neutral if H entails P and P does not entail H. To help the model distinguish reverse entailment (⊒) from independence (#), both of which result in the NLI label neutral, we perform relation augmentation to create samples whose hypothesis entails the premise. Specifically, for each sample that is originally labeled as entailment in the training set, we create an augmented sample by exchanging the premise and the hypothesis. Note that we avoid augmenting the case where P and H mutually entail each other, because the new premise still entails the hypothesis after the exchange. To achieve this, we exclude an exchanged sample from relation augmentation if it is still identified as entailment by a pretrained model finetuned on SNLI (Bowman et al., 2015). For the augmented samples, the program receives a positive reward during training if and only if it reaches the correct final state reverse entailment (⊒).
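A data-level sketch of this augmentation is shown below. Here nli_predict is a hypothetical stand-in for the pretrained SNLI-finetuned classifier used to filter out mutual entailment, and the field names are illustrative.

```python
# Data-level sketch of relation augmentation. `nli_predict` is a hypothetical
# stand-in for the pretrained SNLI-finetuned classifier used to filter out
# pairs that entail each other; field names are illustrative.
def nli_predict(premise: str, hypothesis: str) -> str:
    """Placeholder for a pretrained NLI classifier finetuned on SNLI."""
    raise NotImplementedError

def augment_with_reverse_entailment(dataset):
    augmented = []
    for sample in dataset:
        if sample["label"] != "entailment":
            continue
        p, h = sample["premise"], sample["hypothesis"]
        # Skip mutually entailing pairs: the swapped pair would not be "reverse".
        if nli_predict(h, p) == "entailment":
            continue
        augmented.append({"premise": h, "hypothesis": p, "target_state": "⊒"})
    return dataset + augmented
```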
4.3 Introspective Revision
The key challenges of developing interpretable neural natural logic models include coping with spurious reasoning paths (incorrect paths leading to the correct inferential label for a premise-hypothesis pair) as well as training inefficiency. Finding a correct program that reaches the correct label is challenging because it is inefficient to explore a space of 5^m paths for a reward, and a positive reward for the correct path is often sparse.
We propose to use the fail-and-fix approach based on the recently proposed Back-Search algorithm (Li et al., 2020) to mitigate the training inefficiency caused by sparse positive rewards: starting from a failed program that earns no positive reward, it searches for better proof paths in the program’s neighborhood that reach the correct final prediction. To address the issue of spurious programs in this fail-and-fix framework, we propose Introspective Revision, which leverages external commonsense knowledge (denoted as 𝒦) to control spurious proof paths. We believe unstated commonsense knowledge is important not only for improving prediction accuracy (which, as discussed in Section 2, often results from fitting to spurious correlations), but also critical for developing interpretable natural language reasoning models by avoiding spurious proofs.
Without loss of generality, we distinguish a non-spurious program r* from spurious ones based on the following assumption, whose effectiveness will be shown and discussed in our experiments.
Assumption 4.1 A program r* has a larger probability than another program r of being non-spurious if r* has a better agreement with the external knowledge base 𝒦.
External Knowledge:
Previous work (Chen et al., 2017a) queries the knowledge base exhaustively for each pair of words between a premise and hypothesis, which is inefficient and likely to introduce undesired local relations. As a remedy, we found that the lightweight text alignment tool JacanaAligner (Yao et al., 2013), though not accurate enough to align all pairs of associated phrases in the input, can be used to guide the search. For a hypothesis phrase s, we first apply JacanaAligner to obtain its associated premise phrase ŝ, and then query the WordNet (Miller, 1998) database for the possible natural logic relations for the phrase pair 〈s, ŝ〉:
- ① Equivalence (≡): s = ŝ or s ⊂ ŝ;
- ② Forward Entailment (⊑): s ⊂ ŝ or s = ŝ, or ∃u ∈ s, v ∈ ŝ such that u is a hypernym of v;
- ③ Reverse Entailment (⊒): ŝ ⊂ s, or ∃u ∈ s, v ∈ ŝ such that v is a hypernym of u;
- ④ Alternation (∣): ∃u ∈ s, v ∈ ŝ such that u is an antonym of v;

where u, v denote tokens in the corresponding phrases and s ⊂ ŝ means that s is a sub-phrase of ŝ. The local relations suggested by the knowledge base are formulated as a set of triplet proposals (t, r̂, pt[r̂]), where t is the time step, r̂ is the suggested relation, and pt[r̂] is the model-predicted probability that corresponds to r̂.
Human-curated rules, which are designed to retrieve natural logic relations from the knowledge base, are often imperfect, and they inevitably introduce errors due to language variations. For instance, s ⊂ ŝ intuitively indicates forward entailment (⊑) (e.g., “white cat” entails “cat”), yet there are cases where the sub-phrase rule indicates equivalence (≡) (e.g., “have a chat with” is equivalent to “chat with” in meaning), and in rare cases the relation can even be alternation (∣) (e.g., “fake gun” and “gun” are distinct concepts). While s = ŝ often indicates equivalence (≡), our rules also need to handle cases where an adverbial appears in a separate phrase; for example, “a bike” together with “near the park” entails “a bike”.
To deal with this issue, instead of making an intensive effort to design sophisticated rules to pinpoint a single accurate relation, we design relatively coarse rules to narrow down the possibilities and leave the final choice to the model. Specifically, at each step we provide the model with multiple possible candidates, and the proposed introspective revision algorithm introduced in this section decides to accept a useful proposal or reject a misleading one, based on both the reasoning objective (i.e., the label) and the predicted relation distribution.
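The sketch below illustrates how rules ①∼④ could be realized over an aligned phrase pair 〈s, ŝ〉 with NLTK’s WordNet interface; the alignment itself (from JacanaAligner) is assumed given, and the tokenization and sub-phrase tests are simplified for illustration rather than taken from the released code.

```python
# Sketch of rules ①-④ over an aligned phrase pair (s, ŝ), querying WordNet via
# NLTK (requires: nltk.download("wordnet")). The alignment is assumed given;
# tokenization and sub-phrase tests are simplified for illustration.
from nltk.corpus import wordnet as wn

def is_hypernym(u: str, v: str) -> bool:
    """True if some sense of u is a (transitive) hypernym of some sense of v."""
    u_synsets = set(wn.synsets(u))
    for sv in wn.synsets(v):
        if u_synsets & set(sv.closure(lambda s: s.hypernyms())):
            return True
    return False

def is_antonym(u: str, v: str) -> bool:
    return any(a.name() == v for su in wn.synsets(u)
               for lemma in su.lemmas() for a in lemma.antonyms())

def candidate_relations(s: str, s_hat: str):
    s_toks, h_toks = s.lower().split(), s_hat.lower().split()
    sub = " ".join(s_toks) in " ".join(h_toks)          # s is a sub-phrase of ŝ
    sup = " ".join(h_toks) in " ".join(s_toks)          # ŝ is a sub-phrase of s
    proposals = set()
    if s_toks == h_toks or sub:
        proposals.add("≡")                              # rule ①
    if sub or s_toks == h_toks or any(is_hypernym(u, v) for u in s_toks for v in h_toks):
        proposals.add("⊑")                              # rule ②
    if sup or any(is_hypernym(v, u) for u in s_toks for v in h_toks):
        proposals.add("⊒")                              # rule ③
    if any(is_antonym(u, v) for u in s_toks for v in h_toks):
        proposals.add("∣")                              # rule ④
    return proposals

print(candidate_relations("animals", "dogs"))  # WordNet should suggest ⊑ here
```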
Algorithm:
Given a sampled program r, the goal of the Introspective Revision algorithm is to find a program r* in the neighbourhood of r that executes to the correct answer y while maintaining a large agreement with the external knowledge 𝒦, as detailed in Algorithm 1. The algorithm starts with the knowledge-driven revision (lines 2∼15). We arrange the triplet proposals obtained from the knowledge base as a priority queue. In each iteration the queue pops the triplet (t, r̂, pt[r̂]) with the largest probability pt[r̂], which specifies a modification to the sampled program: r′ = Fix(r, t, r̂). In other words, changing the relation rt at step t of program r to the proposed relation r̂ yields a new program r′. Following Li et al. (2020), the modification is accepted with a probability 1 − ϵ if r′ executes to the correct answer y; otherwise, it is accepted with a probability ϵ. The hyperparameter ϵ encourages the model to explore low-probability proposals. For each sample, the model accepts or rejects up to M triplets.
The knowledge-driven revision above is conservative because only the top-M proposals are considered. However, there are complex cases where the program still cannot reach the correct answer after M steps, or where the provided proposals are insufficient to solve the problem. In these cases, we apply the answer-driven revision (lines 17∼22) by conducting a 5 × m grid search to find modifications that lead to the correct answers. Among the search results Ψ, we accept the triplet with the maximum probability. A detailed description of the grid search is presented in Algorithm 2.
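A compressed sketch of the procedure, as described above, is given below. The execute argument stands for the projection-and-composition of Section 3; the five-relation action space and the acceptance probabilities follow the description in the text, and remaining details (e.g., when to stop) are filled in as assumptions.

```python
# Compressed sketch of Introspective Revision (Algorithms 1 and 2) as
# described above. `execute` stands for the projection-and-composition of
# Section 3; the action space and stopping details are assumptions.
import heapq, random

RELATIONS = ["≡", "⊑", "⊒", "∣", "#"]            # assumed 5-relation action space

def introspective_revision(program, probs, proposals, y, execute, M=3, eps=0.2):
    """program: list of relations; probs[t][r]: policy probability of relation r
    at step t; proposals: iterable of (t, r_hat) pairs from the knowledge base."""
    r = list(program)
    if execute(r) == y:                           # nothing to revise
        return r
    # Knowledge-driven revision: pop proposals with the largest probability first.
    queue = [(-probs[t][r_hat], t, r_hat) for t, r_hat in proposals]
    heapq.heapify(queue)
    for _ in range(min(M, len(queue))):
        _, t, r_hat = heapq.heappop(queue)
        candidate = r[:t] + [r_hat] + r[t + 1:]   # Fix(r, t, r_hat)
        accept_prob = 1 - eps if execute(candidate) == y else eps
        if random.random() < accept_prob:
            r = candidate
        if execute(r) == y:
            return r
    # Answer-driven revision: 5 x m grid search, keep the most probable fix.
    best, best_p = None, -1.0
    for t in range(len(r)):
        for r_hat in RELATIONS:
            candidate = r[:t] + [r_hat] + r[t + 1:]
            if execute(candidate) == y and probs[t][r_hat] > best_p:
                best, best_p = candidate, probs[t][r_hat]
    return best if best is not None else r
```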
5 Experiments
We evaluate the performance of the proposed model on six NLI tasks from various perspectives: the ability of performing monotonicity inference (Section 5.2), reasoning systematicity (Section 5.3), and model interpretability (Section 5.4).
Our model is trained on Stanford Natural Language Inference (SNLI) (Bowman et al., 2015), in which the relation between a premise and hypothesis is classified to either entailment, contradiction, or neutral. We set the unit reward μ = 1.0, and optimize our model with Adam gradient descent for six epochs with a learning rate of 2e-5. We compare the models with discount factor γ ∈{0.25,0.50,0.75,1.00} and ϵ ∈{0.05,0.10,0.20}. We found that the test accuracies are not sensitive to γ when γ ≥ 0.50, and we select γ = 0.50, ϵ = 0.20, which achieved the best validation accuracy on SNLI. For the introspective revision algorithm we set M = 3 based on the average number of proposals (2.383 proposals/sample) in Table 5. We treat the revised program and the original program as equally informative by setting λ = 0.5. Our code is available at https://github.com/feng-yufei/NS-NLI.
5.1 Statistics for Introspective Revision
In Table 4, we present the statistics for the introspective revision at the start/end of the training, where the natural logic programs are sampled from the predicted distribution. Approximately 80% of the samples perform at least one step of revision, and at the end of the training, there is an increasing chance (98.4% vs. 59.4%) that introspective revision helps the model reach the final correct NLI prediction. In Table 5, we show the statistics of the average number of triplet proposals obtained from WordNet and the average number of proposals accepted by knowledge or answer-driven revision during training. Equivalence (≡) and forward entailment () make up a large portion of the proposals, while the alternation relation is scarce due to the sparsity of the antonym relation obtained from WordNet. As a result, the numbers of proposals accepted in the knowledge-driven revision are imbalanced across different relations. Moreover, we found that the number of accepted answer-driven revisions slightly increased at the end of the training, which is due to the fact that as the training proceeds, the programs produced by the model are closer to the target labels.
The percentage of samples being revised and the revision success rate at the start/end of the training.
| Phase | Revision (Knowl. / Answ. / Both) | Success Rate of Revision |
|---|---|---|
| Start | 80.4% (85.2% / 8.1% / 6.7%) | 59.4% |
| End | 81.7% (80.3% / 5.9% / 13.8%) | 98.4% |
The average number of triplet proposals obtained from the WordNet per sample and the average number of proposals accepted by knowledge or answer-driven revision at the start / end of the training.
| Relation | Knowledge Available | Knowl.-driven start / end | Answ.-driven start / end |
|---|---|---|---|
| Equivalence | 1.035 | 0.595 / 0.482 | 0.096 / 0.026 |
| Fwd. Entail | 1.087 | 0.370 / 0.523 | 0.014 / 0.037 |
| Rev. Entail | 0.249 | 0.097 / 0.191 | 0.008 / 0.061 |
| Alternation | 0.012 | 0.004 / 0.008 | 0.001 / 0.037 |
| Sum | 2.383 | 1.066 / 1.204 | 0.119 / 0.161 |
5.2 Performance on Monotonicity Reasoning
We conduct experiments on multiple recently proposed challenging test datasets for monotonicity inference: HELP (Yanaka et al., 2019b), MED (Yanaka et al., 2019a), and Monotonicity NLI (MoNLI) (Geiger et al., 2020). Unlike SNLI, half of the samples in HELP, MED, and MoNLI are in downward monotone, and they are categorized as entailment or non-entailment. In the above datasets, a premise and the corresponding hypothesis differ by 1-hop; that is, they differ by a single lexical substitution, insertion, or deletion. In addition, we also evaluate our model on the Natural Logic 2-hop dataset (Feng et al., 2020), which requires a model to perform a 2-hop natural logic composition according to Table 3.
We compare our model with popular natural language inference baselines including ESIM (Chen et al., 2017b), BERT-base (Devlin et al., 2019), GPT-2 (Radford et al., 2019), and Feng et al. (2020). Following Yanaka et al. (2019a) and to ensure a fair comparison, all models are trained on SNLI, and during testing, we regard contradiction and neutral as non-entailment if a binary prediction is required.
Table 6 shows the test accuracy on SNLI and four challenging test datasets. Our model performs consistently and significantly better than previous state-of-the-art models on all challenging datasets while achieving competitive “in-domain” performance on SNLI. Manual inspection shows that compared to GPT-2, a significant proportion of the failure cases on SNLI are due to errors from the projectivity parser and the ambiguity between contradiction and neutral (Bowman et al., 2015). The introspective revision algorithm significantly boosts the model performance on the monotonicity reasoning test sets (A0 vs. A3). Ablation shows that the knowledge-driven revision improves the performance on MoNLI and the 2-hop dataset (A0 vs. A1), which suggests that without proper constraints, the answer-driven revision can lead to spurious reasoning. We found that removing equivalence (≡) (knowledge ①) from the knowledge-driven revision lowers the performance, because in this case the knowledge-driven revision mistakenly encourages the model to replace equivalence (≡) with forward entailment (⊑), which may lead to incorrect prediction under downward monotonicity. Compared to forward entailment (⊑) (knowledge ②), removing reverse entailment (⊒) (knowledge ③) and alternation (∣) (knowledge ④) does not significantly affect the results. We deduce that the relative importance of different relations is affected by their frequency in the external knowledge, and that without the help of the knowledge-driven revision, the model can still learn the reverse entailment (⊒) relation from the relation augmentation in Section 4.2. The performance drops when relation augmentation is removed (A0 vs. A4).
Model accuracy on multiple challenging test datasets. All models are trained on SNLI and the results of model (A0) ∼ (A5) are the average of 3 models starting from different consistently-seeded initializations.
| Model | SNLI | HELP | MED | MoNLI | Natural Logic 2-hop |
|---|---|---|---|---|---|
| ESIM (Chen et al., 2017b) | 88.0 | 55.3 | 51.8 | 63.9 | 45.1 |
| BERT-base (Devlin et al., 2019) | 90.1 | 51.4 | 45.9 | 53.0 | 49.3 |
| GPT-2 (Radford et al., 2019) | 89.5 | 52.1 | 44.8 | 57.5 | 48.3 |
| Feng et al. (2020) | 81.2 | 58.2 | 52.4 | 76.8 | 60.1 |
| Ours – full model (A0) | 87.5 | 65.9 | 66.7 | 87.8 | 62.2 |
| w/o knowledge ① | 87.2 | 62.8 | 62.2 | 77.0 | 61.7 |
| w/o knowledge ② | 87.4 | 65.8 | 64.2 | 81.7 | 51.7 |
| w/o knowledge ③ | 87.5 | 65.6 | 65.9 | 83.6 | 61.6 |
| w/o knowledge ④ | 87.6 | 65.4 | 64.7 | 83.3 | 58.2 |
| w/o knowledge ①②③④ (A1) | 87.6 | 65.0 | 64.8 | 77.3 | 48.8 |
| w/o answer-driven revision (A2) | 87.5 | 65.4 | 65.5 | 85.1 | 60.9 |
| w/o introspective revision (A3) | 87.6 | 62.1 | 60.7 | 74.4 | 53.3 |
| w/o relation augmentation (A4) | 87.8 | 59.6 | 54.7 | 74.7 | 59.9 |
| Ours w/ masked attention (A5) | 75.9 | 61.3 | 61.6 | 70.9 | 54.6 |
We also include the model that masks both the past and the future hypothesis chunks in the transformer attention layers for local relation prediction (A5). The model with masked attention yields significantly lower performance on SNLI, partly due to the fact that aggressively masking the past hypothesis chunks changes the structure of the pretrained GPT-2 model, and thus the model benefits less from the pretrained representations.
5.3 Systematicity of Monotonicity Inference
Making systematic generalizations from limited data is an essential property of human language (Lake and Baroni, 2018). While finetuning pretrained transformers achieves high NLI accuracy, Yanaka et al. (2020) have recently shown that these models have limited capability of capturing the systematicity of monotonicity inference. We use the dataset proposed by Yanaka et al. (2020) to evaluate the model’s ability in compositional generalization: The model is exposed to all primitive types of quantifiers Q and predicate replacements R, but samples in the training set and test set contain different combinations of quantifiers and predicate replacements. Specifically, with an arbitrarily selected set of quantifiers {q} and predicate replacements {r}, the training set contains the data D_{{q},R} ∪ D_{Q,{r}}, while the test data only includes the complementary set D_{Q∖{q}, R∖{r}}. An example of compositional generalization is shown below:
- (1)
P: Some dogs run ⇒ H: Some animals run
- (2)
P: No animals run ⇒ H: No dogs runs
- (3)
P: Some small dogs run ⇒ H: Some dogs run
An ideal model can learn from the training samples (1), (2), and (3) the entailment relations between the concepts small dog ⊑ dog ⊑ animal, as well as the fact that the quantifier some indicates upward monotonicity and no indicates downward monotonicity. During testing, the model needs to compose the entailment relations and the monotonicity signatures to make inferences over unseen combinations, for example, sample (4):
- (4)
P: No dogs run ⇒ H: No small dogs run
- (5)
P: Near the shore, no dogs run ⇒
H: Near the shore, no small dogs run
To test the model stability, Yanaka et al. (2020) also added adverbs or prepositional phrases as test-only noise to the beginning of both the premise and the hypothesis, for example, sample (5).
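For concreteness, a sketch of constructing the compositional split described above is shown below; the field names quantifier and replacement are hypothetical.

```python
# Sketch of the compositional data split described above: train on
# D_{{q},R} ∪ D_{Q,{r}}, test on the complementary D_{Q\{q}, R\{r}}.
# The field names "quantifier" and "replacement" are hypothetical.
def compositional_split(samples, held_in_quantifiers, held_in_replacements):
    train, test = [], []
    for ex in samples:
        seen_q = ex["quantifier"] in held_in_quantifiers
        seen_r = ex["replacement"] in held_in_replacements
        (train if seen_q or seen_r else test).append(ex)
    return train, test

toy = [{"quantifier": q, "replacement": r}
       for q in ["some", "no"] for r in ["dog->animal", "dog->small dog"]]
train, test = compositional_split(toy, {"some"}, {"dog->animal"})
print(len(train), len(test))  # 3 1
```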
In Table 7, all models are trained with 3,270 samples and tested on the complementary test set with about 9,112 examples, exactly following the data split in Yanaka et al. (2020). While all baseline models achieved high training accuracy, BERT has limited performance on the test set. For our model, there is only a 3% gap between the training and test performance, which demonstrates that our model successfully learns to identify and compose the natural logic relations of the predicate replacements with limited training examples.
Results for compositional generalization; ↑↓ marks the models with polarity features.
| Model | Train | Test | Test_adv | Test_pp |
|---|---|---|---|---|
| BERT-base | 100.0 | 69.2 | 50.8 | 49.3 |
| GPT-2 | 100.0 | 25.6 | 35.6 | 35.4 |
| BERT-base↑↓ | 100.0 | 65.4 | 51.4 | 52.7 |
| GPT-2↑↓ | 100.0 | 28.1 | 35.1 | 39.6 |
| Ours w/o IR | 91.3 | 79.3 | 57.1 | 54.0 |
| Ours | 98.4 | 95.1 | 61.0 | 61.5 |
We also compare our model to variants of BERT and GPT-2 models that are aware of the token projectivity (models with ↑↓ in Table 7). Specifically, for each token, we concatenate the hidden states in the final layer of transformer with its projectivity feature. We aggregated the concatenated features with multiple feed-forward layers and applied average pooling before sending them to the classification layer. Results show that BERT and GPT-2 do not benefit from the projectivity features. The test accuracy drops with additional adverbs and preposition phrases, leaving space for future research on the robustness to unseen perturbations.
5.4 Evaluation of Model Explainability
From the perspective of natural logic, we follow Feng et al. (2020) to evaluate the quality of the natural logic programs. For each sample, the Natural Logic 2-hop dataset provides the gold program execution states, and we evaluate the accuracy of our predicted states against the ground truth. We compare our model with representative neural rationalization models proposed by Lei et al. (2016), which learns to extract rationales without direct supervision, and Feng et al. (2020), which explains its prediction by generating natural logic reasoning paths. The summary statistics in Table 8 show that our model matches Lei et al. (2016) on the IOU score, and that it produces rationales with significantly higher precision and F1-scores on the e-SNLI test set. The superior rationalization performance is also supported by the human evaluation (the 4th column in Table 8). Compared to Feng et al. (2020), our model produces intermediate natural logic states that better agree with the ground truth. The results in Table 8 also show that the model explanation significantly benefits from the external knowledge (B0 vs. B1), and that the answer-driven revision alone does not improve the quality of the generated rationales (B1 vs. B2). We also compare our model to the system that replaces the uni-directional attention model GPT-2 with the bi-directional attention model BERT. The model with the BERT encoder yields significantly lower scores on interpretability (B0 vs. B4).
Evaluation for the model generated explanation.
| Model | e-SNLI IOU | e-SNLI Precision / Recall / F1 | e-SNLI Human Eval. | Natural Logic 2-hop Acc. |
|---|---|---|---|---|
| Lei et al. (2016) | 0.42 | 0.37 / 0.46 / 0.41 | 56 / 100 | – |
| Feng et al. (2020) | 0.27 | 0.21 / 0.35 / 0.26 | 52 / 100 | 0.44 |
| Ours – full model (B0) | 0.44 | 0.58 / 0.49 / 0.53 | 80 / 100 | 0.52 |
| w/o external knowledge (B1) | 0.41 | 0.53 / 0.45 / 0.48 | 67 / 100 | 0.44 |
| w/o introspective revision (B2) | 0.40 | 0.52 / 0.43 / 0.47 | 68 / 100 | 0.43 |
| w/o relation augmentation (B3) | 0.44 | 0.57 / 0.48 / 0.52 | 75 / 100 | 0.51 |
| Ours – BERT encoder (B4) | 0.14 | 0.20 / 0.15 / 0.17 | 29 / 100 | 0.28 |
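For reference, the sketch below computes the token-level rationale metrics reported in Table 8 in their most common form; the exact matching protocol used for the IOU score (e.g., ERASER-style partial matching) is an assumption.

```python
# Token-level rationale metrics in their most common form; the exact IOU
# matching protocol used in Table 8 (e.g., ERASER-style) is an assumption.
def rationale_scores(pred_tokens: set, gold_tokens: set):
    inter = len(pred_tokens & gold_tokens)
    union = len(pred_tokens | gold_tokens)
    precision = inter / len(pred_tokens) if pred_tokens else 0.0
    recall = inter / len(gold_tokens) if gold_tokens else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"iou": inter / union if union else 0.0,
            "precision": precision, "recall": recall, "f1": f1}

print(rationale_scores({"the", "ocean"}, {"ocean", "fountain"}))
# {'iou': 0.333..., 'precision': 0.5, 'recall': 0.5, 'f1': 0.5}
```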
5.5 Case Study
The upper part of Figure 2 shows how our natural logic model makes predictions during testing. The left example involves upward monotone. Upon seeing the premise and the first hypothesis phrase A biker rides, the model predicts the local relation as forward entailment (r1 = ‘⊑’) at time step t = 1. The predicted relation stays unchanged after applying the projection function, ρ(‘⊑’) = ‘⊑’, because it is in the context of upward monotone. According to Table 3 we have z1 = z0 ⊗ r1 = ‘⊑’. Similarly, as the second prediction for the phrase next to, the relation equivalence (r2 = ‘≡’) does not change the reasoning state because z2 = z1 ⊗ r2 = ‘⊑’. Since the third hypothesis phrase the ocean is a distinct concept from a fountain in the premise, our model outputs the relation alternation (r3 = ‘∣’) and we have z3 = z2 ⊗ r3 = ‘∣’. The model runs out of hypothesis phrases after 3 steps and reaches contradiction according to the final state z3.
Examples for predictions and explanation for some cases from SNLI (left) and MoNLI (right).
An additional example with downward monotone is illustrated on the right of Figure 2. Our model predicts the relation forward entailment (r3 = ‘⊑’) at the third time step since food includes hamburger. The projection function flips the relation to reverse entailment (⊒) because, according to the projectivity in Table 2, the first argument that follows the negation “did not” is in downward monotone, i.e., ρ(‘⊑’) = ‘⊒’.
At the bottom of Figure 2, we provide examples of the reasoning processes produced by the natural logic model that is built upon the bi-directional attention model BERT. Although it produces the same final labels as our proposed model, the model based on BERT can predict wrong local relations due to its entangling effect. Specifically, the model with bi-directional attention is prone to making its final decision at the first phrase (82% of the cases in the human evaluation), and then predicting local relations that keep the initial decision unchanged during the program execution (according to the composition rules in Table 3). In the first example in Figure 2, to keep the first predicted relation alternation (∣) unchanged during execution, the model subsequently predicts a series of equivalence (≡) relations. In the second example, the model predicts the local relation forward entailment (⊑) for each hypothesis phrase, and at the last step, the forward entailment (⊑) relation is projected to reverse entailment (⊒) according to the projectivity.
6 Summary
The proposed neuro-symbolic framework integrates the long-studied natural logic with reinforcement learning and introspective revision, effectively rewarding the intermediate proof paths and leveraging external knowledge to alleviate spurious reasoning. The model has built-in interpretability following natural logic, which allows for a wide range of intuitive inferences easily understandable by humans. Experimental results show the model’s superior capability in monotonicity-based inferences and systematic generalization, compared to previous models on the existing datasets, while the model keeps competitive performance on the generic SNLI test set.
Acknowledgments
This research was supported by NSERC Discovery Grants. We thank the anonymous reviewers and action editors for their helpful comments.
Author notes
Action Editor: Benjamin Van Durme
Equal contribution.