Neuro-symbolic Natural Logic with Introspective Revision for Natural Language Inference

We introduce a neuro-symbolic natural logic framework based on reinforcement learning with introspective revision. The model samples and rewards specific reasoning paths through policy gradient, in which the introspective revision algorithm modifies intermediate symbolic reasoning steps to discover reward-earning operations as well as leverages external knowledge to alleviate spurious reasoning and training inefficiency. The framework is supported by properly designed local relation models to avoid input entangling, which helps ensure the interpretability of the proof paths. The proposed model has built-in interpretability and shows superior capability in monotonicity inference, systematic generalization, and interpretability, compared with previous models on the existing datasets.


Introduction
In the past decade, deep neural networks have achieved impressive performance on modeling natural language inference (NLI) (Dagan et al., 2005; MacCartney, 2009; Bowman et al., 2015; Chen et al., 2017a,b), which aims to determine the entailment relations between a premise sentence and its corresponding hypothesis. Progress in NLI has greatly benefited from the models' capabilities at approximating complex underlying functions, discovering and utilizing rich (true and/or spurious) patterns, and exhibiting robustness to noise and ambiguity. However, the black-box models inherently lack interpretability, and still fail to capture many aspects of human reasoning, including monotonicity inference (Yanaka et al., 2019a,b, 2020), systematic compositionality and generalization (Fodor and Pylyshyn, 1988; Aydede, 1997; Yanaka et al., 2020), and negation (Geiger et al., 2020), among others.
In this paper, we present a neuro-symbolic framework that integrates natural logic with neural networks for natural language inference. At the local level, we explore appropriate transformer networks to model the local relations between the constituents of a premise and hypothesis, in order to prevent attention from fully entangling the input, which otherwise can seriously impair the interpretability of proof paths built on local relations. We then construct natural logic programs and use reinforcement learning to reward the aggregation of the local relations. When reinforcement learning passes the final reward signals (NLI labels) through the neural natural logic composition network, it faces the challenges of excessive spurious programs (incorrect programs that lead to correct final NLI labels) as well as training inefficiency; the former is particularly harmful to interpretability. Our framework uses the proposed Introspective Revision method to discover better reward-earning operations and leverages external knowledge to reduce spurious proofs.

Related Work
Natural Logic: Rather than performing deduction over an abstract logical form, natural logic (Lakoff, 1970; van Benthem, 1988; Valencia, 1991; Van Benthem, 1995; Nairn et al., 2006; MacCartney, 2009; MacCartney and Manning, 2009; Icard, 2012; Angeli and Manning, 2014) models logical inferences in natural language by operating directly on the structure of language. Natural logic allows for a wide range of intuitive inferences in a conceptually clean way (MacCartney, 2009; Angeli and Manning, 2014) and hence provides a good framework for developing explainable neural natural language inference models. Specifically, our work is motivated by the natural logic variant proposed by MacCartney and Manning (2009), for which we will provide more background in Sec. 3.
Natural Language Inference: Natural language inference (NLI) (Dagan et al., 2005; MacCartney, 2009; Bowman et al., 2015) aims to identify the entailment relations between premise-hypothesis sentence pairs. Benefiting from pre-training on large-scale unlabeled corpora and then fine-tuning on large crowd-sourced datasets like SNLI (Bowman et al., 2015) and MultiNLI (Williams et al., 2018), pre-trained language models (Devlin et al., 2019; Radford et al., 2018, 2019) have achieved state-of-the-art performance. However, recent work has revealed several drawbacks of the current deep NLI systems. Gururangan et al. (2018) and Poliak et al. (2018) have shown that deep NLI models learn to utilize dataset biases and label-relevant artifacts for prediction. Yanaka et al. (2019a,b) and Geiger et al. (2020) showed that a dominating proportion of samples in SNLI and MultiNLI are upward monotone, and that models trained on these datasets have limited ability to generalize to downward monotone. More recently, systematically generated datasets have been proposed to evaluate the current models' ability on compositional generalization, and showed that pretrained transformers generalize poorly to unseen combinations of the semantic fragments (Geiger et al., 2019; Richardson et al., 2020; Yanaka et al., 2020; Goodwin et al., 2020).
Neural Network with Logic Components for NLI: Recent works (Kalouli et al., 2020; Hu et al., 2020; Chen et al., 2021; Feng et al., 2020; Wu et al., 2021) have started to combine neural networks with logic-based components. The work most related to ours is Feng et al. (2020), which adapts ESIM (Chen et al., 2017b) to predict relations between tokens in a premise and hypothesis, and composes them to predict final inferential labels. Rather than optimizing the likelihood of specific reasoning paths, the model maximizes the sum of the likelihood of all possible paths (i.e., marginal likelihood) that reach the correct final NLI labels. As a result, the model potentially encourages a large set of spurious reasoning paths and has to rely on external priors and strong constraints to predict meaningful intermediate local relations.
This paper, instead, proposes a reinforcement learning with introspective revision framework to sample and reward specific reasoning paths through the policy gradient method. The introspective revision leverages external commonsense knowledge to tackle spurious proof paths and training inefficiency, key issues in developing interpretable neuro-symbolic models. To support that, local relation components need to be carefully designed. We will demonstrate that the proposed model substantially outperforms that proposed in Feng et al. (2020) on five datasets.
Policy Gradient: Policy gradient algorithms like REINFORCE (Williams, 1992) have been used in neuro-symbolic models to connect neural representation learning and symbolic reasoning (Andreas et al., 2017; Liang et al., 2017; Mascharka et al., 2018; Yi et al., 2018; Mao et al., 2018). The original REINFORCE algorithm suffers from sparse rewards and high variance in the gradient. To overcome these issues, Popov et al. (2017), Goyal et al. (2019), and Trott et al. (2019) propose reward shaping, which leverages domain-specific knowledge to carefully design the reward functions. Instead of learning only from the desired outcomes, some approaches also learn from failed attempts. Hindsight Experience Replay (HER) (Andrychowicz et al., 2017) and Scheduled Auxiliary Control (SAC-X) (Riedmiller et al., 2018) [...]
Assuming the availability of the alignment between a premise and hypothesis, the system first infers the relations between aligned pairs of words or phrases. Consider the top-left example in Fig. 1: the relation between "the child" and "the kid" is equivalence (≡), same as the relation between "does not love" and "doesn't like", while "sports" reversely entails (⊐) "table-tennis".
The next step is monotonicity inference. Monotonicity is a pervasive feature of natural language that explains the impact of semantic composition on entailment relations (Van Benthem, 1986; Valencia, 1991; Icard and Moss, 2014). Similar to the monotone functions in calculus, upward monotone keeps the entailment relation when the argument "increases" (e.g., cat ⊏ animal). Downward monotone keeps the entailment relation when the argument "decreases" (e.g., all animals ⊏ all cats). The system performs monotonicity inference through a projection function $\rho: B \to B$, which is determined by the context and projection rules. Table 2 shows some examples. Consider the last row in the table, which shows how the projection function $\rho$ works in the negated context following the negation word not. Specifically, this row shows the seven relations that $\rho(r)$ will output, given the corresponding input relations $r$. For example, if the input relation is forward entailment (⊏), the function $\rho$ projects it to reverse entailment (⊐); i.e., $\rho(\sqsubset) = \sqsupset$. As a result, in the example in Fig. 1, the reverse entailment relation (⊐) between "sports" and "table-tennis" will be projected to forward entailment (⊏) in the negated context.
Built on that, the system aggregates/composes the projected local relations to obtain the inferential relation between a premise and hypothesis. Specifically, Table 3 shows the composition function when a relation (in a row) is composed with another (in a column). In practice, multiple such compositions are performed in sequential order or from leaves to root along a constituency parse tree. MacCartney (2009) shows that different orders of composition yield consistent results except in some rare artificial cases. Therefore, many works, including ours here, perform a sequential (left-to-right) composition. In the example in Fig. 1, composing two equivalence (≡) relations with forward entailment (⊏) yields forward entailment (⊏), resulting in a prediction that the premise entails the hypothesis.
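The projection and composition steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the relation names are ASCII stand-ins, the negation projection follows the mapping commonly attributed to MacCartney and Manning (2009) and is assumed to match the last row of Table 2, and the join table is the commonly cited approximation (uncertain results collapsed to independence), assumed to match Table 3.

```python
# Seven natural logic relations, named in ASCII for readability.
EQ, FWD, REV, NEG, ALT, COV, IND = "eq", "fwd", "rev", "neg", "alt", "cov", "ind"
RELATIONS = [EQ, FWD, REV, NEG, ALT, COV, IND]

# Projection in a negated context (assumed to match the last row of Table 2).
NEGATION_PROJECTION = {
    EQ: EQ, FWD: REV, REV: FWD, NEG: NEG, ALT: COV, COV: ALT, IND: IND,
}

# Join table (assumed form of Table 3): row = current state, column = next relation.
_ROWS = {
    EQ:  [EQ, FWD, REV, NEG, ALT, COV, IND],
    FWD: [FWD, FWD, IND, ALT, ALT, IND, IND],
    REV: [REV, IND, REV, COV, IND, COV, IND],
    NEG: [NEG, COV, ALT, EQ, REV, FWD, IND],
    ALT: [ALT, IND, ALT, FWD, IND, FWD, IND],
    COV: [COV, COV, IND, REV, REV, IND, IND],
    IND: [IND] * 7,
}
JOIN = {state: dict(zip(RELATIONS, row)) for state, row in _ROWS.items()}


def project(relation, negated):
    """Apply the projection function rho for a negated or plain upward context."""
    return NEGATION_PROJECTION[relation] if negated else relation


def compose(relations, start=EQ):
    """Compose projected relations sequentially, left to right."""
    state = start
    for relation in relations:
        state = JOIN[state][relation]
    return state


# Fig. 1 example: two equivalences, then "sports" vs. "table-tennis"
# (reverse entailment) inside the scope of negation.
local = [(EQ, False), (EQ, False), (REV, True)]
projected = [project(relation, negated) for relation, negated in local]
final = compose(projected)
print(projected, final)
```

Here `final` is forward entailment, which corresponds to the entailment prediction discussed for Fig. 1.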

Method
This section introduces our neural natural logic framework based on the proposed Reinforcement Learning with Introspective Revision approach. We start with local relation modeling, in which caution needs to be taken to avoid the input entangling problem, which can seriously harm the model's interpretability. By viewing the local relation distribution as the stochastic policy, our model then samples and rewards specific reasoning paths through policy gradient, in which the Introspective Revision model can modify intermediate symbolic reasoning steps to discover better reward-earning operations and leverages external knowledge to alleviate spurious reasoning and training inefficiency.

Local Relation Modeling
We use phrases/chunks instead of words as the basic reasoning units. The primary motivation for chunking is to shorten the reasoning paths and hence reduce the number of possible paths, both of which make the reasoning process more efficient. Motivated by Ouyang and McKeown (2019), we segment the premise $P$ and the hypothesis $H$ into several phrases/chunks. Specifically, we first extract noun phrases with spaCy (Honnibal et al., 2020) and then group the continuous spans of words between two noun phrases as chunks. As shown in Fig. 1, by identifying the noun phrases "the kid" and "table tennis", the hypothesis sentence $H$ is segmented into three chunks. We denote the number of chunks in the hypothesis as $m$, and the $t$-th hypothesis chunk (and its vectorized representation) as $s_t$. Similarly, the $t'$-th premise phrase is denoted as $\tilde{s}_{t'}$.
As the first step of the neuro-symbolic natural logic, we use a neural network to model the local natural logic relation between each hypothesis phrase $s_t$ and its associated premise constituents. However, accurately finding the hard alignment between $s_t$ and the corresponding phrase $\tilde{s}_{t'}$ in the premise is a hard problem (MacCartney et al., 2008). Current state-of-the-art NLI systems, like BERT (Devlin et al., 2019), use bi-directional soft attention to model the cross-sentence relationship; however, we observe that it tends to fully entangle the input (DeYoung et al., 2020). Consider the top-left example in Fig. 1. If we use BERT to encode the input sentences, then the bi-directional attention model can infer the final NLI label solely based on the last-layer hidden states of the first hypothesis phrase "the kid", because the contextualized representation of this phrase entangles the information of the whole input through attention. Consequently, the hidden states of the phrase contain global information and are thus not suitable for modeling the local relations.
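The chunking step can be illustrated with a minimal sketch. It assumes the noun-phrase spans have already been extracted (for example by a parser) and are given as sorted, non-overlapping token index ranges; the function name `chunk_by_noun_phrases` and the span format are hypothetical choices for this illustration.

```python
def chunk_by_noun_phrases(tokens, noun_phrase_spans):
    """Split a sentence into chunks: each noun phrase is a chunk, and each
    continuous run of tokens between noun phrases is a chunk.

    noun_phrase_spans: sorted, non-overlapping (start, end) token indices,
    with end exclusive (an assumed input format).
    """
    chunks, cursor = [], 0
    for start, end in noun_phrase_spans:
        if cursor < start:  # tokens between two noun phrases
            chunks.append(tokens[cursor:start])
        chunks.append(tokens[start:end])
        cursor = end
    if cursor < len(tokens):  # trailing tokens after the last noun phrase
        chunks.append(tokens[cursor:])
    return [" ".join(chunk) for chunk in chunks]


hypothesis = ["The", "kid", "doesn't", "like", "table", "tennis"]
spans = [(0, 2), (4, 6)]  # "The kid" and "table tennis"
chunks = chunk_by_noun_phrases(hypothesis, spans)
print(chunks)
```

For the hypothesis in Fig. 1 this yields three chunks, so $m = 3$.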
To alleviate the undesired entangling, we model local relations with uni-directional attention (such as GPT-2). On the one hand, the uni-directional attention prevents entangling future inputs. For example, in Fig. 1, the phrase "table tennis" will not affect the relation prediction anchored on "The kid". On the other hand, although the last hypothesis phrase attends to all previous inputs, without knowing whether the current phrase is the ending one (the future inputs are not available), the model cannot skip predicting the natural logic relation at the current phrase $s_t$ and postpone all the required reasoning to the last phrase. Specifically, suppose a model always predicts equivalence (≡) at each step $t$ and postpones its final decision to the last hypothesis phrase. Without knowing that "table tennis" is the ending phrase, the model can predict equivalence (≡) for "table tennis" and wait to make a better decision upon seeing the next input phrase, which actually does not exist. Failing to make timely local predictions that lead to the correct label before running out of hypothesis phrases, the model will receive a negative reward in the end. In this way, the model is encouraged to be more careful in predicting the local relation for each hypothesis phrase. We also develop a model that obtains local relations by masking both the past and future hypothesis chunks. Compared to such a model, we will show later (Table 6) that the uni-directional attention model performs better, partly because it preserves the structure of the pretrained GPT-2 model.
Specifically, we propose to model the local relation between $s_t$ and the premise $P$, which can be efficiently achieved by the pretrained GPT-2 model (Radford et al., 2019). We concatenate a premise and hypothesis as the input and separate them with a special token ⟨sep⟩. The contextualized encoding $h_\tau$ for the $\tau$-th hypothesis token is extracted from the GPT-2 last-layer hidden states at the corresponding location:
$$h_\tau = \text{GPT-2}(P, \langle\text{sep}\rangle, H_{1:\tau}) \quad (1)$$
For the $t$-th phrase in the hypothesis, $s_t = H_{\tau_1:\tau_2}$, which starts from position $\tau_1$ and ends at position $\tau_2$, we concatenate the features of the starting token $h_{\tau_1}$ and the ending token $h_{\tau_2}$ as the vectorized phrase representation:
$$s_t = \mathrm{Concat}(h_{\tau_1}, h_{\tau_2}) \quad (2)$$
We use a feed-forward network $f$ with ReLU activation to model the local natural logic relations between the hypothesis phrase $s_t$ and its potential counterpart in the premise. The feed-forward network outputs 7 logits that correspond to the seven natural logic relations listed in Table 1. The logits are converted with softmax to obtain the local relation distribution:
$$p_t = \mathrm{softmax}(f(s_t)) \quad (3)$$
Intuitively, the model learns to align each hypothesis phrase $s_t$ with the corresponding premise constituents through attention, and combines information from both sources to model local relations. In practice, the local relation distribution is defined over five relations: we merge the relations negation (^) and alternation (|) because they have similar behaviors in Table 3, and we suppress cover (⌣) because it is rare in the current NLI datasets. Hence we only need to model five natural logic relation types, following Feng et al. (2020).
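The local relation head described above (concatenating the start-token and end-token features, then applying a feed-forward network with ReLU and a softmax) can be sketched as follows. This is a toy illustration, not the authors' implementation: random vectors stand in for the GPT-2 hidden states, and the dimensions, the single hidden layer, and all names (`phrase_representation`, `local_relation_distribution`, `params`) are assumptions.

```python
import math
import random


def linear(x, weights, bias):
    """Affine map: one output per (weight row, bias) pair."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def phrase_representation(hidden_states, tau1, tau2):
    """Concatenate the start-token and end-token features of a phrase."""
    return hidden_states[tau1] + hidden_states[tau2]


def local_relation_distribution(s_t, params):
    """Feed-forward network with ReLU, followed by softmax over relations."""
    hidden = [max(0.0, v) for v in linear(s_t, params["w1"], params["b1"])]
    return softmax(linear(hidden, params["w2"], params["b2"]))


def random_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
token_dim, hidden_dim, num_relations = 4, 6, 5  # toy sizes (assumed)
# Stand-in for the GPT-2 last-layer hidden states of six hypothesis tokens.
hidden_states = random_matrix(rng, 6, token_dim)
params = {
    "w1": random_matrix(rng, hidden_dim, 2 * token_dim), "b1": [0.0] * hidden_dim,
    "w2": random_matrix(rng, num_relations, hidden_dim), "b2": [0.0] * num_relations,
}
s_t = phrase_representation(hidden_states, 4, 5)  # e.g. the phrase "table tennis"
p_t = local_relation_distribution(s_t, params)
print(p_t)
```

The output `p_t` is a probability vector over the five modeled relations and plays the role of the stochastic policy at step $t$.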

Natural Logic Program
We propose to use reinforcement learning to develop neural natural logic, which views the local relation distribution $p_t$ as the stochastic policy.
At each time step $t$, the model samples a relation $r_t \in B$ according to the policy, and we treat the sequence of sampled relations $\{r_t\}_{t=1}^{m}$ as a symbolic program, which executes to produce the final inferential relation between a premise and hypothesis. To the best of our knowledge, this is the first model that integrates reinforcement learning with natural logic.
Built on the natural logic formalism of MacCartney and Manning (2009), a projection function $\rho$ (Eq. 5) maps $r_t$ to a new relation $\bar{r}_t$. In our model, the projection function $\rho$ is determined by the projectivity feature from the StanfordCoreNLP natlog parser. For each input token, the projectivity feature specifies the projected relation $\bar{r}_t$ for each input relation $r_t$. In this work, we extend the token-level projectivity to handle phrases: for a phrase with multiple tokens, $\rho$ is determined by the projectivity of the first token in the phrase. In Fig. 1, the projectivity of the phrase "table tennis" is determined by the first token "table", and $\rho$ projects the predicted reverse entailment (⊐) relation to forward entailment (⊏).
$$r_t = \mathrm{sampling}(p_t) \quad (4)$$
$$\bar{r}_t = \rho(r_t) \quad (5)$$
The program then composes the projected relations $\{\bar{r}_t\}_{t=1}^{m}$ to derive the final relation prediction, as shown in the top-right part of Fig. 1. Specifically, at time step $t = 0$, the executor starts with the default state $z_0$ = equivalence (≡). For each hypothesis phrase $s_t$, $t > 0$, the program performs a one-step update to compose the previous state $z_{t-1}$ with the projected relation $\bar{r}_t$:
$$z_t = \mathrm{step}(z_{t-1}, \bar{r}_t) \quad (6)$$
The final prediction is yielded from the last state $z_m$ of program execution. Following Angeli and Manning (2014), we group equivalence (≡) and forward entailment (⊏) as entailment; negation (^) and alternation (|) as contradiction; and reverse entailment (⊐), cover (⌣), and independence (#) as neutral.
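Two small pieces of the program described above, sampling a relation per phrase from the policy and mapping the final state to an NLI label, can be sketched as follows. The relation names and function names are illustrative assumptions; the label grouping follows the one stated in the text, and the one-hot toy policy is chosen only so that the sampled program is predictable.

```python
import random

RELATIONS = ["equivalence", "forward_entailment", "reverse_entailment",
             "negation", "alternation", "cover", "independence"]

# Grouping of the final state z_m into the three NLI labels, as stated above.
LABEL_OF_STATE = {
    "equivalence": "entailment", "forward_entailment": "entailment",
    "negation": "contradiction", "alternation": "contradiction",
    "reverse_entailment": "neutral", "cover": "neutral", "independence": "neutral",
}


def sample_program(policy, rng):
    """Draw one relation per hypothesis phrase from its distribution p_t."""
    return [rng.choices(RELATIONS, weights=p_t, k=1)[0] for p_t in policy]


def final_label(z_m):
    """Map the last execution state to entailment / contradiction / neutral."""
    return LABEL_OF_STATE[z_m]


rng = random.Random(0)
# Toy one-hot policy for two phrases (hypothetical values).
policy = [[1, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0]]
program = sample_program(policy, rng)
print(program, final_label("forward_entailment"))
```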
Rewards and Optimization: During training, we reward the model when the program executes to the correct answer. Given a sequence of local relations $r = \{r_t\}_{t=1}^{m}$, at each step $t$ the model receives a reward $R_t$ (Eq. 7), where $\mu$ is the constant reward unit, $\gamma \in (0, 1]$ is the discount factor, and $y$ is the ground-truth label.
In addition to Eq. 7, different rewards are applied under two exceptional cases: (1) if at step $t$ there is no chance for the program to get a positive reward, then the execution is terminated and the model receives an immediate reward $R_t = -\mu$; (2) when the true label is entailment, the model receives no positive reward if the last state $z_m$ is equivalence (≡). In this way, we encourage the model to select at least one forward entailment (⊏) relation during prediction, instead of aggregating a sequence of equivalence (≡) relations for all entailment cases. In the current NLI datasets, it is less likely that the premise and hypothesis sentences are semantically equivalent to each other. We apply the REINFORCE (Williams, 1992) algorithm to optimize the model parameters. During training, the local relations $r_t$ are sampled from the predicted distribution, and we minimize the policy gradient objective (Eq. 8).
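As a rough illustration of how per-step rewards and a REINFORCE-style objective fit together, the sketch below uses an assumed reward shape: $+\mu\gamma^{m-t}$ at step $t$ when the program executes to the gold label and $-\mu\gamma^{m-t}$ otherwise. This shape, the loss form $-\sum_t R_t \log p_t[r_t]$, and the function names are assumptions made for the example and are not taken from the equations of the paper; the two exceptional cases described above are not handled.

```python
import math


def step_rewards(predicted_label, gold_label, num_steps, mu=1.0, gamma=0.5):
    """Assumed reward shape: discounted +mu / -mu depending on the final label.

    Steps are 1-indexed (t = 1..m); later steps are discounted less.
    """
    sign = 1.0 if predicted_label == gold_label else -1.0
    return [sign * mu * gamma ** (num_steps - t) for t in range(1, num_steps + 1)]


def reinforce_loss(rewards, chosen_probs):
    """REINFORCE-style objective: -sum_t R_t * log p_t[r_t] (assumed form)."""
    return -sum(r * math.log(p) for r, p in zip(rewards, chosen_probs))


rewards = step_rewards("entailment", "entailment", num_steps=3)
# Probabilities the policy assigned to the sampled relations (toy values).
loss = reinforce_loss(rewards, [0.5, 0.5, 0.5])
print(rewards, loss)
```

Minimizing this loss raises the probability of sampled relations that earned positive rewards and lowers it for those that earned negative ones.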

Introspective Revision
The key challenges of developing interpretable neural natural logic models include coping with spurious reasoning paths (incorrect paths $r = \{r_t\}_{t=1}^{m}$ leading to the correct inferential label for a premise-hypothesis pair) as well as training inefficiency. Finding a correct program that reaches the correct label is challenging because it is inefficient to explore a space of $5^m$ paths for a reward. A positive reward to the correct path is often sparse.
We propose to use a fail-and-fix approach based on the recently proposed Back-Search algorithm (Li et al., 2020) to mitigate the training inefficiency caused by sparse positive rewards: starting from a failed program that earns no positive reward, it searches the neighborhood of that program for better proof paths that reach the correct final prediction. To solve the spurious issue in this fail-and-fix framework, we propose Introspective Revision, which leverages external commonsense knowledge (denoted as $K$) to control spurious proof paths. We believe unstated commonsense knowledge is important not only for improving prediction accuracy (which, as discussed in Sec. 2, often results from fitting to spurious correlations), but also critical for developing interpretable natural language reasoning models by avoiding spurious proofs.
Without loss of generality, we distinguish a non-spurious program $r^*$ from spurious ones based on the following assumption, whose effectiveness will be shown and discussed in our experiments.
Assumption 4.1 A program $r^*$ has a larger probability than another program $r$ to be a non-spurious program if $r^*$ has a better agreement with the external knowledge base $K$.
External Knowledge: Previous work (Chen et al., 2017a) queries the knowledge base for each pair of words between a premise and hypothesis exhaustively, which is inefficient and likely to introduce undesired local relations. As a remedy, we found that the lightweight text alignment tool JacanaAligner (Yao et al., 2013), though not accurate enough to align all pairs of associated phrases in the input, can be used to guide the search. For a hypothesis phrase $s$, we first apply JacanaAligner to obtain its associated premise phrase $\tilde{s}$, and then query WordNet (Miller, 1998) [...] where $u$, $v$ denote tokens in the phrase and $s \subset \tilde{s}$ means that $s$ is a sub-phrase of $\tilde{s}$. The local relations suggested by the knowledge base are formulated as a set of triplet proposals $(t, \tilde{r}_t, p_t[\tilde{r}_t])$, where $t$ is the time step, $\tilde{r}_t$ is the suggested relation, and $p_t[\tilde{r}_t]$ is the model-predicted probability that corresponds to $\tilde{r}_t$.
Human-curated rules, which are designed to retrieve natural logic relations from the knowledge base, are often imperfect. They inevitably introduce errors due to language variations. For example, intuitively $s \subset \tilde{s}$ indicates forward entailment (⊏), e.g., "white cat" entails "cat", while there are cases where the sub-phrase rule indicates equivalence (≡), e.g., "have a chat with" is equivalent to "chat with" in meaning. In rare cases, the relation can be alternation (|), e.g., "fake gun" and "gun" are distinct concepts. While $s = \tilde{s}$ often indicates equivalence (≡), our rules need to handle cases where the adverbial is placed in separate phrases, e.g., "a bike" and "near the park" entail "a bike".
To deal with this issue, instead of making an intensive effort to design sophisticated rules to pinpoint a single accurate relation, we design relatively coarse rules to narrow down the possibilities and leave the final choice to the model.Specifically, at each step we provide the model with multiple possible candidates, and the proposed introspective revision algorithm introduced in this section decides to accept a useful proposal or reject a misleading one, based on both the reasoning objective (i.e. the label) and the predicted relation distribution.
Algorithm: Given a program $r = \{r_t\}_{t=1}^{m}$, the goal of the Introspective Revision algorithm is to find a program $r^*$ in the neighbourhood of $r$ that executes to the correct answer $y$ while maintaining a large agreement with the external knowledge $K$, as detailed in Algorithm 1. The algorithm starts with knowledge-driven revision (lines 2-15). We arrange the triplet proposals obtained from the knowledge base as a priority queue $\Phi = \{(t, \tilde{r}_t, p_t[\tilde{r}_t]) \mid 0 < t \le m, \tilde{r}_t \in B\}$. In each iteration the queue pops the triplet with the largest probability $p_t[\tilde{r}_t]$, which specifies a modification to the sampled program, $r' = \mathrm{Fix}(r, t, \tilde{r}_t)$. In other words, changing the relation $r_t$ at step $t$ of program $r$ to the proposed relation $\tilde{r}_t$ yields a new program $r'$. Following Li et al. (2020), the modification is accepted with a probability $1 - \epsilon$ if $r'$ executes to the correct answer $y$; otherwise, it is accepted with a probability $\min(1, p_t[\tilde{r}_t] / p_t[r_t])$.
The hyperparameter $\epsilon$ encourages the model to explore low-probability proposals. For each sample, the model accepts or rejects up to $M$ triplets.
The knowledge-driven revision above is conservative because only the top-$M$ proposals are considered. However, there are complex cases where the program still cannot reach the correct answer after $M$ steps, or where the provided proposals are insufficient to solve the problem. In these cases, we apply the answer-driven revision (lines 17-22) by conducting a $5 \times m$ grid search to find modifications that lead to the correct answers. Among the search results $\Psi$, we accept the triplet with the maximum probability. A detailed description of the grid search is presented in Algorithm 2.
Following the reward in Eq. 7 and the objective function in Eq. 8, we compute a new objective function $J'$ with the modified program $r^*$ and its [...] where $\lambda$ is a weight that specifies the importance of the revision. The introspective revision algorithm is only applied during training since the label $y$ is required to determine whether a proposal is accepted or not.
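The two revision stages can be sketched as below. This is a simplified reading of the prose, not a transcription of Algorithm 1 or Algorithm 2: steps are 0-indexed, `execute` is a caller-supplied function that maps a program to a label, the answer-driven stage tries single-position changes only, and the rule of falling back to it when the knowledge-driven result still fails is an assumption. The toy `execute` at the bottom is a stand-in and does not implement natural logic composition.

```python
import heapq
import random


def fix(program, t, relation):
    """Return a copy of the program with the relation at step t replaced."""
    revised = list(program)
    revised[t] = relation
    return revised


def knowledge_driven_revision(program, proposals, probs, execute, gold,
                              epsilon, max_steps, rng):
    """Pop proposals by descending probability; accept or reject up to max_steps."""
    heap = [(-probs[t][rel], t, rel) for t, rel in proposals]
    heapq.heapify(heap)
    current = list(program)
    for _ in range(max_steps):
        if not heap:
            break
        neg_p, t, rel = heapq.heappop(heap)
        candidate = fix(current, t, rel)
        if execute(candidate) == gold:
            accept_prob = 1.0 - epsilon
        else:
            accept_prob = min(1.0, -neg_p / probs[t][current[t]])
        if rng.random() < accept_prob:
            current = candidate
    return current


def answer_driven_revision(program, probs, execute, gold, relations):
    """Grid search over (step, relation); keep the most probable correct fix."""
    best = None
    for t in range(len(program)):
        for rel in relations:
            if rel == program[t]:
                continue
            candidate = fix(program, t, rel)
            if execute(candidate) == gold and (best is None or probs[t][rel] > best[0]):
                best = (probs[t][rel], candidate)
    return best[1] if best else list(program)


def introspective_revision(program, proposals, probs, execute, gold, relations,
                           epsilon=0.2, max_steps=3, rng=None):
    rng = rng or random.Random(0)
    revised = knowledge_driven_revision(program, proposals, probs, execute, gold,
                                        epsilon, max_steps, rng)
    if execute(revised) != gold:  # assumed fallback condition
        revised = answer_driven_revision(revised, probs, execute, gold, relations)
    return revised


# Toy stand-in executor: entailment only if every relation is "eq" or "fwd".
def execute(program):
    return "entailment" if all(r in ("eq", "fwd") for r in program) else "neutral"


relations = ["eq", "fwd", "rev"]
probs = [{"eq": 0.9, "fwd": 0.05, "rev": 0.05},
         {"eq": 0.2, "fwd": 0.3, "rev": 0.5},
         {"eq": 0.8, "fwd": 0.1, "rev": 0.1}]
failed = ["eq", "rev", "eq"]
with_knowledge = introspective_revision(failed, [(1, "fwd")], probs, execute,
                                        "entailment", relations, epsilon=0.0)
without_knowledge = introspective_revision(failed, [], probs, execute,
                                           "entailment", relations)
print(with_knowledge, without_knowledge)
```

With `epsilon=0.0` the correct proposal is always accepted, which keeps the example deterministic; in training a positive $\epsilon$ would occasionally reject it.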

Experiments
We evaluate the performance of the proposed model on six NLI tasks from various perspectives: the ability to perform monotonicity inference (Sec. 5.2), reasoning systematicity (Sec. 5.3), and model interpretability (Sec. 5.4). Our model is trained on Stanford Natural Language Inference (SNLI) (Bowman et al., 2015), in which the relation between a premise and hypothesis is classified as either entailment, contradiction, or neutral. We set the unit reward $\mu = 1.0$, and optimize our model with Adam gradient descent for six epochs with a learning rate of 2e-5. We compare the models with discount factor $\gamma \in \{0.25, 0.50, 0.75, 1.00\}$ and $\epsilon \in \{0.05, 0.10, 0.20\}$. We found that the test accuracies are not sensitive to $\gamma$ when $\gamma \ge 0.50$, and we select $\gamma = 0.50$, $\epsilon = 0.20$, which achieved the best validation accuracy on SNLI. For the introspective revision algorithm we set $M = 3$ based on the average number of proposals (2.383 proposals/sample) in Table 5. We treat the revised program and the original program [...] Results for models (A0) to (A5) are the average of 3 models starting from different consistently-seeded initializations.

Statistics for Introspective Revision
In Table 4, we present the statistics for the introspective revision at the start/end of the training, where the natural logic programs are sampled from the predicted distribution. Approximately 80% of the samples perform at least one step of revision, and at the end of the training, there is an increasing chance (98.4% vs. 59.4%) that introspective revision helps the model reach the final correct NLI prediction. In Table 5, we show the statistics of the average number of triplet proposals obtained from WordNet and the average number of proposals accepted by knowledge-driven or answer-driven revision during training. Equivalence (≡) and forward entailment (⊏) make up a large portion of the proposals, while the alternation relation is scarce due to the sparsity of the antonym relation obtained from WordNet. As a result, the numbers of proposals accepted in the knowledge-driven revision are imbalanced across different relations. Moreover, we found that the number of accepted answer-driven revisions slightly increased at the end of the training, because as the training proceeds, the programs produced by the model are closer to the target labels.

Performance on Monotonicity Reasoning
We conduct experiments on multiple recently proposed challenging test datasets for monotonicity inference: HELP (Yanaka et al., 2019b), MED (Yanaka et al., 2019a), and Monotonicity NLI (MoNLI) (Geiger et al., 2020). Unlike SNLI, half of the samples in HELP, MED, and MoNLI are downward monotone, and they are categorized as entailment or non-entailment. In the above datasets, a premise and the corresponding hypothesis differ by 1 hop; i.e., they differ by either a lexical substitution, insertion, or deletion. In addition, we also evaluated our model on the Natural Logic 2-hop dataset (Feng et al., 2020), which requires a model to perform a 2-hop natural logic composition according to Table 3. We compare our model with popular natural language inference baselines including ESIM (Chen et al., 2017b), BERT-base (Devlin et al., 2019), GPT-2 (Radford et al., 2019), and the model of Feng et al. (2020). Following Yanaka et al. (2019a) and to ensure a fair comparison, all models are trained on SNLI, and during testing, we regard contradiction and neutral as non-entailment if a binary prediction is required.
Table 6 shows the test accuracy on SNLI and four challenging test datasets. Our model performs consistently and significantly better than previous state-of-the-art models on all challenging datasets while achieving competitive "in-domain" performance on SNLI. Manual inspection shows that compared to GPT-2, a significant proportion of the failure cases on SNLI are due to errors from the projectivity parser, and the ambiguity between contradiction and neutral (Bowman et al., 2015). The introspective revision algorithm significantly boosts the model performance on the monotonicity reasoning test sets (A0 vs. A3). Ablation shows that the knowledge-driven revision improves the performance on MoNLI and the 2-hop dataset (A0 vs. A1), which suggests that without proper constraints, the answer-driven revision can lead to spurious reasoning. We found that removing equivalence (≡) (knowledge 1) from the knowledge-driven revision lowers the performance, because in this case the knowledge-driven revision mistakenly encourages the model to replace equivalence (≡) with forward entailment (⊏), which may lead to incorrect predictions under downward monotonicity. Compared to forward entailment (⊏) (knowledge 2), removing reverse entailment (⊐) (knowledge 3) and alternation (|) (knowledge 4) does not significantly affect the results. We deduce that the relative importance of different relations is affected by the frequency of the external knowledge, and that without the help of the knowledge-driven revision, the model can still learn the reverse entailment (⊐) relation from relation augmentation in Sec. 4.2. The performance drops when the relation augmentation is removed (A0 vs. A4).
We also include the model that masks both the past and the future hypothesis chunks in the transformer attention layers for local relation prediction (A5).The model with masked attention yields significantly lower performance on SNLI, partly due to the fact that aggressively masking the past hypothesis chunks changes the structure of the pretrained GPT-2 model, and thus the model benefits less from the pretrained representations.

Systematicity of Monotonicity Inference
Making systematic generalizations from limited data is an essential property of human language (Lake and Baroni, 2018). While fine-tuning pretrained transformers achieves high NLI accuracy, Yanaka et al. (2020) have recently shown that these models have limited capability of capturing the systematicity of monotonicity inference. We use the dataset proposed by Yanaka et al. (2020) to evaluate the model's ability in compositional generalization: the model is exposed to all primitive types of quantifiers $Q$ and predicate replacements $R$, but samples in the training set and test set contain different combinations of quantifiers and predicate replacements. Specifically, with an arbitrarily selected set of quantifiers $\{q\}$ and predicate replacements $\{r\}$, the training set contains data $D_{\{q\},R} \cup D_{Q,\{r\}}$ while the test data only includes the complementary set $D_{Q \setminus \{q\}, R \setminus \{r\}}$. An example of compositional generalization is shown below:
(1) P: Some dogs run ⇒ H: Some animals run
(2) P: No animals run ⇒ H: No dogs run
(3) P: Some small dogs run ⇒ H: Some dogs run
An ideal model can learn from the training samples (1), (2), and (3) the entailment relations between the concepts small dog ⊏ dog ⊏ animal, as well as the fact that the quantifier some indicates upward monotonicity and no indicates downward monotonicity. During testing, the model needs to compose the entailment relations and the monotonicity signatures to make inferences over unseen combinations, e.g., sample (4): [...] To test the model stability, Yanaka et al. (2020) also added adverbs or prepositional phrases as test-only noise to the beginning of both the premise and the hypothesis, e.g., sample (5). In Table 7, all models are trained with 3,270 samples and tested on the complementary test set with about 9,112 examples, exactly following the data split in Yanaka et al. (2020). While all baseline models achieved high training accuracy, BERT has limited performance on the test set. For our model, there is only a 3% gap between the training and test performance, which demonstrates that our model successfully learns to identify and compose the natural logic relations of the predicate replacements with limited training examples.
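The train/test split over quantifier and predicate-replacement combinations can be sketched as follows. The function name and the toy quantifier and replacement labels are hypothetical; the sketch only illustrates the set construction $D_{\{q\},R} \cup D_{Q,\{r\}}$ versus $D_{Q \setminus \{q\}, R \setminus \{r\}}$ and does not reproduce the actual dataset of Yanaka et al. (2020).

```python
from itertools import product


def compositional_split(quantifiers, replacements, train_quantifiers, train_replacements):
    """Split (quantifier, replacement) combinations.

    Train: combinations that use a selected quantifier or a selected replacement.
    Test: the complementary combinations, unseen during training.
    """
    train, test = [], []
    for q, r in product(quantifiers, replacements):
        if q in train_quantifiers or r in train_replacements:
            train.append((q, r))
        else:
            test.append((q, r))
    return train, test


Q = ["some", "no", "every"]
R = ["dog->animal", "small dog->dog"]
train, test = compositional_split(Q, R, {"some"}, {"dog->animal"})
print(train)
print(test)
```

Every quantifier and every replacement appears somewhere in `train`, yet none of the combinations in `test` do, which is what makes the evaluation a test of compositional generalization.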
We also compare our model to variants of BERT and GPT-2 that are aware of token projectivity (models with Ö in Table 7). Specifically, for each token, we concatenate its hidden state in the final transformer layer with its projectivity feature. We aggregate the concatenated features with multiple feed-forward layers and apply average pooling before sending them to the classification layer. The results show that BERT and GPT-2 do not benefit from the projectivity features. The test accuracy drops with additional adverbs and prepositional phrases, leaving room for future research on robustness to unseen perturbations.

Evaluation of Model Explainability
The proposed model provides built-in interpretability following natural logic: the execution of the programs $\{z_t\}_{t=1}^{m}$ (Eq. 6) provides an explanation along with the model's decision-making process, namely a faithful explanation (Jacovi and Goldberg, 2020). To evaluate the model interpretability, we derive the predicted rationales from the natural logic programs and compare them with the human annotations in e-SNLI (Camburu et al., 2018). Specifically, our model regards as rationales the hypothesis phrases $s_t$ that satisfy: (1) $z_t$ points to the final prediction according to the grouping described at the end of Sec. 4.2; (2) $z_t \neq z_{t-1}$. Following DeYoung et al. (2020), we use Intersection Over Union (IOU), formulated in Eq. 11, as the evaluation metric: the numerator is the number of shared tokens between the model-generated rationales and the gold rationales, and the denominator is the number of tokens in their union. We also compute finer-grained statistics over individual rationale phrases. Following DeYoung et al. (2020), a predicted rationale phrase $p$ matches an annotated rationale phrase $q$ when $\mathrm{IOU}(p, q) \geq 0.5$, and we use precision, recall, and F1 score to measure the phrasal agreement between the predicted rationales and the human annotations. We also invited 3 graduate students (not the authors of this paper) to evaluate the quality of the predicted rationales on the first 100 test samples in e-SNLI. Given the premise-hypothesis pair and the gold label, the evaluators judged the explanation as plausible if the predicted rationale (1) alone is sufficient to justify the label, and (2) does not include the whole hypothesis sentence.
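The token-level IOU and the phrase-level matching described above can be sketched as below. This is an illustrative reading of the metric, not the evaluation script used for Table 8: the function names are hypothetical, phrases are assumed to be represented as sets of token positions, and a phrase counts as matched if it reaches the 0.5 threshold against any phrase on the other side.

```python
def iou(pred_tokens, gold_tokens):
    """Shared tokens divided by tokens in the union (0.0 if both are empty)."""
    pred, gold = set(pred_tokens), set(gold_tokens)
    union = pred | gold
    return len(pred & gold) / len(union) if union else 0.0


def phrase_prf(pred_phrases, gold_phrases, threshold=0.5):
    """Phrase-level precision, recall and F1 under an IOU matching threshold."""
    matched_pred = sum(
        any(iou(p, g) >= threshold for g in gold_phrases) for p in pred_phrases
    )
    matched_gold = sum(
        any(iou(p, g) >= threshold for p in pred_phrases) for g in gold_phrases
    )
    precision = matched_pred / len(pred_phrases) if pred_phrases else 0.0
    recall = matched_gold / len(gold_phrases) if gold_phrases else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Two predicted phrases (token positions), one annotated phrase.
precision, recall, f1 = phrase_prf([{0, 1, 2}, {4, 5}], [{5}])
```

Here the predicted phrase {4, 5} matches the annotated phrase {5} with an IOU of exactly 0.5, while {0, 1, 2} matches nothing, so precision is 0.5 and recall is 1.0.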
From the perspective of natural logic, we follow Feng et al. (2020) to evaluate the quality of the natural logic programs. For each sample, the Natural Logic 2-hop dataset provides the gold program execution states, and we evaluate the accuracy of our predicted states $\hat{z}_t$ against the ground truth. We compare our model with the representative neural rationalization model proposed by Lei et al. (2016), which learns to extract rationales without direct supervision, and with Feng et al. (2020), which explains its prediction by generating natural logic reasoning paths. The summary statistics in Table 8 show that our model matches Lei et al. (2016) on the IOU score, and that it produces rationales with significantly higher precision and F1 scores on the e-SNLI test set. The superior rationalization performance is also supported by the human evaluation mentioned above (the 4th column in Table 8). Compared to Feng et al. (2020), our model produces intermediate natural logic states that better agree with the ground truth. The results in Table 8 show that the model explanation significantly benefits from the external knowledge (B0 vs. B1), and that the answer-driven revision alone does not improve the quality of the generated rationales (B1 vs. B2). We also compare our model to a system that replaces the unidirectional attention model GPT-2 with the bidirectional attention model BERT. The model with the BERT encoder yields significantly lower scores on interpretability (B0 vs. B4).

Case Study
The upper part of Fig. 2 shows how our natural logic model makes predictions during testing. The left example involves upward monotonicity. Upon seeing the premise and the first hypothesis phrase A biker rides, the model predicts the local relation as forward entailment ($r_1 =$ '$\sqsubset$') at time step $t = 1$. The predicted relation stays unchanged after applying the projection function, $\rho($'$\sqsubset$'$) =$ '$\sqsubset$', because it is in an upward monotone context. According to Table 3, we have $z_1 = z_0 \otimes r_1 =$ '$\sqsubset$'. Similarly, the second prediction, for the phrase next to, is the relation equivalence ($r_2 =$ '$\equiv$'), which does not change the reasoning state because $z_2 = z_1 \otimes r_2 =$ '$\sqsubset$'. The third hypothesis phrase, the ocean, is a concept distinct from a fountain in the premise, so our model outputs the relation alternation ($r_3 =$ '$|$') and we have $z_3 = z_2 \otimes r_3 =$ '$|$'. The model runs out of hypothesis phrases after 3 steps and reaches contradiction according to the final state $z_3$.
The right example involves downward monotonicity: the first argument that follows the negation did not is in a downward monotone context, i.e., $\rho($'$\sqsubset$'$) =$ '$\sqsupset$'.
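The execution traced in the left example can be replayed with a small sketch. It is not the model's implementation: the composition table below contains only the three entries used in the case study (the full table is Table 3), the initial state $z_0$ is assumed to be equivalence, and the projection covers only the forward-entailment cases stated in the text, with an identity fallback added as a simplifying assumption. The names `COMPOSE`, `project`, and `run_program` are hypothetical.

```python
# Partial composition table: only the entries used in the case study.
COMPOSE = {
    ("≡", "⊏"): "⊏",  # z_1 = z_0 ⊗ r_1, assuming z_0 is equivalence
    ("⊏", "≡"): "⊏",  # z_2 = z_1 ⊗ r_2
    ("⊏", "|"): "|",  # z_3 = z_2 ⊗ r_3
}

# Projection cases stated in the text for forward entailment.
PROJECT = {
    ("upward", "⊏"): "⊏",
    ("downward", "⊏"): "⊐",
}


def project(relation, context):
    """Apply the projection; unlisted cases fall back to identity (assumption)."""
    return PROJECT.get((context, relation), relation)


def run_program(local_relations, contexts, z0="≡"):
    """Return the state sequence [z_0, z_1, ..., z_m] of a natural logic program."""
    states = [z0]
    for relation, context in zip(local_relations, contexts):
        projected = project(relation, context)
        # A missing table entry raises KeyError rather than guessing a result.
        states.append(COMPOSE[(states[-1], projected)])
    return states


# Left example: A biker rides / next to / the ocean, all in upward contexts.
trace = run_program(["⊏", "≡", "|"], ["upward"] * 3)
```

The final state of `trace` is alternation, which the model maps to contradiction. Leaving the table partial is deliberate: any composition not spelled out in the case study fails loudly instead of being filled in by assumption.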
At the bottom of Fig. 2, we provide examples of the reasoning processes produced by the natural logic model built upon the bidirectional attention model BERT. Although it produces the same final labels as our proposed model, the BERT-based model can predict wrong local relations due to its entangling effect. Specifically, the model with bidirectional attention tends to make its final decision at the first step (82% of the cases in the human evaluation), and then predicts local relations that preserve the initial decision during program execution (according to the composition rules in Table 3). In the first example in Fig. 2, to keep the first predicted relation alternation ($|$) unchanged during execution, the model subsequently predicts a series of equivalence ($\equiv$) relations. In the second example, the model predicts the local relation forward entailment ($\sqsubset$) for each hypothesis phrase, and at the last step, the forward entailment ($\sqsubset$) relation is projected to reverse entailment ($\sqsupset$) according to the projectivity.

Summary
The proposed neuro-symbolic framework integrates the long-studied natural logic with reinforcement learning and introspective revision, effectively rewarding the intermediate proof paths and leveraging external knowledge to alleviate spurious reasoning. The model has built-in interpretability following natural logic, which allows for a wide range of intuitive inferences easily understandable by humans. Experimental results show the model's superior capability in monotonicity-based inferences and systematic generalization, compared to previous models on the existing datasets, while the model keeps competitive performance on the generic SNLI test set.

Figure 1: An overview of the proposed neuro-symbolic natural logic framework.
database for the possible natural logic relations for the phrase pair $\langle s, s \rangle$:
1. Equivalence ($\equiv$): s = s or s $\subset$ s;
2. Forward Entailment ($\sqsubset$): s $\subset$ s or s = s or $\exists$ u $\in$ s, v $\in$ s and u is a hypernym of v;
3. Reverse Entailment ($\sqsupset$): s $\subset$ s or $\exists$ u $\in$ s, v $\in$ s and v is a hypernym of u;
4. Alternation ($|$): $\exists$ u $\in$ s, v $\in$ s and u is an antonym of v;

(4) P: No dogs run $\Rightarrow$ H: No small dogs run
(5) P: Near the shore, no dogs run $\Rightarrow$ H: Near the shore, no small dogs run

Figure 2: Examples of predictions and explanations for some cases from SNLI (left) and MoNLI (right).

Table 1: A set B of seven natural logic relations proposed by

Table 2: The projection function $\rho$ maps input relations to output relations under different contexts (here, different surrounding quantifiers).

Table 3
(Bowman et al., 2015): It can be hard to learn the reverse entailment ($\sqsupset$) relation from the existing NLI datasets because a premise-hypothesis pair is labeled as neutral if H entails P and P does not entail H. To help the model distinguish reverse entailment ($\sqsupset$) from independence ($\#$), both of which result in the NLI label neutral, we perform relation augmentation to create samples whose hypothesis entails the premise. Specifically, for each sample that is originally labeled as entailment in the training set, we create an augmented sample by exchanging the premise and the hypothesis. Note that we avoid augmenting the cases where P and H mutually entail each other, because the new premise still entails the hypothesis after the exchange. To achieve this, we exclude an exchanged sample from relation augmentation if it is still identified as entailment by a pretrained model finetuned on SNLI (Bowman et al., 2015). For the augmented samples, the program receives a positive reward during training if and only if it reaches the correct final state reverse entailment ($\sqsupset$).
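The relation augmentation described above can be sketched as follows. This is a minimal illustration under stated assumptions: `still_entails` is a hypothetical callable standing in for the pretrained model finetuned on SNLI, samples are assumed to be (premise, hypothesis, label) triples, and the target name `reverse_entailment` is an illustrative label for the final state the program must reach.

```python
def augment_reverse_entailment(samples, still_entails):
    """Create swapped samples whose hypothesis entails the premise.

    samples: iterable of (premise, hypothesis, label) triples.
    still_entails: callable(premise, hypothesis) -> bool; a stand-in for an
        NLI classifier, used to filter out mutual entailment.
    """
    augmented = []
    for premise, hypothesis, label in samples:
        if label != "entailment":
            continue  # only entailment samples are swapped
        new_premise, new_hypothesis = hypothesis, premise
        if still_entails(new_premise, new_hypothesis):
            continue  # P and H entail each other; skip this pair
        augmented.append((new_premise, new_hypothesis, "reverse_entailment"))
    return augmented


# Toy classifier: treats only identical sentences as (mutual) entailment.
toy_samples = [
    ("Some small dogs run", "Some dogs run", "entailment"),
    ("A man sleeps", "A man sleeps", "entailment"),
    ("A man sleeps", "A man runs", "contradiction"),
]
augmented = augment_reverse_entailment(toy_samples, lambda p, h: p == h)
```

Only the first toy sample yields an augmented pair: the second is filtered as mutual entailment and the third is not an entailment sample.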
… $\sum_{t=1}^{m} \log p_t[r_t] \cdot R_t$, (8)
where $p_t[r_t]$ is the probability that corresponds to the sampled relation $r_t$. At test time, the model picks the relation with the largest probability.
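The roles of $p_t[r_t]$ and $R_t$ can be made concrete with a short sketch. It assumes that the per-step terms $\log p_t[r_t] \cdot R_t$ are summed over the steps of a program, that relations are sampled from $p_t$ during training, and that each $p_t$ is given as a dictionary from relation to probability; the function names are hypothetical and no gradient computation is shown.

```python
import math
import random


def sample_relation(probs, rng):
    """Training: sample a relation r_t from the distribution p_t."""
    relations = list(probs)
    weights = [probs[r] for r in relations]
    return rng.choices(relations, weights=weights, k=1)[0]


def greedy_relation(probs):
    """Test time: pick the relation with the largest probability."""
    return max(probs, key=probs.get)


def reward_weighted_log_likelihood(step_probs, sampled, rewards):
    """Sum over steps of log p_t[r_t] * R_t (summation is an assumption)."""
    return sum(
        math.log(p[r]) * reward
        for p, r, reward in zip(step_probs, sampled, rewards)
    )


step_probs = [{"⊏": 0.5, "≡": 0.5}, {"⊏": 0.25, "|": 0.75}]
objective = reward_weighted_log_likelihood(step_probs, ["⊏", "|"], [1.0, 1.0])
drawn = sample_relation(step_probs[1], random.Random(0))
best = greedy_relation(step_probs[1])
```

With unit rewards the objective reduces to the log-likelihood of the sampled path, so positive rewards push probability mass toward the sampled relations and negative rewards push it away.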

Table 4: The percentage of samples being revised and the revision success rate at the start/end of the training.

Table 5: The average number of triplet proposals obtained from WordNet per sample and the average number of proposals accepted by knowledge-driven or answer-driven revision at the start/end of the training.

Table 6: Model accuracy on multiple challenging test datasets. All models are trained on SNLI and the results of model (

Table 7: Results for compositional generalization; Ö marks the models with polarity features.

Table 8: Evaluation of the model-generated explanations.