## Abstract

This paper presents *Diff-*Explainer, the first *hybrid* framework for explainable *multi-hop inference* that integrates explicit constraints with neural architectures through differentiable convex optimization. Specifically, *Diff-*Explainer allows for the fine-tuning of neural representations within a constrained optimization framework to answer and explain multi-hop questions in natural language. To demonstrate the efficacy of the hybrid framework, we combine existing ILP-based solvers for multi-hop Question Answering (QA) with Transformer-based representations. An extensive empirical evaluation on scientific and commonsense QA tasks demonstrates that the integration of explicit constraints in an end-to-end differentiable framework can significantly improve the performance of non-differentiable ILP solvers (8.91%–13.3%). Moreover, additional analysis reveals that *Diff-*Explainer achieves strong performance when compared to standalone Transformers and previous multi-hop approaches while still providing structured explanations in support of its predictions.

## 1 Introduction

Explainable Question Answering (QA) in complex domains is often modeled as a *multi-hop inference* problem (Thayaparan et al., 2020; Valentino et al., 2021; Jansen et al., 2021). In this context, the goal is to answer a given question through the construction of an explanation, typically represented as a graph of multiple interconnected sentences supporting the answer (Figure 1) (Khashabi et al., 2018; Jansen, 2018; Kundu et al., 2019; Thayaparan et al., 2021).

However, explainable QA models exhibit lower performance when compared to state-of-the-art approaches, which are generally represented by Transformer-based architectures (Khashabi et al., 2020; Devlin et al., 2019; Khot et al., 2020). While Transformers are able to achieve high accuracy due to their ability to transfer linguistic and semantic information to downstream tasks, they are typically regarded as black-boxes (Liang et al., 2021), posing concerns about the interpretability and transparency of their predictions (Rudin, 2019; Guidotti et al., 2018).

To alleviate the aforementioned limitations and find a better trade-off between explainability and inference performance, this paper proposes *Diff-*Explainer (*∂*-Explainer), a novel *hybrid* framework for multi-hop and explainable QA that combines constraint satisfaction layers with pre-trained neural representations, enabling end-to-end differentiability.

Recent works have shown that certain convex optimization problems can be represented as individual layers in larger end-to-end differentiable networks (Agrawal et al., 2019a, b; Amos and Kolter, 2017), demonstrating that these layers can be adapted to encode constraints and dependencies between hidden states that are hard to capture via standard neural networks.

In this paper, we build upon this line of research, showing that convex optimization layers can be integrated with Transformers to improve explainability and robustness in multi-hop inference problems. To illustrate the impact of end-to-end differentiability, we integrate the constraints of existing ILP solvers (i.e., TupleILP [Khot et al., 2017], ExplanationLP [Thayaparan et al., 2021]) into a hybrid framework. Specifically, we propose a methodology to transform existing constraints into differentiable convex optimization layers and subsequently integrate them with pre-trained sentence embeddings based on Transformers (Reimers et al., 2019).

To evaluate the proposed framework, we perform extensive experiments on complex multiple-choice QA tasks requiring scientific and commonsense reasoning (Clark et al., 2018; Xie et al., 2020). In summary, the contributions of the paper are as follows:

- (1) A novel differentiable framework for multi-hop inference that incorporates constraints via convex optimization layers into broader Transformer-based architectures.
- (2) An extensive empirical evaluation demonstrating that the proposed framework allows end-to-end differentiability on downstream QA tasks for both explanation and answer selection, leading to a substantial improvement when compared to non-differentiable constraint-based and transformer-based approaches.
- (3) We demonstrate that *Diff-*Explainer is more robust to distracting information in addressing multi-hop inference when compared to Transformer-based models.

## 2 Related Work

##### Constraint-Based Multi-hop QA Solvers

ILP has been employed to model structural and semantic constraints to perform multi-hop QA. TableILP (Khashabi et al., 2016) is one of the earliest approaches to formulate the construction of explanations as an optimal sub-graph selection problem over a set of structured tables and evaluated on multiple-choice elementary science question answering. In contrast to TableILP, TupleILP (Khot et al., 2017) was able to perform inference over free-form text by building semi-structured representations using Open Information Extraction. SemanticILP (Khashabi et al., 2018) also comes from the same family of solvers that leveraged different semantic abstractions, including semantic role labelling, named entity recognition and lexical chunkers for inference. In contrast to previous approaches, Thayaparan et al. (2021) proposed the ExplanationLP model that is optimized towards answer selection via Bayesian optimization. ExplanationLP was limited to fine-tuning only nine parameters as it is intractable to finetune large models using Bayesian optimization.

##### Hybrid Reasoning with Transformers

A growing line of research focuses on adopting Transformers for interpretable reasoning over text (Clark et al., 2021; Gontier et al., 2020; Saha et al., 2020; Tafjord et al., 2021). For example, Saha et al. (2020) introduced PROVER, an interpretable transformer-based model that jointly answers binary questions over rules while generating the corresponding proofs. These models are related to the proposed framework in that they explore hybrid architectures. However, they assume that the rules are fully available in the context and are still mostly applied to synthetically generated datasets. In this paper, we take a step forward in this direction by proposing a hybrid model for scientific and commonsense QA, which requires the construction of complex explanations through multi-hop inference on external knowledge bases.

##### Differentiable Convex Optimization Layers

Our work is in line with previous works that have attempted to incorporate optimization as a neural network layer. These works have introduced differentiable modules for quadratic problems (Donti et al., 2017; Amos and Kolter, 2017), satisfiability solvers (Wang et al., 2019), and submodular optimizations (Djolonga and Krause, 2017; Tschiatschek et al., 2018). Recent works also offer differentiation through convex cone programs (Busseti et al., 2019; Agrawal et al., 2019a). In this work, we use the differentiable convex optimization layers proposed by Agrawal et al. (2019b). These layers provide a way to abstract away from the conic form, letting users define convex optimization in natural syntax. The defined convex optimization problem is converted by the layers into a conic form to be solved by a conic solver (O’Donoghue, 2021).

## 3 Multi-hop Question Answering via Differentiable Convex Optimization

The problem of Explainable Multi-Hop Question Answering can be stated as follows:

Definition (*Explanations in Multi-Hop Question Answering*).

Given a question *Q*, an answer *a* and a knowledge base *F*_{kb} (composed of natural language sentences), we say that we may *infer* hypothesis *h* (where hypothesis *h* is the concatenation of *Q* with *a*) if there exists a subset *F*_{exp} = {*f*_{1},*f*_{2},…} ⊆ *F*_{kb} of supporting facts which would allow a human being to deduce *h* from {*f*_{1},*f*_{2},…}. We call this set of facts an *explanation* for *h*.

Given a question (*Q*) and a set of candidate answers *C* = {*c*_{1},*c*_{2}, *c*_{3},…, *c*_{n}}, ILP-based approaches (Khashabi et al., 2016; Khot et al., 2017; Thayaparan et al., 2021) convert them into a list of hypotheses *H* = {*h*_{1},*h*_{2}, *h*_{3}, …,*h*_{n}} by concatenating the question with each candidate answer. For each hypothesis *h*_{i}, these approaches typically adopt a retrieval model (e.g., BM25, FAISS (Johnson et al., 2017)) to select a list of candidate explanatory facts *F* = {*f*_{1},*f*_{2}, *f*_{3},…, *f*_{k}}, and construct a weighted graph *G* = (*V*,*E*,*W*) with edge weights $W: E \rightarrow \mathbb{R}$, where *V* = {*h*_{i}} ∪ *F* and the weight *W*_{ik} of each edge *E*_{ik} denotes how relevant a fact *f*_{k} is with respect to the hypothesis *h*_{i}.

Based on these definitions, ILP-based QA can be defined as follows:

Definition (*ILP-Based Multi-Hop QA*).

Find a subset *V*^{*} ⊆ *V* with *h* ∈ *V*^{*} and *V*^{*}∖{*h*} = *F*_{exp}, and a subset *E*^{*} ⊆ *E*, such that the induced subgraph *G*^{*} = (*V*^{*},*E*^{*}) is connected, its weight $W[G^{*} = (V^{*}, E^{*})] := \sum_{e \in E^{*}} W(e)$ is maximal, and it adheres to a set of constraints *M*_{c} designed to emulate multi-hop inference. The hypothesis *h*_{i} with the highest subgraph weight *W*[*G*^{*} = (*V*^{*},*E*^{*})] is selected as the correct answer *c*_{ans}.

ILP-based inference faces two main challenges in producing convincing explanations. The first is designing edge weights *W* that quantify how relevant each fact is to the hypothesis. The second is defining constraints that emulate the multi-hop inference process.
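To make the definition concrete, the following minimal Python sketch solves a toy instance by brute-force enumeration rather than with an ILP solver. The graph, the edge weights, and the fact limit `max_facts` are illustrative assumptions, and connectivity is the only constraint checked; the sketch does not implement the constraint set *M*_{c} of TupleILP or ExplanationLP.

```python
from itertools import combinations

def subgraph_weight(nodes, weights):
    """Sum the weights of all edges whose two endpoints are both selected."""
    return sum(w for (u, v), w in weights.items() if u in nodes and v in nodes)

def is_connected(nodes, weights):
    """Check that the selected nodes form a single connected component."""
    nodes = set(nodes)
    start = next(iter(nodes))
    seen, frontier = {start}, [start]
    while frontier:
        cur = frontier.pop()
        for u, v in weights:
            if cur not in (u, v):
                continue
            other = v if u == cur else u
            if other in nodes and other not in seen:
                seen.add(other)
                frontier.append(other)
    return seen == nodes

def best_explanation(hypothesis, facts, weights, max_facts=2):
    """Enumerate fact subsets; return (weight, facts) of the heaviest connected subgraph."""
    best = (float("-inf"), ())
    for size in range(1, max_facts + 1):
        for subset in combinations(facts, size):
            nodes = {hypothesis, *subset}
            if is_connected(nodes, weights):
                best = max(best, (subgraph_weight(nodes, weights), subset))
    return best

# Hypothetical edge weights: two hypotheses sharing three candidate facts.
weights = {
    ("h1", "f1"): 0.9, ("h1", "f2"): 0.2, ("f1", "f3"): 0.7,
    ("h2", "f2"): 0.5, ("h2", "f3"): 0.1,
}
facts = ["f1", "f2", "f3"]
scores = {h: best_explanation(h, facts, weights) for h in ("h1", "h2")}
answer = max(scores, key=lambda h: scores[h][0])
print(answer, scores[answer])
```

The enumeration is exponential in the number of facts, which is the scaling problem that motivates the relaxations discussed later in this section.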

### 3.1 Limitations with Existing ILP formulations

In previous work, the construction of the graph *G* requires predetermined edge weights based on lexical overlaps (Khot et al., 2017) or semantic similarity using sentence embeddings (Thayaparan et al., 2021), on top of which combinatorial optimization strategies are performed separately. Among these approaches, ExplanationLP (Thayaparan et al., 2021) is the only one that modifies the graph weight function, fine-tuning the weight parameters *θ* for inference via Bayesian optimization over pre-trained embeddings.

In contrast, we posit that learning the graph weights dynamically by fine-tuning the underlying neural embeddings towards answer and explanation selection will lead to more accurate and robust performance. To this end, the constraint optimization strategy should be differentiable and efficient. However, Integer Linear Programming based approaches present two critical shortcomings that prevent achieving this goal:

- (1) The Integer Linear Programming formulation operates with discrete inputs/outputs, resulting in *non-differentiability* (Paulus et al., 2021). Consequently, it cannot be integrated with deep neural networks and trained end-to-end. Making ILP differentiable requires non-trivial assumptions and approximations (Paulus et al., 2021).
- (2) Integer Programming is known to be NP-complete, with the special case of 0-1 integer linear programming being one of Karp’s 21 NP-complete problems (Karp, 1972). Therefore, as the size of the combinatorial optimization problem increases, finding exact solutions becomes computationally intractable. This intractability is a strong limitation for multi-hop QA in general since these systems typically operate on large knowledge bases and corpora.

### 3.2 Subgraph Selection via Semi-Definite Programming

Differentiable convex optimization (DCX) layers (Agrawal et al., 2019b) provide a way to encode constraints as part of a deep neural network. However, an ILP formulation is non-convex (Wolsey, 2020; Schrijver, 1998) and cannot be incorporated into a differentiable convex optimization layer. The challenge is to approximate ILP with convex optimization constraints.

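In order to alleviate this problem, we turn to *Semi-Definite programming* (SDP) (Vandenberghe and Boyd, 1996). SDP is non-linear but convex and has been shown to efficiently approximate combinatorial problems. In standard form, an SDP can be written as:

$$\begin{aligned} \text{minimize} \quad & \operatorname{tr}(CX) \\ \text{subject to} \quad & \operatorname{tr}(A_i X) = b_i, \quad i = 1, \ldots, p \\ & X \succeq 0 \end{aligned}$$

Here $X \in \mathbb{S}^n$ is the optimization variable, $C, A_1, \ldots, A_p \in \mathbb{S}^n$, and $b_1, \ldots, b_p \in \mathbb{R}$. $X \succeq 0$ is a matrix inequality, and $\mathbb{S}^n$ denotes the set of *n* × *n* symmetric matrices.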

SDP is often used as a convex approximation of traditional NP-hard combinatorial graph optimization problems, such as the max-cut problem, the dense k-subgraph problem and the quadratic {0 − 1} programming problem (Lovász and Schrijver, 1991).

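Specifically, we consider the quadratic {0,1} problem, which for a weight matrix *W* can be written as:

$$\begin{aligned} \text{maximize} \quad & y^{T} W y \\ \text{subject to} \quad & y \in \{0,1\}^{n} \end{aligned}$$

Here *W* is the edge weight matrix of the graph *G*, and the optimal solution of this problem, $\hat{y}$, indicates whether a node is part of the induced subgraph *G*^{*}.

Instead of vectors $y \in \{0,1\}^{n}$, we optimize over the set of *positive semidefinite matrices* satisfying the SDP constraint in the following relaxed convex optimization problem:^{1}

$$\begin{aligned} \text{maximize} \quad & \langle W, Y \rangle \\ \text{subject to} \quad & Y - \operatorname{diag}(Y)\operatorname{diag}(Y)^{T} \succeq 0 \end{aligned}$$

The semidefinite constraint relaxes the conditions $Y = yy^{T}$ and $\operatorname{diag}(Y) = y$.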

The optimal solution for *Y* in this problem, $\hat{E} \in [0,1]$, indicates whether an edge is part of the subgraph *G*^{*}. In addition to the semi-definite constraints, we also impose multi-hop inference constraints *M*_{c}. These constraints are introduced in Section 3.4 and the Appendix.

This reformulation provides the tightest approximation for the optimization with the convex constraints. Since this formulation is convex, we can now integrate it with differentiable convex optimization layers. Moreover, the semi-definite program relaxation can be solved by adopting the interior-point method (De Klerk, 2006; Vandenberghe and Boyd, 1996) which has been proved to run in polynomial time (Karmarkar, 1984). To the best of our knowledge, we are the first to employ SDP to solve a natural language processing task.
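The lifting step behind this relaxation can be checked numerically on a tiny instance. The sketch below is a brute-force illustration in plain Python, not an SDP solver: it verifies that for every binary vector *y* the quadratic objective equals the inner product ⟨*W*, *yy*^{T}⟩ and that the diagonal of *yy*^{T} recovers *y*. The matrix `W` is a hypothetical example.

```python
from itertools import product

def quad_form(W, y):
    """Compute the quadratic objective y^T W y."""
    n = len(y)
    return sum(y[i] * W[i][j] * y[j] for i in range(n) for j in range(n))

def lift(y):
    """Build the rank-one matrix Y = y y^T."""
    return [[yi * yj for yj in y] for yi in y]

def inner(W, Y):
    """Compute the matrix inner product <W, Y> = sum_ij W_ij * Y_ij."""
    n = len(W)
    return sum(W[i][j] * Y[i][j] for i in range(n) for j in range(n))

# Hypothetical symmetric edge-weight matrix over three nodes.
W = [[0, 2, -1],
     [2, 0, 3],
     [-1, 3, 0]]

for y in product((0, 1), repeat=3):
    Y = lift(y)
    assert quad_form(W, y) == inner(W, Y)         # same objective value
    assert tuple(Y[i][i] for i in range(3)) == y  # diag(Y) = y for binary y

best_y = max(product((0, 1), repeat=3), key=lambda y: quad_form(W, y))
print(best_y, quad_form(W, best_y))
```

Because the objective is linear in *Y*, replacing the rank-one requirement with a semidefinite constraint leaves a convex problem, which is what allows it to be used inside a differentiable convex optimization layer.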

### 3.3 *Diff-*Explainer: End-to-End Differentiable Architecture

*Diff-*Explainer is an end-to-end differentiable architecture that simultaneously solves the constraint optimization problem and dynamically adjusts the graph edge weights for better performance. We adopt *differentiable convex optimization* for the optimal subgraph selection problem. The complete architecture and setup are described in the subsequent subsections and Figure 2.

We transform a multiple-choice question answering dataset into a multi-hop QA dataset by converting each example’s question (*q*) and set of candidate answers *C* = {*c*_{1},*c*_{2}, *c*_{3}, …, *c*_{n}} into hypotheses *H* = {*h*_{1},*h*_{2}, *h*_{3}, …, *h*_{n}} (see Figure 2A) using the approach proposed by Demszky et al. (2018). To build the initial graph, for the hypothesis set *H* we adopt a retrieval model to select a list of candidate explanatory facts *F* = {*f*_{1},*f*_{2}, *f*_{3}, …, *f*_{k}} and construct a weighted complete bipartite graph *G* = (*H*,*F*, *E*, *W*), where the weight *W*_{ik} of each edge *E*_{ik} denotes how relevant a fact *f*_{k} is with respect to the hypothesis *h*_{i}. Departing from traditional ILP approaches (Thayaparan et al., 2021; Khashabi et al., 2016, 2018), the aim is to select the correct answer *c*_{ans} and relevant explanations *F*_{exp} with a single graph.

In order to demonstrate the impact of *Diff-*Explainer, we reproduce the formalization introduced by previous ILP solvers. Specifically, we approximate the following two solvers:

- **TupleILP** (Khot et al., 2017): TupleILP constructs a semi-structured knowledge base using tuples extracted via Open Information Extraction (OIE) and performs inference over them. For example, in Figure 2A, *F*_{1} will be decomposed into *(a stick; is a; object)* and the subject (*a stick*) will be connected to the hypothesis to enforce constraints and build the subgraph.
- **ExplanationLP** (Thayaparan et al., 2021): ExplanationLP classifies facts into abstract and grounding facts. Abstract facts are core scientific statements. Grounding facts help connect the generic terms in the abstract facts to the terms in the hypothesis. For example, in Figure 2A, *F*_{1} is a grounding fact and helps to connect the hypothesis with the abstract fact *F*_{7}. The approach aims to emulate abstract reasoning.

Further details of these approaches can be found in the Appendix.

To demonstrate the impact of integrating a convex optimization layer into a broader end-to-end neural architecture, *Diff-*Explainer employs a transformer-based sentence embedding model. Figure 2B describes the end-to-end architectural diagram of *Diff-*Explainer. Specifically, we incorporate a differentiable convex optimization layer with Sentence-Transformer (STrans) (Reimers et al., 2019), which has demonstrated state-of-the-art performance on semantic sentence similarity benchmarks.

STrans is adopted to estimate the relevance between hypotheses and facts during the construction of the initial graph. We use STrans as a bi-encoder architecture to minimize the computational overhead and operate on a large number of sentences. The semantic relevance score from STrans is complemented with a lexical relevance score computed from the terms shared between hypotheses and facts. We calculate semantic and lexical relevance as follows:

##### Semantic Relevance (*s*):

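Given a hypothesis *h*_{i} and a fact *f*_{j}, we compute the sentence vectors $\vec{h_i} = STrans(h_i)$ and $\vec{f_j} = STrans(f_j)$ and calculate the semantic relevance score using cosine similarity as follows:

$$s_{ij} = \frac{\vec{h_i} \cdot \vec{f_j}}{\lVert \vec{h_i} \rVert \, \lVert \vec{f_j} \rVert}$$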
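As a small illustration of the cosine-similarity score, the following Python sketch operates on plain lists of floats. The three-dimensional vectors are hypothetical stand-ins for the STrans sentence embeddings, which in practice would come from the encoder.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical stand-ins for the hypothesis and fact sentence vectors.
h_vec = [1.0, 2.0, 2.0]
f_vec = [2.0, 1.0, 2.0]
s = cosine_similarity(h_vec, f_vec)
print(round(s, 4))
```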

##### Lexical Relevance (*l*):

The lexical relevance score of hypothesis *h*_{i} and fact *f*_{j} is given by the percentage of overlaps between their unique terms, where the function *trm* extracts the lemmatized set of unique terms from the given text.
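A minimal Python sketch of such an overlap score is given below. Two things in it are assumptions rather than details taken from the text: the overlap is normalized by the size of the larger term set, and `trm` is replaced by a toy tokenizer that lowercases, splits on whitespace, and drops a few stopwords without lemmatizing.

```python
STOPWORDS = frozenset({"a", "an", "the", "is", "of", "to"})

def trm(text):
    """Toy stand-in for trm: unique lowercase tokens minus stopwords (no lemmatization)."""
    return {tok for tok in text.lower().split() if tok not in STOPWORDS}

def lexical_relevance(hypothesis, fact):
    """Shared unique terms divided by the size of the larger term set (assumed normalizer)."""
    h_terms, f_terms = trm(hypothesis), trm(fact)
    if not h_terms or not f_terms:
        return 0.0
    return len(h_terms & f_terms) / max(len(h_terms), len(f_terms))

score = lexical_relevance("a stick is an object", "an object is a kind of thing")
print(round(score, 4))
```

Here the hypothesis yields the terms {stick, object} and the fact yields {object, kind, thing}, so one shared term out of a larger set of three gives a score of one third.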

Given these relevance scores, the edge weight matrix (*W*) is constructed as a weighted combination of them. The relevance scores are weighted by weight parameters (*θ*), with *θ* clamped to [0,1]. $s_{ij}^{D_k}$ (or $l_{ij}^{D_k}$) is *s*_{ij} (or *l*_{ij}) if (*i*,*j*) satisfies condition $D_k$ and 0 otherwise. TupleILP has two weights each for lexical and semantic relevance. Meanwhile, ExplanationLP has nine weights based on the type of fact and relevance type. Additional details on how to calculate *W* for each approach can be found in the Appendix.
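One plausible reading of this weighting scheme is a sum over conditions, where each condition contributes its clamped *θ* times the corresponding relevance score only when the node pair satisfies it. The Python sketch below follows that reading; the exact functional form, the function names, and the example conditions are assumptions, and the per-solver definitions are those of the Appendix.

```python
def clamp(x, lo=0.0, hi=1.0):
    """Restrict a weight parameter to the interval [lo, hi]."""
    return max(lo, min(hi, x))

def edge_weight(s_ij, l_ij, active, theta_s, theta_l):
    """Weighted sum of relevance scores; active[k] is True if pair (i, j) satisfies D_k."""
    total = 0.0
    for k, is_active in enumerate(active):
        if is_active:
            total += clamp(theta_s[k]) * s_ij + clamp(theta_l[k]) * l_ij
    return total

# Two hypothetical conditions, e.g. D_0 = hypothesis-fact edge, D_1 = fact-fact edge.
w = edge_weight(s_ij=0.8, l_ij=0.5, active=[True, False],
                theta_s=[0.5, 0.9], theta_l=[1.7, 0.2])
print(round(w, 4))
```

In the example, the out-of-range parameter 1.7 is clamped to 1.0 and the second condition is inactive, so the weight is 0.5 · 0.8 + 1.0 · 0.5.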

### 3.4 Answer and Explanation Selection

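Given the edge variable *Y* and node variable *y* (*diag*(*Y*) = *y*; see Section 3.2), where 1 means the edge/node is part of the subgraph and 0 otherwise, the answer selection constraint is defined over the hypothesis nodes so that the selected subgraph commits to a single candidate answer. The number of facts selected for the explanation is limited to *m*.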

Apart from these functional constraints, ILP-based methods also impose semantic and structural constraints. For instance, ExplanationLP places explicit grounding-abstract fact chain constraints to perform efficient abstractive reasoning, and TupleILP enforces constraints to leverage the SPO structure to align and select facts. See the Appendix for how these constraints are designed and imposed within *Diff-*Explainer.
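The functional constraints can be illustrated as a feasibility check on a discrete node assignment. The sketch below assumes that they amount to selecting exactly one hypothesis node and at most *m* fact nodes; this is an interpretation for illustration, and the function and variable names are hypothetical.

```python
def satisfies_functional_constraints(node_selection, hypotheses, facts, m=2):
    """Check a 0/1 node assignment: exactly one hypothesis and at most m facts selected."""
    chosen_hypotheses = sum(node_selection[h] for h in hypotheses)
    chosen_facts = sum(node_selection[f] for f in facts)
    return chosen_hypotheses == 1 and chosen_facts <= m

hypotheses = ["h1", "h2"]
facts = ["f1", "f2", "f3"]

feasible = satisfies_functional_constraints(
    {"h1": 1, "h2": 0, "f1": 1, "f2": 0, "f3": 1}, hypotheses, facts)
too_many_facts = satisfies_functional_constraints(
    {"h1": 1, "h2": 0, "f1": 1, "f2": 1, "f3": 1}, hypotheses, facts)
two_answers = satisfies_functional_constraints(
    {"h1": 1, "h2": 1, "f1": 1, "f2": 0, "f3": 0}, hypotheses, facts)
print(feasible, too_many_facts, two_answers)
```

In the relaxed problem the same conditions would be expressed as linear constraints on the diagonal of *Y* rather than checked after the fact.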

For training, we compute the loss *l*_{1} between the selected answer and the correct answer *c*_{ans} for the answer loss ℒ_{ans}, and the loss *l*_{b} between the selected explanatory facts and the true explanatory facts *F*_{exp} for the explanatory loss ℒ_{exp}.

We add the losses to backpropagate to learn the *θ* weights and fine-tune the sentence transformers. The pseudo-code to train *Diff-*Explainer end-to-end is summarized in Algorithm 1.
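The combined objective can be sketched in plain Python as follows. The sketch assumes that *l*_{1} denotes a mean absolute error and *l*_{b} a binary cross-entropy, and it uses hypothetical relaxed node scores in [0,1] in place of the output of the optimization layer; gradient computation is left to an automatic differentiation framework and is not shown.

```python
import math

def l1_loss(pred, target):
    """Mean absolute error between predicted and target values."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy; predictions are clipped away from 0 and 1."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(pred)

# Hypothetical relaxed scores for three hypotheses and three facts.
hyp_scores, gold_answer = [0.7, 0.2, 0.1], [1, 0, 0]
fact_scores, gold_facts = [0.9, 0.4, 0.1], [1, 0, 0]

loss_ans = l1_loss(hyp_scores, gold_answer)
loss_exp = bce_loss(fact_scores, gold_facts)
total_loss = loss_ans + loss_exp  # the two losses are simply added
print(round(loss_ans, 4), round(loss_exp, 4), round(total_loss, 4))
```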

## 4 Empirical Evaluation

##### Question Sets:

We use the following multiple-choice question sets to evaluate *Diff-*Explainer.

##### (1) WorldTree Corpus

(Xie et al., 2020): The 4,400 questions and explanations in the WorldTree corpus are split into three different subsets: *train-set*, *dev-set*, and *test-set*. We use the *dev-set* to assess the explainability performance since the explanations for the *test-set* are not publicly available.

##### (2) ARC-Challenge Corpus

(Clark et al., 2018): ARC-Challenge is a multiple-choice question dataset which consists of questions from science exams from grade 3 to grade 9. These questions have proven to be challenging to answer for other LP-based question answering and neural approaches.

##### Experimental Setup:

We use the *all-mpnet-base-v2* model as the Sentence Transformer model for sentence representation in *Diff-*Explainer. We chose this model because it is pre-trained on natural language inference and because MPNet_{Base} (Song et al., 2020) is smaller than large models such as BERT_{Large}, enabling us to encode a larger number of facts. Similarly, for fact retrieval, we use *all-mpnet-base-v2* trained with the gold explanations of the WorldTree corpus, which achieves a Mean Average Precision of 40.11 on the dev-set. We cache all the facts from the background knowledge and retrieve the top *k* facts using MIPS retrieval (Johnson et al., 2017). We follow a setting similar to that of Thayaparan et al. (2021) for the background knowledge base, combining over 5000 abstract facts from the WorldTree table store (WTree) and over 100,000 *is-a* grounding facts from ConceptNet (CNet) (Speer et al., 2016). Furthermore, we also set *m* = 2 in line with the previous configurations from TupleILP and ExplanationLP.^{2}

##### Baselines:

In order to assess the complexity of the task and the potential benefits of the convex optimization layers presented in our approach, we show the results for different baselines. We run all models with *k* = {1,…,10,25,50,75,100} to find the optimal setting for each baseline and perform a fair comparison. For each question, the baselines take as input a set of hypotheses, where each hypothesis is associated with *k* facts, ranked according to the fact retrieval model.

##### (1) IR Solver

(Clark et al., 2018): This approach attempts to answer the questions by summing the retrieval scores of all *k* facts retrieved for each hypothesis. In this case, the retrieval scores are calculated using the cosine similarity of fact and hypothesis sentence vectors obtained from the STrans model trained on gold explanations. The hypothesis associated with the highest accumulated score is selected as the one containing the correct answer.
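The scoring rule of this baseline can be sketched in a few lines of Python. The similarity values below are hypothetical, and the function name `ir_solver` is used only for illustration.

```python
def ir_solver(retrieval_scores, k):
    """Pick the hypothesis whose top-k retrieval scores have the largest sum."""
    totals = {
        hyp: sum(sorted(scores, reverse=True)[:k])
        for hyp, scores in retrieval_scores.items()
    }
    return max(totals, key=totals.get), totals

# Hypothetical cosine similarities between each hypothesis and its retrieved facts.
retrieval_scores = {
    "h1": [0.9, 0.1, 0.1],
    "h2": [0.6, 0.6, 0.5],
}
best_k1, _ = ir_solver(retrieval_scores, k=1)
best_k3, _ = ir_solver(retrieval_scores, k=3)
print(best_k1, best_k3)
```

The toy example also shows why the choice of *k* matters for this baseline: the prediction flips from `h1` to `h2` once lower-ranked facts are added to the sum.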

##### (2) BERT_{Base} and BERT_{Large}

(Devlin et al., 2019): To use BERT for this task, we concatenate every hypothesis with *k* retrieved facts, using the separator token [SEP]. We use the HuggingFace (Wolf et al., 2019) implementation of *BertForSequenceClassification*, taking the prediction with the highest probability for the positive class as the correct answer.^{3}

##### (3) PathNet

(Kundu et al., 2019): PathNet is a graph-based neural approach that constructs a single linear path composed of two facts connected via entity pairs for reasoning. It uses the constructed paths as evidence of its reasoning process and has exhibited strong performance on multiple-choice science questions.

##### (4) TupleILP and ExplanationLP

Both replications of the non-differentiable solvers are implemented with the same constraints as *Diff-*Explainer via SDP approximation without fine-tuning end-to-end; instead, we fine-tune the *θ* parameters using Bayesian optimization^{4} and frozen STrans representations. This baseline helps us to understand the impact of the end-to-end fine-tuning.

### 4.1 Answer Selection

##### WorldTree Corpus:

Table 1 presents the answer selection performance on the WorldTree corpus in terms of accuracy, presenting the best results obtained for each model after testing for different values of *k*. We also include the results for BERT without explanation in order to evaluate the influence extra facts can have on the final score. We also present the results for two different training goals, optimizing for only the answer and optimizing jointly for answer and explanation selection.

| Model | Acc |
|---|---|
| **Baselines** | |
| IR Solver | 50.48 |
| BERT_{Base} (Without Retrieval) | 45.43 |
| BERT_{Base} | 58.06 |
| BERT_{Large} (Without Retrieval) | 49.63 |
| BERT_{Large} | 59.32 |
| TupleILP | 49.81 |
| ExplanationLP | 62.57 |
| PathNet | 43.40 |
| **Diff-Explainer** | |
| *TupleILP constraints* | |
| - Answer selection only | 61.13 |
| - Answer and explanation selection | 63.11 |
| *ExplanationLP constraints* | |
| - Answer selection only | 69.73 |
| - Answer and explanation selection | 71.48 |

We draw the following conclusions from the empirical results obtained on the WorldTree corpus (the performance increase here is expressed in absolute terms):

- (1) *Diff-*Explainer with ExplanationLP and TupleILP outperforms the respective non-differentiable solvers by 8.91% and 13.3%. This increase in performance indicates that *Diff-*Explainer can incorporate different types of constraints and significantly improve performance compared with the non-differentiable version.
- (2) It is evident from the performance obtained by a large model such as BERT_{Large} (59.32%) that we are dealing with a non-trivial task. The best *Diff-*Explainer setting (with ExplanationLP) outperforms the best transformer-based models with and without explanations by 12.16% and 21.85%, respectively. Additionally, with both TupleILP and ExplanationLP constraints, we obtain better scores than the transformer-based configurations.
- (3) Fine-tuning with explanations yielded better performance than only answer selection with ExplanationLP and TupleILP, improving performance by 1.75% and 1.98%, respectively. The increase in performance indicates that *Diff-*Explainer can learn from the distant supervision of answer selection and improve in a strong supervision setting.
- (4) Overall, we can conclude that incorporating constraints using differentiable convex optimization with transformers for multi-hop QA leads to better performance than pure constraint-based or transformer-only approaches.
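The absolute gains discussed above can be recomputed directly from the accuracies in Table 1. The short Python check below copies those values; the dictionary keys are shorthand labels introduced here for illustration.

```python
# Accuracy values copied from Table 1 (WorldTree).
acc = {
    "bert_large": 59.32,
    "bert_large_no_retrieval": 49.63,
    "tupleilp": 49.81,
    "explanationlp": 62.57,
    "diff_tupleilp_answer_only": 61.13,
    "diff_tupleilp_joint": 63.11,
    "diff_explanationlp_answer_only": 69.73,
    "diff_explanationlp_joint": 71.48,
}

def gain(system, baseline):
    """Absolute accuracy difference in percentage points, rounded to two decimals."""
    return round(acc[system] - acc[baseline], 2)

gains = {
    "ExplanationLP constraints vs. ExplanationLP": gain("diff_explanationlp_joint", "explanationlp"),
    "TupleILP constraints vs. TupleILP": gain("diff_tupleilp_joint", "tupleilp"),
    "Best setting vs. BERT_Large": gain("diff_explanationlp_joint", "bert_large"),
    "Best setting vs. BERT_Large (Without Retrieval)": gain("diff_explanationlp_joint", "bert_large_no_retrieval"),
    "Joint vs. answer only (ExplanationLP)": gain("diff_explanationlp_joint", "diff_explanationlp_answer_only"),
    "Joint vs. answer only (TupleILP)": gain("diff_tupleilp_joint", "diff_tupleilp_answer_only"),
}
for name, value in gains.items():
    print(f"{name}: +{value:.2f}")
```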

##### ARC Corpus:

Table 2 presents a comparison of baselines and our approach with different background knowledge bases: TupleInf, the same as used by TupleILP (Khot et al., 2017), and WorldTree & ConceptNet as used by ExplanationLP (Thayaparan et al., 2021). We also include the original scores reported by the respective approaches.

| Model | Background KB | Acc |
|---|---|---|
| TupleILP (Khot et al., 2017) | TupleInf | 23.83 |
| ExplanationLP (Thayaparan et al., 2021) | WTree & CNet | 40.21 |
| TupleILP (Ours) | TupleInf | 29.12 |
| ExplanationLP (Ours) | WTree & CNet | 37.40 |
| **Diff-Explainer** | | |
| TupleILP Constraints | TupleInf | 33.95 |
| ExplanationLP Constraints | WTree & CNet | 42.95 |

For this dataset, we use our approach with the same settings as the model applied to WorldTree, and fine-tune for only answer selection since ARC does not have gold explanations. Models employing Large Language Models (LLMs) trained across multiple question answering datasets, such as UnifiedQA (Khashabi et al., 2020) and AristoBERT (Xu et al., 2021), have demonstrated strong performance on ARC, with accuracies of 81.14 and 68.95, respectively.

To ensure a fair comparison, we only compare the best configuration of *Diff-*Explainer with other approaches that have been trained *only* on the ARC corpus and provide some form of explanations in Table 3. Here the explainability column indicates if the model delivers an explanation for the predicted answer. A subset of the approaches produces evidence for the answer but remains intrinsically black-box. These models have been marked as *Partial*.

| Model | Explainable | Accuracy |
|---|---|---|
| BERT_{Large} | No | 35.11 |
| IR Solver (Clark et al., 2016) | Yes | 20.26 |
| TupleILP (Khot et al., 2017) | Yes | 23.83 |
| TableILP (Khashabi et al., 2016) | Yes | 26.97 |
| ExplanationLP (Thayaparan et al., 2021) | Yes | 40.21 |
| DGEM (Clark et al., 2016) | Partial | 27.11 |
| KG^{2} (Zhang et al., 2018) | Partial | 31.70 |
| ET-RR (Ni et al., 2019) | Partial | 36.61 |
| Unsupervised AHE (Yadav et al., 2019a) | Partial | 33.87 |
| Supervised AHE (Yadav et al., 2019a) | Partial | 34.47 |
| AutoRocc (Yadav et al., 2019b) | Partial | 41.24 |
| Diff-Explainer (ExplanationLP) | Yes | 42.95 |

- (1) *Diff-*Explainer improves the performance of non-differentiable solvers regardless of the background knowledge and constraints. With the same background knowledge, our model improves the original TupleILP and ExplanationLP by 10.12% and 2.74%, respectively.
- (2) Our approach also achieves the highest performance among partially and fully explainable approaches trained *only* on the ARC corpus.
- (3) As illustrated in Table 3, we outperform the next best fully explainable baseline (ExplanationLP) by 2.74%. We also outperform the state-of-the-art model AutoRocc (Yadav et al., 2019b), which uses BERT_{Large} and is trained only on the ARC corpus, by 1.71% with 230 million fewer parameters.
- (4) Overall, we achieve consistent performance improvement over different knowledge bases (TupleInf, WorldTree & ConceptNet) and question sets (ARC, WorldTree), indicating the robustness of the approach.

### 4.2 Explanation Selection

Table 4 shows the Precision@K scores for explanation retrieval for PathNet, ExplanationLP/TupleILP, and *Diff-*Explainer with ExplanationLP/TupleILP trained on answer and explanation selection. We choose Precision@K as the evaluation metric because the approaches are not designed to construct full explanations but to take the top *k* = 2 explanations and select the answer.

| Model | Precision@1 | Precision@2 |
|---|---|---|
| TupleILP | 40.44 | 31.21 |
| ExplanationLP | 51.99 | 40.41 |
| PathNet | 19.79 | 13.73 |
| **Diff-Explainer** | | |
| TupleILP (Best) | 40.64 | 32.23 |
| ExplanationLP (Best) | 56.77 | 41.91 |

As evident from the table, our approach significantly outperforms PathNet. We also improve the explanation selection performance over the non-differentiable solvers, indicating that end-to-end fine-tuning also helps improve the selection of explanatory facts.
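For reference, Precision@K for a single question can be computed as in the following Python sketch. It uses the conventional definition of the metric; the fact identifiers are hypothetical, and averaging the per-question values over the dataset is assumed rather than specified in the text.

```python
def precision_at_k(ranked_facts, gold_facts, k):
    """Fraction of the top-k ranked facts that appear in the gold explanation."""
    top_k = ranked_facts[:k]
    return sum(1 for fact in top_k if fact in gold_facts) / k

# Hypothetical ranking of selected facts and gold explanation for one question.
ranked = ["f3", "f7", "f1"]
gold = {"f3", "f1"}

p_at_1 = precision_at_k(ranked, gold, 1)
p_at_2 = precision_at_k(ranked, gold, 2)
print(p_at_1, p_at_2)
```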

### 4.3 Answer Selection with Increasing Distractors

As noted by previous works (Yadav et al., 2019b, 2020), the answer selection performance of Transformers can decrease when increasing the number of used facts *k*. We evaluate how our approach compares with transformer-based approaches in this respect; the results are presented in Figure 3.

As we can see, the IR Solver decreases in performance as we add more facts, while the scores for transformer-based models start deteriorating for *k* > 5. Such results might seem counter-intuitive since it would be natural to expect a model’s performance to increase as we add supporting facts. However, in practice, that does not apply as by adding more facts, there is an addition of distractors that such models may not filter out.

This is most prominent for BERT_{Large}, which shows a sudden drop in performance for *k* = 10, going from 56.61 to 30.26. Such a drop is likely caused by substantial overfitting; with the added noise, the model partially loses its ability to generalize. A softer version of this phenomenon is also observed for BERT_{Base}.

In contrast, our model’s performance increases as we add more facts, reaching a stable point around *k* = 50. Such performance stems from our combination of overlap and relevance scores along with the structural and semantic constraints. The obtained results highlight our model’s robustness to distracting knowledge, allowing its use in data-rich scenarios, where one needs to use facts from extensive knowledge bases. PathNet is also exhibiting robustness across increasing distractors, but we consistently outperform it across all *k* configurations.

On the other hand, for smaller values of *k* our model is outperformed by transformer-based approaches, hinting that our model is more suitable for scenarios involving large knowledge bases such as the one presented in this work.

### 4.4 Qualitative Analysis

We selected some qualitative examples that showcase how end-to-end fine-tuning can improve the quality of inference and present them in Table 5. We use ExplanationLP both as the non-differentiable solver and as the constraint set for *Diff-*Explainer, since they yield higher performance in answer and explanation selection.

Question (1): Fanning can make a wood fire burn hotter because the fanning: Correct Answer: adds more oxygen needed for burning. |

PathNet |

Answer: provides the energy needed to keep the fire going. Explanations: (i) fanning a fire increases the oxygen near the fire, (ii) placing a heavy blanket over a fire can be used to keep oxygen from reaching a fire |

ExplanationLP |

Answer: increases the amount of wood there is to burn. Explanations: (i) more burning causes fire to be hotter, (ii) wood burns |

Diff-Explainer ExplanationLP |

Answer: adds more oxygen needed for burning. Explanations: (i) more burning causes fire to be hotter, (ii) fanning a fire increases the oxygen near the fire |

Question (2): Which type of graph would best display the changes in temperature over a 24 hour period? Correct Answer: line graph. |

PathNet |

Answer: circle/pie graph. Explanations: (i) a line graph is used for showing change; data over time |

ExplanationLP |

Answer: circle/pie graph. Explanations: (i) 1 day is equal to 24 hours, (ii) a circle graph; pie graph can be used to display percents; ratios |

Diff-Explainer ExplanationLP |

Answer: line graph. Explanations: (i) a line graph is used for showing change; data over time, (ii) 1 day is equal to 24 hours |

Question (3): Why has only one-half of the Moon ever been observed from Earth? Correct Answer: The Moon rotates at the same rate that it revolves around Earth. |

PathNet |

Answer: The Moon has phases that coincide with its rate of rotation. Explanations: (i) the moon revolving around; orbiting the Earth causes the phases of the moon, (ii) a new moon occurs 14 days after a full moon |

ExplanationLP |

Answer: The Moon does not rotate on its axis. Explanations: (i) the moon rotates on its axis, (ii) the dark half of the moon is not visible |

Diff-Explainer ExplanationLP |

Answer: The Moon is not visible during the day. Explanations: (i) the dark half of the moon is not visible, (ii) a complete revolution; orbit of the moon around the Earth takes 1; one month |

For Question (1), *Diff-*Explainer retrieves both explanations correctly and answers correctly. Both PathNet and ExplanationLP retrieve at least one correct explanation but perform incorrect inference. We hypothesize that the other two approaches were distracted by the lexical overlaps between question/answer and facts, while our approach is robust to distractor terms. For Question (2), our model was only able to retrieve one correct explanation and was distracted by lexical overlap into retrieving an irrelevant one. However, it was still able to answer correctly. For Question (3), all the approaches, including ours, answered the question incorrectly. Even though our approach was able to retrieve at least one correct explanation, it was not able to combine the information to answer and was distracted by lexical noise. These shortcomings indicate that more work can be done, and that different constraints for combining facts can be explored.

## 5 Conclusion

We presented a novel framework for encoding explicit and controllable assumptions as an end-to-end learning framework for question answering. We empirically demonstrated how incorporating these constraints in broader Transformer-based architectures can improve answer and explanation selection. The presented framework adopts constraints from TupleILP and ExplanationLP, but *Diff-*Explainer can be extended to encode different constraints with varying degrees of complexity.

This approach can also be extended to handle other forms of multi-hop QA, including open-domain, cloze style, and answer generation. ILP has also been employed for relation extraction (Roth and Yih, 2004; Choi et al., 2006; Chen et al., 2014), semantic role labeling (Punyakanok et al., 2004; Koomen et al., 2005), sentiment analysis (Choi and Cardie, 2009), and explanation regeneration (Gupta and Srinivasaraghavan, 2020). We can adapt and improve the constraints presented in this approach to build explainable approaches for the respective tasks.

To the best of our knowledge, *Diff-*Explainer is the first work investigating the intersection of explicit constraints and latent neural representations. We hope this work will open the way for future lines of research on neuro-symbolic models, leading to more controllable, transparent and explainable NLP models.

## Acknowledgments

The work is partially funded by the EPSRC grant EP/T026995/1 entitled “EnnCore: End-to-End Conceptual Guarding of Neural Architectures” under *Security for all in an AI enabled society*. The authors would like to thank the anonymous reviewers and editors of TACL for the constructive feedback. Additionally, we would like to thank the Computational Shared Facility of the University of Manchester for providing the infrastructure to run our experiments.

## 6 Appendix

### 6.1 Model Description

This section presents a detailed explanation of TupleILP and ExplanationLP:

##### TupleILP

TupleILP uses Subject-Predicate-Object tuples for aligning and constructing the explanation graph. As shown in Figure 4C, the tuple graph is constructed and lexical overlaps are aligned to select the explanatory facts. The constraints are designed based on the position of text in the tuple.

##### ExplanationLP

Given hypothesis *H*_{1} from Figure 4A, the underlying concept the hypothesis attempts to test is the understanding of *friction*. Different ILP approaches would attempt to build the explanation graph differently. For example, ExplanationLP (Thayaparan et al., 2021) would classify the core scientific facts (*F*_{6}-*F*_{8}) as *abstract facts* and the linking facts (*F*_{1}-*F*_{5}), which connect generic or abstract terms in the hypothesis, as *grounding facts*. The constraints are designed to emulate abstraction by moving from the concrete statement to more abstract concepts via the grounding facts, as shown in Figure 4B.

### 6.2 Objective Function

In this section, we explain how to design the objective function for TupleILP and ExplanationLP for use with *Diff-*Explainer.

Given *n* candidate hypotheses and *k* candidate explanatory facts, *A* represents an adjacency matrix of dimension ((*n* + *k*) × (*n* + *k*)) where the first *n* columns and rows denote the candidate hypotheses, while the remaining rows and columns represent the candidate explanatory facts. The adjacency matrix denotes the graph’s lexical connections between hypotheses and facts. Specifically, each entry in the matrix *A*_{ij} contains the following values:
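To make the index layout of *A* concrete, the following minimal sketch builds a matrix with the block structure described above (hypotheses first, facts after). It does not reproduce the paper's entry values: the binary lexical-overlap entries, the exclusion of hypothesis-to-hypothesis edges, the whitespace tokenizer, and the name `build_adjacency` are all assumptions made for illustration.

```python
def build_adjacency(hypotheses, facts, stopwords=frozenset()):
    """Return an (n+k) x (n+k) 0/1 matrix: hypotheses occupy the first n
    rows/columns, facts the remaining k.

    Assumption: an entry is 1 when two nodes share at least one term and
    they are not both hypotheses; this is an illustrative rule only.
    """
    def terms(text):
        return {w for w in text.lower().split() if w not in stopwords}

    nodes = [terms(t) for t in list(hypotheses) + list(facts)]
    n = len(hypotheses)
    size = len(nodes)
    A = [[0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if i == j:
                continue  # no self-loops in this sketch
            if i < n and j < n:
                continue  # assumption: hypotheses are never linked to each other
            if nodes[i] & nodes[j]:
                A[i][j] = 1
    return A


hypotheses = ["fanning adds oxygen", "fanning adds wood"]
facts = ["fanning a fire increases the oxygen near the fire", "wood burns"]
A = build_adjacency(hypotheses, facts, stopwords=frozenset({"a", "the"}))
```

With two hypotheses and two facts, `A` is 4 × 4; rows and columns 0 and 1 index the hypotheses, and 2 and 3 index the facts.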

Given the relevance scoring functions, we construct the edge weight matrix (*W*) via a weighted function for each approach as follows:

##### TupleILP

The weight function for *Diff-*Explainer with TupleILP constraints is:

##### ExplanationLP

Given the Abstract KB (*F*_{A}) and Grounding KB (*F*_{G}), the weight function for *Diff-*Explainer with ExplanationLP is as follows:

### 6.3 Constraints with Disciplined Parameterized Programming (DPP)

In order to adopt differentiable convex optimization layers, the constraints should be defined following the Disciplined Parameterized Programming (DPP) formalism (Agrawal et al., 2019b), providing a set of conventions when constructing convex optimization problems. DPP consists of functions (or *atoms*) with a known curvature (affine, convex or concave) and per-argument monotonicities. In addition to these, DPP also consists of *Parameters* which are symbolic constants with an unknown numerical value assigned during the solver run.

##### TupleILP

In addition to the aforementioned constraints and the semidefinite constraints specified in Equation 7, we adopt part of the constraints from TupleILP (Khot et al., 2017). In order to implement the TupleILP constraints, we extract SPO tuples $f_i^t = \{f_i^S, f_i^P, f_i^O\}$ for each fact *f*_{i} using an Open Information Extraction model (Stanovsky et al., 2018). From each hypothesis *h*_{i} we extract the set of unique terms $h_i^t = \{t_1^{h_i}, t_2^{h_i}, t_3^{h_i}, \dots, t_l^{h_i}\}$, excluding stopwords, and we write *H*^{t} = {*t*_{1}, *t*_{2}, *t*_{3}, …, *t*_{l}} for the set of unique terms extracted from the hypotheses *H*. The constraints are described in Table 6.
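As a rough illustration of the term-extraction step, the sketch below collects the unique non-stopword terms of a set of hypotheses and intersects them with the subject field of one tuple. The stopword list, the regular-expression tokenizer, the `SPOTuple` container, and the hand-written tuple are assumptions for illustration; the paper obtains its tuples from an Open Information Extraction model, which is not reproduced here.

```python
import re
from typing import NamedTuple

# Illustrative stopword list (an assumption, not the list used in the paper).
STOPWORDS = frozenset({"a", "an", "the", "is", "of", "to", "for", "can", "be"})


class SPOTuple(NamedTuple):
    """Hypothetical container for one Subject-Predicate-Object tuple."""
    subject: str
    predicate: str
    obj: str


def trm(text):
    """Lowercased word tokens of `text`, excluding stopwords."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def hypothesis_terms(hypotheses):
    """Sorted unique terms over all hypotheses (playing the role of H^t)."""
    terms = set()
    for h in hypotheses:
        terms |= trm(h)
    return sorted(terms)


# Hand-written tuple standing in for Open IE output.
fact = SPOTuple("fanning a fire", "increases", "the oxygen near the fire")
hypotheses = [
    "Fanning adds more oxygen needed for burning",
    "Fanning increases the amount of wood there is to burn",
]
H_t = hypothesis_terms(hypotheses)
subject_overlap = trm(hypotheses[0]) & trm(fact.subject)
```

Here `subject_overlap` contains the single shared term `fanning`, which is the kind of lexical overlap the tuple-based constraints operate on.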

Description | DPP Format | Parameters |
---|---|---|
TupleILP | ||
Subgraph must have ≤ w_{1} active tuples | $\sum_{i \in F} Y_{ii} \le w_1 + 1$ (16) | – |
Active hypothesis term must have ≤ w_{2} edges | $H_\theta[:,:,i] \odot Y \le w_2 \quad \forall i \in H^t$ (17) | H_{θ} is populated by hypothesis term matrix H with dimension ((n + k) × (n + k) × l) and the values are given by: $H_{ijk} = \begin{cases} 1, & \forall k \in H^t,\ i \in H,\ j \in F,\ t_k \in \mathit{trm}(h_i),\ t_k \in \mathit{trm}(f_j) \\ 1, & \forall k \in H^t,\ i \in F,\ j \in H,\ t_k \in \mathit{trm}(h_j),\ t_k \in \mathit{trm}(f_i) \\ 0, & \text{otherwise} \end{cases}$ (18) |
Active tuple must have active subject | $Y \odot T_\theta^S \ge E \odot A_\theta$ (19) | A_{θ} populated by adjacency matrix A, $T_\theta^S$ by subject tuple matrix T^{S} with dimension ((n + k) × (n + k)) and the values are given by: $T_{ij}^S = \begin{cases} 1, & i \in H,\ j \in F,\ \lvert \mathit{trm}(h_i) \cap \mathit{trm}(f_j^S) \rvert > 0 \\ 1, & i \in F,\ j \in H,\ \lvert \mathit{trm}(h_j) \cap \mathit{trm}(f_i^S) \rvert > 0 \\ 0, & \text{otherwise} \end{cases}$ (20) |
Active tuple must have ≥ w_{3} active fields | $Y \odot T_\theta^S + Y \odot T_\theta^P + Y \odot T_\theta^O \ge w_3 (Y \odot A_\theta)$ (21) | A_{θ} populated by adjacency matrix A and $T_\theta^S$, $T_\theta^P$, $T_\theta^O$ populated by subject, predicate and object matrices T^{S}, T^{P}, T^{O} respectively. Predicate and object tuples are converted into T^{P}, T^{O} matrices similar to T^{S} |
Active tuple must have an edge to some hypothesis term | Implemented during graph construction by only considering tuples that have lexical overlap with a hypothesis | – |
ExplanationLP | ||
Limits the total number of abstract facts to w_{4} | $\mathrm{diag}(Y) \cdot F_\theta^{AB} \le w_4$ (22) | $F_\theta^{AB}$ is populated by Abstract fact matrix F^{AB}, where: $F_{ij}^{AB} = \begin{cases} 1, & i \in H,\ j \in F_A \\ 0, & \text{otherwise} \end{cases}$ (23) |
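To illustrate how the parameter matrices in Table 6 can be populated, the sketch below builds a subject matrix following the case definition in Equation 20 and checks the tuple budget of Equation 16 on a candidate selection. It is a plain-Python illustration under stated assumptions, not the DPP implementation: `Y` is taken to be a binary matrix whose diagonal marks active nodes, stopword removal is omitted for brevity, and the helper names and example sentences are hypothetical.

```python
import re


def trm(text):
    """Lowercased word tokens of `text` (stopword removal omitted here)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def subject_matrix(hypotheses, fact_subjects):
    """Build T^S as in Equation 20: entry (i, j) is 1 when a hypothesis and
    the subject field of a fact tuple share at least one term."""
    n, k = len(hypotheses), len(fact_subjects)
    size = n + k
    T = [[0] * size for _ in range(size)]
    for i, h in enumerate(hypotheses):
        for j, subj in enumerate(fact_subjects):
            if trm(h) & trm(subj):
                T[i][n + j] = 1  # case i in H, j in F
                T[n + j][i] = 1  # case i in F, j in H
    return T


def within_tuple_budget(Y, n, w1):
    """Equation 16: the sum of Y_ii over fact nodes must be <= w1 + 1."""
    active = sum(Y[i][i] for i in range(n, len(Y)))
    return active <= w1 + 1


hypotheses = ["fanning adds more oxygen", "fanning adds more wood"]
fact_subjects = ["fanning a fire", "wood"]
T_S = subject_matrix(hypotheses, fact_subjects)

# Candidate selection: first hypothesis and both facts active.
Y = [[0] * 4 for _ in range(4)]
Y[0][0] = Y[2][2] = Y[3][3] = 1
```

With both facts active, `within_tuple_budget(Y, n=2, w1=1)` holds while `within_tuple_budget(Y, n=2, w1=0)` does not, since two active tuples exceed a budget of one.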

##### ExplanationLP

ExplanationLP constraints are described in Table 6.

## Notes

See Helmberg (2000) for the derivation from the original optimization problem.

We fine-tune *Diff-*Explainer using a learning rate of 1*e*-5, 14 epochs, with a batch size of 8.

We fine-tune both versions of BERT using a learning rate of 1*e*-5, 10 epochs, with a batch size of 16 for *Base* and 8 for *Large*.

We fine-tune for 50 epochs using the Adaptive Experimentation Platform.

## References

## Author notes

Action Editor: Minlie Huang