## Abstract

We introduce the Probabilistic Worldbuilding Model (PWM), a new fully symbolic Bayesian model of semantic parsing and reasoning, as a first step in a research program toward more domain- and task-general NLU and AI. Humans create internal mental models of their observations that greatly aid in their ability to understand and reason about a large variety of problems. In PWM, the meanings of sentences, acquired facts about the world, and intermediate steps in reasoning are all expressed in a human-readable formal language, with the design goal of interpretability. PWM is Bayesian, designed specifically to be able to generalize to new domains and new tasks. We derive and implement an inference algorithm that reads sentences by parsing and abducing updates to its latent world model that capture the semantics of those sentences, and evaluate it on two out-of-domain question-answering datasets: (1) ProofWriter and (2) a new dataset we call FictionalGeoQA, designed to be more representative of real language but still simple enough to focus on evaluating reasoning ability, while being robust against heuristics. Our method outperforms baselines on both, thereby demonstrating its value as a proof-of-concept.

## 1 Introduction

Despite recent progress in AI and NLP producing algorithms that perform well on a
number of NLP tasks, it is still unclear how to move forward and develop algorithms
that understand language as well as humans do. In particular, large-scale language
models such as BERT (Devlin et al., 2019),
RoBERTa (Liu et al., 2019), GPT-3 (Brown et
al., 2020), XLNet (Yang et al., 2019), and others were trained on a very
large amount of text and can then be applied to perform many different NLP tasks
after some fine-tuning. In the case of GPT-3, some tasks require very few additional
training examples to achieve state-of-the-art performance. As a result of training
on text from virtually every domain, these models are domain-general. This is in
contrast with NLP algorithms that are largely trained on one or a small handful of
domains, and as such, are not able to perform well on new domains outside of their
training. Despite this focus on domain-generality, there are still a large number of
tasks on which these large-scale language models perform poorly (Dunietz et al., 2020). Many limitations of
today’s state-of-the-art methods become evident when comparing with the human
ability to understand language (Lake et al., 2016; Tamari et al., 2020;
Bender and Koller, 2020; Gardner et al., 2019; Linzen, 2020). Many cognitive scientists posit that humans create
rich mental models of the world from their observations which provide superior
explainability, reasoning, and generalizability to new domains and tasks. How do we,
as a field, move from today’s state-of-the-art to more general intelligence?
What are the next steps to develop algorithms that can generalize to new tasks at
the same level as humans? The lack of interpretability in many of these models makes
these questions impossible to answer precisely. One promising direction is to change
the evaluation metric: Brown et al. (2020),
Linzen (2020), and many others have
suggested *zero-shot* or *few-shot accuracy* to
measure the performance of algorithms (i.e., the algorithm is evaluated with a new
dataset, wholly separate from its training; or in the case of few-shot learning,
save for a few examples). While this shift is welcome, it alone will not solve the
above issues.

We introduce the *Probabilistic Worldbuilding Model* (PWM), a probabilistic generative model of reasoning
and semantic parsing. Like some past approaches, PWM explicitly builds an internal mental model, which we call the *theory* (Tamari et al., 2020; Hogan et al., 2021;
Mitchell et al., 2018; Charniak and
Goldman, 1993). The theory constitutes what
the algorithm believes to be true. PWM is fully symbolic
and Bayesian, using a single unified human-readable formal language to represent all
meaning, and is therefore *inherently interpretable*. This is in
contrast to systems that use subsymbolic representations of meaning for some or all
of their components. Every random variable in PWM is
well-defined with respect to other random variables and/or grounded primitives.
Prior knowledge such as the rules of deductive inference, the structure of English
grammar, and knowledge of basic physics and mathematics can be incorporated by
modifying the prior distributions of the random variables in PWM. Incorporating prior knowledge can greatly reduce
the amount of training data required to achieve sufficient generalizability, as we
will demonstrate. Extensibility is key to future research that could enable more
general NLU and AI, as it provides a clearer path forward for future
exploration.

We present an implementation of inference under the proposed model, called *Probabilistic Worldbuilding from Language* (PWL). While PWM is an
abstract mathematical description of the underlying distribution of axioms, proofs,
logical forms, and sentences, PWL is the algorithm that
reads sentences, computes logical form representations of their meaning, and updates
the axioms and proofs in the theory accordingly. See Figure 1 for a high-level schematic diagram of PWM and PWL. PWM describes the process depicted by the red arrows,
whereas PWL is the algorithm depicted by the green arrows.
We emphasize that the reasoning in PWL is not a theorem
prover and is not purely deductive. Instead, PWL solves a
different problem of finding satisfying *abductive* proofs, which is
computationally easier than deductive inference: Given a set of observations, work
backwards to find a set of axioms that deductively *explain* the
observations. It is these abduced axioms that constitute the internal “mental
model.” Humans often rely on abductive reasoning, for example in commonsense
reasoning (Bhagavatula et al., 2020; Furbach
et al., 2015).

A core principle of our approach is to ensure *generality by design*.
Simplifying assumptions often trade away generality for tractability, such as by
restricting the representation of the meanings of sentences, or number of steps
during reasoning. PWM is designed to be domain- and
task-general, and to this end, uses higher-order logic (i.e., lambda calculus)
(Church, 1940) as the formal language,
which we believe is sufficiently expressive to capture the meaning of declarative
and interrogative sentences in natural language. Furthermore, PWM uses *natural deduction* for
reasoning, which is *complete* in that if a logical form *ϕ* is true, there is a proof of *ϕ* (Henkin, 1950).

In Section 3, we describe PWM and PWL more precisely. In Section 4, as a proof-of-concept of the value of further research, we run experiments on two question-answering datasets: ProofWriter (Tafjord et al., 2021) and a new dataset, called FictionalGeoQA, which we specifically created to evaluate the ability to reason over short paragraphs while being robust against simpler heuristic strategies. Unlike ProofWriter, the text in FictionalGeoQA was not template-generated and is more realistic, but is still simple enough to focus the evaluation on reasoning rather than parsing, with many sentences having semantics that go beyond the Horn clause fragment of first-order logic. PWL outperforms the baselines with respect to zero-shot accuracy (i.e., without looking at any training examples). Our code and data are freely available at github.com/asaparov/PWL and github.com/asaparov/fictionalgeoqa.

In summary, the primary contributions of this paper are the following:

- PWM, a new model for more general NLU, and PWL, the implementation that reads sentences, computes their logical forms, and updates its theory accordingly.
- FictionalGeoQA, a new question-answering dataset designed to evaluate the ability to reason over language.
- Experiments on ProofWriter and FictionalGeoQA demonstrating that PWL outperforms baseline methods on question-answering.

## 2 Related Work

*Fully symbolic* methods were commonplace in earlier AI research
(Newell and Simon, 1976; Dreyfus, 1985). However, they were oftentimes brittle:
A new observation would contradict the internal theory or violate an assumption, and
it was not clear how to resolve the impasse in a principled manner and proceed. But
they do have some key advantages: Symbolic approaches that use well-studied
human-readable formal languages such as first-order logic, higher-order logic, type
theory, etc. enable humans to readily inspect and understand the internal processing
of these algorithms, effecting a high degree of interpretability (Dowty, 1981; Gregory, 2015; Cooper et al., 2015). Symbolic systems can be made general by design, by using a
sufficiently expressive formal language and ontology. Hybrid methods have been
explored to alleviate the brittleness of formal systems while engendering their
strengths, such as interpretability and generalizability; for example, the recent
work in *neuro-symbolic* methods (Yi et al., 2020; Saha et al., 2020; Tafjord et al., 2021).
Neural theorem provers are in this vein (Rocktäschel and Riedel, 2017). However, the proofs considered in
these approaches are based on *backward chaining* (Russell and
Norvig, 2010), which restricts the
semantics to the Horn clause fragment of first-order logic. Sun et al. (2020), Ren et al. (2020), and Arakelyan et al. (2021) extend coverage to the existential positive
fragment of first-order logic. In natural language, sentences express more complex
semantics such as negation, nested universal quantification, and higher-order
structures. Our work explores the other side of the tradeoff between tractability
and expressivity/generality. Theorem provers attempt to solve the problem of
deduction: finding a proof of a given formula, given a set of axioms. In contrast,
the reasoning component of PWM is abductive, and the
abduced axioms can be used in downstream tasks, such as question-answering, and to
better read new sentences in the context of the world model, as we will demonstrate.
We posit that abduction is sufficient for more general NLU (Hobbs, 2006; Hobbs et al., 1993). PWM combines Bayesian statistical
machine learning with symbolic representations in order to handle uncertainty in a
principled manner, “smoothing out” or “softening” the
rigidity of a purely symbolic approach. In PWM, the
internal theory is a random variable, so if a new observation is inconsistent with
the theory, there may be other theories in the probability space that are consistent
with the observation. The probabilistic approach provides a principled way to
resolve these impasses.

PWM is certainly not the first to combine symbolic and
probabilistic methods. There is a rich history of *inductive logic
programming* (ILP) (Muggleton, 1991; Cropper and Morel, 2021)
and probabilistic ILP languages (Muggleton, 1996; Cussens, 2001; Sato et
al., 2005; Bellodi and Riguzzi, 2015). These languages could be used to learn
a “theory” from a collection of observations, but they are typically
restricted to learning rules in the form of first-order Horn clauses, for
tractability. In natural language, it is easy to express semantics beyond the Horn
clause fragment of first-order logic.

*Knowledge bases* (KBs) and *cognitive architectures* (Kotseruba and Tsotsos, 2020; Hogan et al., 2021; Laird et al., 1987; Mitchell et al., 2018) have attempted to explicitly model domain-general knowledge in a
form amenable to reasoning. Cognitive architectures aim to more closely replicate
human cognition. Some approaches use probabilistic methods to handle uncertainty
(Niepert et al., 2012; Niepert and
Domingos, 2015; Jain et al., 2019). However, many of these approaches make
strong simplifying assumptions that restrict the expressive power of the formal
language that expresses facts in the KB. For example, many KBs can be characterized
as graphs, where each entity corresponds to a vertex and every fact corresponds to a
labeled edge. For example, the belief plays_sport(s_williams,
tennis) is representable as a directed edge connecting the vertex s_williams to the vertex tennis,
with the edge label plays_sport. While this assumption
greatly aids tractability and scalability, allowing many problems in reasoning to be
solved by graph algorithms, it greatly hinders expressivity and generality, and
there are many kinds of knowledge that simply cannot be expressed and represented in
such KBs. PWM does not make such restrictions on logical
forms in the theory, allowing for richer semantics, such as definitions, universally
quantified statements, conditionals, etc.

## 3 Model

In this section, we provide a mathematical description of PWM. At a high level, the process for generating a sentence sampled from this probability distribution is:

1. Sample the theory *T* from a prior distribution *p*(*T*). *T* is a collection of logical forms in higher-order logic that represent what PWL believes to be true.
2. For each observation *i*, sample a proof *π*_{i} from *p*(*π*_{i} ∣ *T*). The conclusion of the proof is the logical form *x*_{i}, which represents the meaning of the *i*^{th} sentence.
3. Sample the *i*^{th} sentence *y*_{i} from *p*(*y*_{i} ∣ *x*_{i}).

Inference is effectively the inverse of this process, and is implemented by PWL. During inference, PWL is
given a collection of observed sentences *y*_{1},…,*y*_{n} and the goal is to discern the value of the latent variables: the logical form of
each sentence $\boldsymbol{x} \triangleq \{x_1, \ldots, x_n\}$, the
proofs for each logical form $\boldsymbol{\pi} \triangleq \{\pi_1, \ldots, \pi_n\}$, and the
underlying theory *T*. Both the generative process and inference
algorithm naturally divide into two modules:

- **Language module:** During inference, this module’s purpose is to infer the logical form of each observed sentence. That is, given the input sentence *y*_{i}, this module outputs the *k* most-probable values of the logical form *x*_{i} (i.e., semantic parsing).
- **Reasoning module:** During inference, this module’s purpose is to infer the underlying theory that logically entails the observed logical forms (and the proofs thereof). That is, given an input collection of logical forms **x**, this module outputs the posterior distribution of the underlying theory *T* and the proofs **π** of those logical forms.

Note that the *y*_{i} need not necessarily be
sentences, and PWM can easily be generalized to other kinds
of data. For example, if a generative model of images is available for *p*(*y*_{i}∣*x*_{i}),
then an equivalent “vision module” may be defined. This module may be
used either in place of, or together with, the language module. In the above
generative process, PWM assumes each sentence to be
independent. A model of context is required to properly handle inter-sentential
anaphora or conversational settings. This can be done by allowing the distribution
on *y*_{i} to depend on previous logical
forms or sentences: *p*(*y*_{i}∣*x*_{1},…,*x*_{i})
(i.e., relaxing the i.i.d. assumption). For simplicity of this proof-of-concept,
this is left to future work.

There is a vast design space for symbolic representations of meaning. We are unable to comprehensively list all of our design choices, but we describe two important ones below.

Neo-Davidsonian semantics (Parsons, 1990) is
used to represent meaning in all logical forms (both in the theory and during
semantic parsing). As a concrete example, a straightforward way to represent the
meaning of “Jason traveled to New York” could be with the logical form
travel(jason,nyc). In neo-Davidsonian semantics, this would instead be represented
with three distinct atoms: travel(*c*_{1}),
arg1(*c*_{1}) = jason, and
arg2(*c*_{1}) = nyc. Here, *c*_{1} is a constant that represents the “traveling event,” whose first
argument is the constant representing Jason, and whose second argument is the
constant representing New York City. This representation allows the event to be more
readily modified by other logical expressions, such as in “Jason quickly
traveled to NYC before nightfall.”
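The rewriting described above can be illustrated with a minimal sketch. The function name `to_neo_davidsonian`, the string encoding of atoms, and the explicit `event` parameter are assumptions made for illustration; they are not taken from the PWL code.

```python
def to_neo_davidsonian(predicate, args, event):
    """Rewrite predicate(arg_1, ..., arg_n) as neo-Davidsonian atoms.

    `event` is the constant that names the event (e.g., "c1"). The string
    encoding used here is purely illustrative.
    """
    atoms = [f"{predicate}({event})"]
    for position, arg in enumerate(args, start=1):
        # Each argument becomes its own atom attached to the event constant.
        atoms.append(f"arg{position}({event}) = {arg}")
    return atoms

# "Jason traveled to New York": travel(jason, nyc) becomes three atoms.
atoms = to_neo_davidsonian("travel", ["jason", "nyc"], "c1")
print(atoms)
```

Because every atom refers to the event constant, a modifier such as "quickly" could be expressed as one more atom about `c1` without changing the arity of `travel`.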

In addition, PWM defers named entity linking to the
reasoning module (it is not done during parsing). That is, the semantic parser does
not parse “Jason” directly into the constant jason. Rather, named
entities are parsed into existentially quantified expressions, for example,
∃*j*(name(*j*) = “Jason”
∧…). This simplifies the parser’s task and allows reasoning to
aid in entity linking. Table 1 details
these design options.

### 3.1 Reasoning Module

#### 3.1.1 Generative Process for the Theory *p*(T)

*T* is a collection of axioms *a*_{1}, *a*_{2}, … represented in higher-order logic. We choose a fairly simple prior *p*(*T*) for rapid prototyping, but it is straightforward to substitute a more complex prior. Specifically, *a*_{1}, *a*_{2}, … are distributed according to a distribution *G*_{a}, which is sampled from a *Dirichlet process* (DP) (Ferguson, 1973), an exchangeable non-parametric distribution:

$$G_a \sim \mathrm{DP}(H_a, \alpha), \qquad a_1, a_2, \ldots \sim G_a,$$

where *H*_{a} is the *base distribution* and *α* = 0.1 is the concentration parameter. An equivalent perspective of the DP that better illustrates how the samples are generated is the *Chinese restaurant process* (Aldous, 1985):

$$a_i \sim H_a \ \text{with probability } \frac{\alpha}{\alpha + i - 1}, \qquad a_i = a_j \ \text{with probability } \frac{\sum_{k=1}^{i-1} \mathbb{1}\{a_k = a_j\}}{\alpha + i - 1}.$$

That is, the *i*^{th} sample is drawn from *H*_{a} with probability proportional to *α*, or it is set to a previous sample with probability proportional to the number of times that sample has been drawn.
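The Chinese restaurant process can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the PWL implementation: `toy_base` is a hypothetical stand-in for the base distribution *H*_{a}, and only the value *α* = 0.1 comes from the text.

```python
import random

ALPHA = 0.1  # concentration parameter stated in the text

def sample_crp(n, base_sampler, alpha=ALPHA, rng=random):
    """Draw n samples a_1, ..., a_n from a Chinese restaurant process."""
    samples = []
    for i in range(1, n + 1):
        # Fresh draw from the base distribution with probability
        # alpha / (alpha + i - 1); this is always taken when i == 1.
        if rng.random() < alpha / (alpha + i - 1):
            samples.append(base_sampler(rng))
        else:
            # Reuse an earlier draw. Choosing uniformly from the list of
            # earlier draws picks each value in proportion to its count.
            samples.append(rng.choice(samples))
    return samples

def toy_base(rng):
    """Hypothetical stand-in for H_a: one of a few axiom strings."""
    return rng.choice(["cat(c1)", "mammal(c1)", "book(c2)"])

rng = random.Random(0)
axioms = sample_crp(10, toy_base, rng=rng)
print(axioms)
```

With a small *α*, most draws repeat earlier ones, which is one way to see why this prior favors theories with few distinct axioms.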

The base distribution *H*_{a} recursively generates logical forms in higher-order logic. Because any
formula can be written as a tree, they can be generated top-down, starting
from the root. The type of each node (conjunction ∧, disjunction
∨, negation ¬, quantification ∀*x*,
etc.) is sampled from a categorical distribution. If the type of the node is
selected to be an atom (e.g., book(*c*_{1})), then
its predicate is sampled from a non-parametric distribution of predicate
symbols *H*_{p}. The atom’s
argument(s) are each sampled as follows: If *n*_{V} is the number of
available variables (from earlier generated quantifiers), then each of these
variables is selected with probability $\frac{1}{n_V+1}$;
otherwise, with the remaining probability $\frac{1}{n_V+1}$,
sample a constant from a non-parametric distribution of constant symbols *H*_{c}. For brevity, we refer
the reader to our code for the specific forms of *H*_{p} and *H*_{c}.
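As a small illustration of the argument-sampling step, the sketch below treats the *n*_{V} available variables and the single "draw a constant" outcome as *n*_{V} + 1 equally likely choices. The function name and the `sample_constant` callback (a stand-in for *H*_{c}) are assumptions for illustration only.

```python
import random

def sample_argument(variables, sample_constant, rng=random):
    """Sample one atom argument: a bound variable or a constant.

    `variables` lists the variables introduced by enclosing quantifiers.
    `sample_constant` is a hypothetical stand-in for drawing from H_c.
    """
    n_v = len(variables)
    # n_v + 1 equally likely outcomes: each variable, or "draw a constant".
    choice = rng.randrange(n_v + 1)
    if choice < n_v:
        return variables[choice]
    return sample_constant(rng)

rng = random.Random(0)
print([sample_argument(["x", "y"], lambda r: "c1", rng) for _ in range(5)])
```

When no variables are in scope (*n*_{V} = 0), the constant branch is taken with probability 1.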

Since PWM uses a neo-Davidsonian representation,
another node type that *H*_{a} can
generate is an event argument (e.g., arg1(*c*_{1}) =
jason). When this is selected, the event constant
(*c*_{1} in the example) is sampled in the same
way an atom’s argument is sampled, as described above: first by
trying to sample a variable, and otherwise sampling a constant from *H*_{c}. The right side of the
equality (jason in the example) can either be a variable, constant, string,
or number, so PWM first selects its type from a
categorical distribution. If the type is chosen to be a number, string, or
variable, its value is sampled uniformly. If the type is chosen to be a
constant, it is sampled from *H*_{c}.

*Names of entities* are treated specially in this prior: The
number of names available to each entity is sampled according to a very
light-tailed distribution i.i.d.: for entity *c*_{i} the number of names $n_N(c_i) \triangleq \#\{s : \mathrm{name}(c_i) = s\}$ is distributed according to $p(n_N(c_i) = k) \propto \lambda^{k^2}$.
This ensures that entities tend not to have too many names.
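To see how quickly a distribution of the form $p(k) \propto \lambda^{k^2}$ decays, the following sketch normalizes it over a truncated range. The value `lam=0.1`, the truncation at `max_k`, and the choice to start the support at *k* = 0 are all assumptions made for illustration; the text does not specify them.

```python
def name_count_pmf(lam, max_k):
    """Normalize weights lam ** (k * k) over k = 0, ..., max_k."""
    weights = [lam ** (k * k) for k in range(max_k + 1)]
    total = sum(weights)
    return [w / total for w in weights]

# Assumed values for illustration only.
pmf = name_count_pmf(lam=0.1, max_k=5)
for k, p in enumerate(pmf):
    print(k, p)
```

The exponent grows quadratically in *k*, so each additional name is penalized more heavily than the previous one.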

*Sets* are also treated specially in this prior: One kind of
axiom that can be generated is one that declares the size of a
set—for example,
size(*λx*.planet(*x*)) = 8 denotes
that the size of the set of planets is 8. In the prior, the size of each set
is distributed according to a geometric distribution with parameter
10^{−4}. Sets can have arity not equal to 1, in which
case their elements are tuples.

**Deterministic Constraints:** We also impose hard constraints on
the theory *T*. Most importantly, *T* is
required to be *globally consistent*. While this is a
conceptually simple requirement, it is computationally expensive (generally
undecidable even in first-order logic). PWL enforces this constraint by keeping track of the known sets in the theory
(i.e., a set is known if its set size axiom is used in a proof, or if the
set appears as a subset/superset in a universally-quantified axiom, such as
in ∀*x*(cat(*x*)
→mammal(*x*)) where the set *λx*.cat(*x*) is a subset of *λx*.mammal(*x*)). For each set, PWL computes which elements are provably
members of that set. If the number of provable members of a set is greater
than its size, or if an element is both provably a member and not a member
of a set, the theory is inconsistent. Relaxing this constraint would be
valuable in future research, perhaps instead by only considering the *relevant* sets rather than all sets in the theory, or
deferring consistency checks altogether. We place a handful of other
constraints on the theory *T*: The name of an entity must be
a string (and not a number or a constant). All constants are distinct; that
is, *c*_{i}≠*c*_{j} for all *i*≠*j*. This helps to
alleviate identifiability issues, as otherwise, there would be a much larger
number of logically equivalent theories. No event can be an argument of
itself (e.g., there is no constant *c*_{i} such that
arg1(*c*_{i}) = *c*_{i}). If a theory *T* satisfies all constraints, we write
“*T* valid.”
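The set-based consistency test described above can be sketched as follows. The data structures (dictionaries keyed by set name) and the function name are hypothetical simplifications; the actual bookkeeping in PWL is in the released code.

```python
def is_consistent(set_sizes, provable_members, provable_non_members):
    """Check the two set-based conditions described in the text.

    set_sizes: set name -> declared size (from a set size axiom).
    provable_members: set name -> elements provably in the set.
    provable_non_members: set name -> elements provably not in the set.
    """
    for name, members in provable_members.items():
        size = set_sizes.get(name)
        # More provable members than the declared size is a contradiction.
        if size is not None and len(members) > size:
            return False
        # An element cannot be both provably in and provably out of a set.
        if members & provable_non_members.get(name, set()):
            return False
    return True

sizes = {"planet": 2}
print(is_consistent(sizes, {"planet": {"c1", "c2"}}, {}))
print(is_consistent(sizes, {"planet": {"c1", "c2", "c3"}}, {}))
```

This sketch only covers the two conditions named in the paragraph; it does not attempt the general (undecidable) consistency problem.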

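The generative process for *T* is *conditioned* on *T* being valid:

$$p(T \mid T \text{ valid}) = \frac{p(T)}{p(T \text{ valid})}, \qquad p(T \text{ valid}) = \sum_{T' :\, T' \text{ valid}} p(T'). \tag{5}$$

The normalizing term *p*(*T* valid) is intractable to compute, but inference requires only ratios of prior probabilities, in which this term cancels. Because the above constraints do not depend on the *order* of the axioms, constants, and so forth (i.e., the constraints themselves are exchangeable), the distribution of *T* conditioned on *T* being valid is exchangeable.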

**Properties of the Prior p(T):** We emphasize that these
distributions were chosen for simplicity and ease of implementation, and
they worked well enough in experiments. However, there are likely many
distributions that would work just as well. The parameters in the above
distributions are not learned; they were set and fixed a priori.
Nevertheless, this prior does exhibit useful properties for a domain- and
task-general model of reasoning:

- *Occam’s razor:* Smaller/simpler theories are given higher probability than larger and more complex theories.
- *Consistency:* Inconsistent theories are discouraged or impossible.
- Entities tend to have a unique name. Our prior above encodes one direction of this prior belief: Each entity is unlikely to have many names. However, the prior does not discourage one name from referring to multiple entities.
- Entities tend to have a unique type. Note, however, that this does not discourage types provable by subsumption. For example, if the theory has the axioms novel(*c*_{1}) and ∀*x*(novel(*x*) → book(*x*)), even though book(*c*_{1}) is provable, it is not an axiom in this example and the prior only applies to axioms.

#### 3.1.2 Generative Process for Proofs *p*(*π*_{i}∣*T*)

PWM uses *natural deduction*, a
well-studied proof calculus, for the proofs (Gentzen, 1935, 1969). Pfenning (2004)
provides an accessible introduction. Figure 2 illustrates a simple example of a natural deduction proof. Each
horizontal line is a deduction step, with the (zero or more) formulas above
the line being its *premises*, and the one formula below the
line being its conclusion. Each deduction step has a label to the right of
the line. For example, the “∧I” step denotes *conjunction introduction*: given that *A* and *B* are true, this step concludes that *A* ∧ *B* is true, where *A* and *B* can be any formula. A natural deduction proof can
rely on axioms (denoted by “Ax”).

We can write any natural deduction proof *π*_{i} as a sequence
of deduction steps $\pi_i \triangleq (\pi_{i,1}, \ldots, \pi_{i,k})$ by
traversing the proof tree in prefix order. We define a simple generative
process for *π*_{i}:

1. First sample the length of the proof *k* from a Poisson distribution with parameter 20.
2. For each *j* = 1,…,*k*: Select a deduction rule from the proof calculus with a categorical distribution. If the Ax rule is selected, then simply take the next available axiom from the theory *T* = *a*_{1}, *a*_{2}, … If the deduction rule requires premises, then each premise is selected uniformly at random from *π*_{i,1},…,*π*_{i,j−1}.^{1}
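The two steps above can be sketched as an unconditioned sampler that produces a proof "skeleton" without checking that the result is a valid proof. Everything named here is an assumption for illustration: the rule subset in `RULES`, the uniform categorical over rules, the axiom stream, and the simplification that premise-taking rules are skipped for the first step.

```python
import itertools
import math
import random

# Hypothetical subset of deduction rules, mapped to the number of premises.
RULES = {"Ax": 0, "AndI": 2, "AndE": 1, "OrI": 1}

def sample_poisson(mean, rng):
    """Knuth's multiplication method; fine for moderate means such as 20."""
    threshold = math.exp(-mean)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

def sample_proof_skeleton(axioms, rng, mean_length=20):
    """Sample steps (rule, payload); premises are indices of earlier steps."""
    steps = []
    for j in range(sample_poisson(mean_length, rng)):
        # Simplification: premise-taking rules are skipped for the first step.
        allowed = [rule for rule, arity in RULES.items() if arity == 0 or j > 0]
        rule = rng.choice(allowed)
        if rule == "Ax":
            steps.append((rule, next(axioms)))  # next available axiom
        else:
            # Each premise is chosen uniformly from the earlier steps.
            premises = tuple(rng.randrange(j) for _ in range(RULES[rule]))
            steps.append((rule, premises))
    return steps

rng = random.Random(0)
axiom_stream = (f"a{i}" for i in itertools.count(1))
skeleton = sample_proof_skeleton(axiom_stream, rng)
print(len(skeleton), skeleton[:3])
```

Most skeletons drawn this way would not be valid proofs, which is why the model conditions on validity rather than sampling this way directly.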

The proof *π*_{i} is sampled conditioned on *π*_{i} being a valid proof. Just as with *p*(*T*) in equation 5, this conditioning causes *p*(*π*_{i} ∣ *T*) to be intractable to compute. However, inference requires only the ratio of prior probabilities, which can be computed efficiently.

Although PWL was initially implemented assuming
classical logic, it is easy to adapt PWL to use
other logics, such as *intuitionistic logic*. Intuitionistic
logic is identical to classical logic except that the *law of the
excluded middle* *A* ∨ ¬*A* is not a theorem (see Figure 3 for an example where the
two logics disagree). The interpretable nature of the reasoning module makes
it easy to adapt it to other kinds of logic or proof calculi. PWL supports both classical and intuitionistic
logic.

#### 3.1.3 Inference

Having described the generative process for the theory *T* and proofs **π**, we now describe inference. Given logical forms **x**, the goal is to compute the posterior distribution of *T* and **π** such that the conclusion of each proof *π*_{i} is *x*_{i}. That is, PWL aims to recover the latent theory and proofs that explain/entail the given observed logical forms. To this end, PWL uses Metropolis-Hastings (MH) (Hastings, 1970; Robert and Casella, 2004). PWL performs inference in a streaming fashion, starting with the case *n* = 1 to obtain MH samples from *p*(*π*_{1}, *T* | *x*_{1}). Then, for every new logical form *x*_{n}, PWL uses the last sample from *p*(*π*_{1},…,*π*_{n−1}, *T* | *x*_{1},…,*x*_{n−1}) as a starting point and then obtains MH samples from *p*(*π*_{1},…,*π*_{n}, *T* | *x*_{1},…,*x*_{n}). This warm-start initialization serves to dramatically reduce the number of iterations needed to mix the Markov chain.

To obtain the MH samples, the proof of each new logical form $\pi_n^{(0)}$ is initialized using Algorithm 1, whereas the proofs of previous logical forms are kept from the last MH sample. The axioms in these proofs constitute the theory sample *T*^{(0)}. Then, for each iteration *t* = 1,…,*N*_{iter}, MH proposes a mutation to one or more proofs in **π**^{(t)}. The possible mutations are listed in Table 2. This may change axioms in *T*^{(t)}. Let *T*′, *π*_{i}′ be the newly proposed theory and proofs. Then, compute the acceptance probability:

$$\min\left\{1,\ \frac{p(T') \prod_{i=1}^{n} p(\pi_i' \mid T')}{p(T^{(t)}) \prod_{i=1}^{n} p(\pi_i^{(t)} \mid T^{(t)})} \cdot \frac{g(T^{(t)}, \boldsymbol{\pi}^{(t)} \mid T', \boldsymbol{\pi}')}{g(T', \boldsymbol{\pi}' \mid T^{(t)}, \boldsymbol{\pi}^{(t)})}\right\},$$

where *g*(*T*′, **π**′ | *T*^{(t)}, **π**^{(t)}) is the probability of proposing the mutation from *T*^{(t)}, **π**^{(t)} to *T*′, **π**′, and *g*(*T*^{(t)}, **π**^{(t)} | *T*′, **π**′) is the probability of the *inverse* of this mutation. Because this quantity depends only on the *ratio* of probabilities, it can be computed efficiently (see equations 7 and 8). Once this quantity is computed, sample from a Bernoulli with this quantity as its parameter. If it succeeds, MH accepts the proposed theory and proofs as the next sample: *T*^{(t+1)} = *T*′ and $\pi_i^{(t+1)} = \pi_i'$. Otherwise, reject the proposal and keep the old sample: *T*^{(t+1)} = *T*^{(t)} and $\pi_i^{(t+1)} = \pi_i^{(t)}$.

If every possible theory and proof is reachable from the initial theory by a sequence of mutations, then with sufficiently many iterations, the samples *T*^{(t)} and $\pi_i^{(t)}$ will be distributed according to the true posterior *p*(*T*, **π** | **x**). If only a subset of possible theories and proofs are reachable from the initial theory, the MH samples will be distributed according to the true posterior *conditioned* on that subset. This may suffice for many applications, particularly if the theories in the subset have desirable properties such as better tractability. But the subset cannot be too small because then PWL would lose generality.
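The accept/reject step can be illustrated with a generic Metropolis-Hastings sketch on a toy state space. This is not PWL's sampler: the three states stand in for theory/proof pairs, and `WEIGHTS`, `propose`, and `run_chain` are hypothetical names. Only the structure of the acceptance ratio (target ratio times inverse-over-forward proposal probability) mirrors the description above.

```python
import math
import random

def mh_step(state, log_target, propose, rng):
    """One Metropolis-Hastings step; returns the next state."""
    proposal, log_forward, log_reverse = propose(state, rng)
    # log of [target(proposal) / target(state)] * [g(inverse) / g(forward)]
    log_ratio = log_target(proposal) - log_target(state) + log_reverse - log_forward
    if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
        return proposal  # accept
    return state  # reject: keep the old sample

# Toy unnormalized target over three states (stand-ins for theories/proofs).
WEIGHTS = {"A": 1.0, "B": 2.0, "C": 7.0}

def log_target(state):
    return math.log(WEIGHTS[state])

def propose(state, rng):
    """Propose one of the other two states uniformly (symmetric proposal)."""
    others = [s for s in WEIGHTS if s != state]
    return rng.choice(others), math.log(0.5), math.log(0.5)

def run_chain(n_iter, seed=0):
    rng = random.Random(seed)
    state, counts = "A", {s: 0 for s in WEIGHTS}
    for _ in range(n_iter):
        state = mh_step(state, log_target, propose, rng)
        counts[state] += 1
    return {s: c / n_iter for s, c in counts.items()}

frequencies = run_chain(20000)
print(frequencies)
```

Only ratios of target values appear in `mh_step`, so the target never needs to be normalized; this is the same property that lets the intractable normalizers of the conditioned priors drop out.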

Table 2: MH proposals (mutations) and the probability of selecting each.

| Proposal | Probability of selecting proposal |
|---|---|
| Select a grounded atomic axiom (e.g., square(c_{1})) and propose to replace it with an instantiation of a universal quantification (e.g., ∀x(rectangle(x) ∧ rhombus(x) → square(x))), where the antecedent conjuncts are selected uniformly at random from the other grounded atomic axioms for the constant c_{1}: rectangle(c_{1}), rhombus(c_{1}), etc. | $\frac{1}{N}$ |
| The inverse of the above proposal: select an instantiation of a universal quantification and replace it with a grounded atomic axiom. | $\frac{1}{N}$ |
| Select an axiom that declares the size of a set (e.g., of the form size(us_states) = 50), and propose to change the size of the set by sampling from the prior distribution, conditioned on the maximum and minimum consistent set size. | $\frac{1}{N}$ |
| Select a node from a proof tree of type ∨I, →I, or ∃I.^{3} These nodes were created in Algorithm 1 on lines 7, 16, and 30, respectively, where for each node, a single premise was selected out of a number of possible premises. This proposal naturally follows from the desire to explore other selections by re-sampling the proof: it simply calls init_proof again on the formula at this proof node. | $\frac{1}{N}$ |
| Merge: Select a “mergeable” event; that is, three constants (c_{i}, c_{j}, c_{k}) such that arg1(c_{i}) = c_{j}, arg2(c_{i}) = c_{k}, and t(c_{i}) for some constant t are axioms, and there also exist constants (c_{i′}, c_{j′}, c_{k′}) such that i′ > i, arg1(c_{i′}) = c_{j′}, arg2(c_{i′}) = c_{k′}, and t(c_{i′}) are axioms. Next, propose to merge c_{i′} with c_{i} by replacing all instances of c_{i′} with c_{i} in the proof trees, c_{j′} with c_{j}, and c_{k′} with c_{k}. This proposal is not necessary in that these changes are reachable with other proposals, but those proposals may have low probability, and so this can help to more easily escape local maxima. | $\frac{\alpha}{N}$ |
| Split: The inverse of the above proposal. | $\frac{\beta}{N}$ |


The function init_proof in Algorithm 1 recursively
calls init_disproof. Due to space limitations, we
refer the reader to our code for this function; it closely mirrors the
structure of init_proof. The purpose of init_proof is to find *some* proof
of a given higher-order formula, or return null if none exists. Its task is
finding a satisfying abductive proof, which is easier than theorem proving,
since it can create new axioms as needed. The returned proof need not be
“optimal” because it serves as the initial state for MH, which
will further refine the proof. The validity of the proofs is guaranteed by
the fact that init_proof only returns valid proofs
and the MH proposals preserve validity.
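To make the role of the proposal selection weights concrete, the following is a minimal sketch of a Metropolis-Hastings loop with two mutually inverse proposals chosen with unequal weights, in the spirit of the merge and split proposals in Table 2. It is not PWL's implementation: the state is a single hypothetical integer `k`, `log_target` is a stand-in Poisson target rather than the posterior over theories and proofs, and the names `mh_step` and `run_chain` are illustrative assumptions. The point is only that when a proposal and its inverse are selected with different probabilities, the acceptance ratio must include the ratio of reverse to forward proposal probabilities.

```python
import math
import random

def log_target(k, lam=3.0):
    """Unnormalized log-probability of a toy state k (a Poisson with mean lam)."""
    return k * math.log(lam) - math.lgamma(k + 1)

def mh_step(k, alpha, beta, rng):
    """One MH step with two inverse proposals selected with weights alpha and beta."""
    n = alpha + beta  # normalizer N of the selection weights
    if rng.random() < alpha / n:
        # "Forward" proposal (hypothetical stand-in for, e.g., a merge).
        new_k, q_fwd, q_rev = k + 1, alpha / n, beta / n
    else:
        if k == 0:
            return k  # the inverse move is unavailable here, so the chain stays put
        new_k, q_fwd, q_rev = k - 1, beta / n, alpha / n
    # Hastings ratio: target ratio times reverse/forward proposal probabilities.
    log_accept = (log_target(new_k) - log_target(k)
                  + math.log(q_rev) - math.log(q_fwd))
    if rng.random() < math.exp(min(0.0, log_accept)):
        return new_k
    return k

def run_chain(steps, alpha=2.0, beta=1.0, seed=0):
    """Run the toy chain from k = 0 and return the visited states."""
    rng = random.Random(seed)
    k, samples = 0, []
    for _ in range(steps):
        k = mh_step(k, alpha, beta, rng)
        samples.append(k)
    return samples
```

Although the forward proposal is selected twice as often as its inverse in this toy setting, the correction term keeps the chain's stationary distribution equal to the target.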

### 3.2 Language Module

For the language module, PWM uses the probabilistic
model of Saparov et al. (2017). The
generative nature of their semantic parsing model allows it to fit seamlessly
into PWM and PWL. The logical
forms in their model are distributed according to a *semantic
prior*, which we replace with our distribution of logical forms
conditioned on the theory *p*(*π*_{i}|*T*).
Their parser is probabilistic and finds the *k*-best logical
forms that maximize *p*(*x*_{i}|*y*_{i},*T*)
for a given input sentence. Combined with our reasoning module’s ability
to compute the probability of a logical form, the parser can resolve ambiguous
interpretations of sentences by exploiting acquired knowledge. We will
demonstrate the utility of this property in resolving lexical ambiguity.

However, the semantic grammar in Saparov et al. (2017) was designed for a Datalog representation of logical forms. Thus, we designed and implemented a new grammar for our more domain-general formalism in higher-order logic. Though their model induces preterminal production rules from data (e.g., N $\rightarrow$ “cat”), we must manually specify the nonterminal production rules (e.g., NP $\rightarrow$ ADJP NP). This allows us to encode prior knowledge of the English language into PWM, dramatically improving its statistical efficiency and obviating the need for massive training sets to learn English syntax. It is nonetheless tedious to design these rules while maintaining domain-generality. Once specified, however, these rules can be re-used in new tasks and domains with minimal or no changes.

We also improved their model to generalize over inflected forms of words. In the generative process, instead of generating sentence tokens directly (e.g., “I am sleeping”), PWM generates word roots with flags indicating their inflection (e.g., “I be[1st,sg] sleep[prs,ptcp]”). During parsing, this has the effect of performing morphological and semantic parsing jointly. We extracted the necessary comprehensive morphology information from Wiktionary (Wikimedia Foundation, 2020).
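The root-plus-flags representation can be illustrated with a small sketch. The lexicon below is a hand-written, hypothetical stand-in for the Wiktionary-derived morphology data, and the function names `generate` and `analyze` are assumptions made for this example, not part of PWL. Generation maps (root, flags) pairs to surface tokens, and analysis inverts that mapping so that a reader can recover candidate roots and inflections from a sentence.

```python
# Hypothetical toy lexicon: (root, inflection flags) -> surface token.
INFLECTED_FORMS = {
    ("be", ("1st", "sg")): "am",
    ("be", ("3rd", "sg")): "is",
    ("sleep", ("prs", "ptcp")): "sleeping",
    ("sleep", ("3rd", "sg")): "sleeps",
}

# Inverse index, used when reading: surface token -> list of (root, flags).
ANALYSES = {}
for (root, flags), surface in INFLECTED_FORMS.items():
    ANALYSES.setdefault(surface, []).append((root, flags))

def generate(tokens):
    """Map (root, flags) pairs to surface tokens; unlisted pairs fall back to the root."""
    return [INFLECTED_FORMS.get((root, flags), root) for root, flags in tokens]

def analyze(surface_tokens):
    """Return candidate (root, flags) analyses per token.

    Tokens missing from the toy lexicon are treated as uninflected roots.
    """
    return [ANALYSES.get(tok, [(tok, ())]) for tok in surface_tokens]
```

For instance, `generate([("I", ()), ("be", ("1st", "sg")), ("sleep", ("prs", "ptcp"))])` yields the tokens of “I am sleeping”, and `analyze` recovers the roots and flags from those tokens.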

We train this new grammar to learn the parameters that govern the conditional
distributions and the preterminal production rules. To do so, we construct a
small *seed training set* consisting of 55 labeled sentences, 47
nouns, 55 adjectives, and 20 verbs.^{3} We wrote and labeled these sentences by hand, largely
in the domain of astronomy, with the aim to cover a diverse range of English
syntactic constructions. This small training set was sufficient thanks to the
statistical efficiency of PWM.

While PWL uses the same parsing algorithm as Saparov et
al. (2017), we provide an
easier-to-understand presentation. Given an input sentence *y*_{i}, the parser aims to find
the logical form(s) *x*_{i} and
derivation trees *t*_{i} that maximize
the posterior probability *p*(*x*_{i},*t*_{i}|*y*_{i},*T*).
This discrete optimization is performed using *branch-and-bound* (Land and Doig, 1960): The algorithm
starts by considering the set of all derivation trees and partitions it into a
number of subsets (the “branch” step). For each subset *S*, the parser computes an upper bound on the log
probability of any derivation in *S* (the “bound”
step). Having computed the bound for each subset, the parser puts them into a
priority queue, prioritized by the bound. The parser then dequeues the subset
with the highest bound and repeats this process, further subdividing this set,
computing the bound for each subdivision, and adding them to the queue.
Eventually, the parser will dequeue a subset containing a single derivation
whose log probability is at least the highest priority in the queue. This
derivation is optimal. The algorithm can be continued to obtain the
top-*k* derivations/logical forms. Because this algorithm
operates over *sets* of logical forms (where each set is possibly
infinite), we implemented a data structure to sparsely represent such sets of
higher-order formulas, as well as algorithms to perform set operations, such as
intersection and subtraction.
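The branch-and-bound loop described above can be sketched on a toy problem. This is an illustration under stated assumptions, not the parser itself: instead of sets of derivation trees over higher-order formulas, each "subset" here is the set of label sequences sharing a common prefix, the score is a sum of per-slot log probabilities, and `is_valid` is a hypothetical hard constraint. The upper bound for a subset is the score of its prefix plus the best achievable score for the remaining slots, ignoring the constraint.

```python
import heapq

def branch_and_bound(slots, is_valid, k=1):
    """Return the k highest-scoring complete assignments as (log_prob, labels), best first.

    slots: list of dicts mapping a label to its log-probability.
    is_valid: predicate on a tuple of labels; invalid prefixes are pruned.
    """
    # best_rest[i]: optimistic log-probability of completing slots i, i+1, ...
    best_rest = [0.0] * (len(slots) + 1)
    for i in range(len(slots) - 1, -1, -1):
        best_rest[i] = best_rest[i + 1] + max(slots[i].values())

    # heapq is a min-heap, so bounds are negated to dequeue the highest bound first.
    queue = [(-best_rest[0], (), 0.0)]
    results = []
    while queue and len(results) < k:
        neg_bound, prefix, score = heapq.heappop(queue)
        if len(prefix) == len(slots):
            # Singleton subset: its bound is exact and no queued subset can exceed it.
            results.append((score, prefix))
            continue
        i = len(prefix)
        for label, logp in slots[i].items():  # the "branch" step
            child = prefix + (label,)
            if not is_valid(child):
                continue
            child_score = score + logp
            bound = child_score + best_rest[i + 1]  # the "bound" step
            heapq.heappush(queue, (-bound, child, child_score))
    return results
```

Continuing to dequeue after the first complete assignment yields the top-*k* results in order, mirroring how the parser obtains the *k*-best derivations.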

## 4 Experiments

### 4.1 ProofWriter

^{4} portion of the dataset, as the authors evaluated their method on this portion zero-shot, just as we do (i.e., the algorithm did not see any example from this portion during training). This portion of the data is subdivided into 6 sections, each with varying degrees of difficulty. An example from this dataset is shown in Figure 3. For each example, PWL reads the context and abduces a theory. Next, it parses the query sentence *y*_{n+1} into a logical form *x*_{n+1} and estimates its *unnormalized* probability:

$$p(x_1,\ldots,x_{n+1}) = \sum_{T,\ \pi_i \text{ a proof of } x_i} p(T) \prod_{i=1}^{n+1} p(\pi_i \mid T). \tag{11}$$

Here, *x*_{1},…,*x*_{n} are the previously read logical forms (the context). Since the quantity in equation 11 is intractable to compute, PWL approximates it by sampling from the posterior *T*, *π*_{1},…,*π*_{n+1} ∣ *x*_{1},…,*x*_{n+1} and summing over distinct samples. Although this approximation seems crude, the sum is dominated by a small number of the most probable theories and proofs, and MH is an effective way to find them, as we observe in experiments. MH is run for 400 iterations, and at every 100^{th} iteration, PWL re-initializes the Markov chain by performing 20 “exploratory” MH steps (i.e., consisting of only the third and fourth proposals in Table 2 and accepting every proposal). This re-initialization is analogous to a random restart and can help to escape from local maxima. However, it may be promising to explore other approaches to compute this quantity, such as Luo et al. (2020).

Once PWL has computed this probability for the query sentence, it does the same for the negation of the sentence. These unnormalized probabilities are compared, and if they are within 2000 in log probability, PWL returns the label unknown. If the first probability is sufficiently larger than the second, PWL returns true, and otherwise, returns false.

The parameters in the prior were set by hand initially by choosing values that we thought were reasonable (e.g., the average length of a natural deduction proof for a sentence containing a simple subject noun phrase, object noun phrase, and transitive verb is around 20 steps, which is why the Poisson parameter for the proof length is set to 20). The values were tweaked as necessary by running the algorithm on toy examples during debugging. Note that the sentences “Bill is a bird” and “Bill is not a bird” can still both be true if each “Bill” refers to a distinct entity. To avoid this, we chose an extreme value of the prior parameter such that the log prior probability of a theory with two entities having the same name is 2000 less than that of a theory where the name is unique. It is for this reason that 2000 was chosen as the threshold for determining whether a query is true/false vs. unknown. This prior worked well enough in our experiments, but the goal is to have a single prior work well with any task, so further work to explore which priors work better across a wider variety of tasks is welcome. We evaluated PWL using both classical and intuitionistic logic, even though the ground truth labels in the dataset were generated using *intuitionistic logic*.
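The labeling procedure described above can be sketched as follows. This is a minimal illustration rather than PWL's code: the function names are hypothetical, each MH sample is assumed to be summarized by a hashable identifier together with the log of its joint probability (the summand of equation 11), and the treatment of a difference of exactly 2000 as unknown is an assumption, since the text does not specify the boundary case.

```python
import math

def log_sum_exp(values):
    """Numerically stable log of a sum of exponentials."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def estimate_log_prob(samples):
    """Approximate the log of the sum in equation 11 from MH samples.

    samples: iterable of (sample_id, log_joint) pairs, where sample_id identifies
    a (theory, proofs) state. Repeated visits to the same state count once.
    """
    distinct = {}
    for sample_id, log_joint in samples:
        distinct[sample_id] = log_joint
    return log_sum_exp(list(distinct.values()))

def classify(log_p_query, log_p_negation, threshold=2000.0):
    """Return 'true', 'false', or 'unknown' by comparing two log probabilities."""
    if abs(log_p_query - log_p_negation) <= threshold:
        return "unknown"
    return "true" if log_p_query > log_p_negation else "false"
```

Because the threshold matches the log-prior penalty for two entities sharing a name, a query whose negation is only explainable by duplicating a named entity falls outside the threshold and receives a definite label.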

Table 3 lists the zero-shot
accuracy of PWL, comparing with baselines based on the
T5 transformer (Raffel et al., 2020).
We emphasize here that PWL is not perfectly comparable
to the baseline, because its authors aim to demonstrate that their method can *learn* to reason. We instead aim to demonstrate that PWL’s ability to parse and reason end-to-end
generalizes to an out-of-domain question-answering task. The baseline is trained
on other portions of the ProofWriter data, whereas PWL is trained only on its seed training set. PWL performed much better using intuitionistic
logic than classical logic, as expected since the ground truth labels were
generated using intuitionistic semantics. However, most real-world reasoning
tasks would take the law of the excluded middle to be true, and classical logic
would serve as a better default. Although the task is relatively simple, it
nevertheless demonstrates the proof-of-concept and the promise of further
research.

### 4.2 FictionalGeoQA

The sentences in the ProofWriter experiment are
template-generated and have simple semantics. For an evaluation more
representative of real-world language, we introduce a new question-answering
dataset called FictionalGeoQA.^{5} To create this dataset, we took questions from GeoQuery (Zelle and Mooney, 1996), and for each question, we wrote a paragraph
context containing the information necessary to answer the question. We added
distractor sentences to make the task more robust against heuristics. Whenever
possible, the sentences in this paragraph were taken from Simple English
Wikipedia. However, some facts, such as the lengths of rivers, are not expressed
in sentences in Wikipedia (they typically appear in a table on the right side of
the page), so we wrote those sentences by hand: We took questions from GeoQuery that expressed the desired fact in interrogative form
(e.g., “What is the length of <river
name>?”) and converted them into declarative form
(e.g., “The length of <river name> is <length>.”). The resulting
dataset contains 600 examples, where 67.4% of the sentences are from
Simple English Wikipedia, and 90% of the examples contain at least one
sentence *not* from Wikipedia. We replaced all place names with
fictional ones to remove any confounding effects from pretraining. To keep the
focus of the evaluation on reasoning ability, we chose to restrict the
complexity of the language. In particular, each sentence is independent and can
be understood in isolation (e.g., no cross-sentential anaphora). The sentences *are* more complex than those in ProofWriter, having more of the complexities of
real language, such as synonymy, lexical ambiguity (e.g., what is the semantics
of “has” in “a state has city” vs “a state
has area”; or whether “largest state” refers to area or
population), and syntactic ambiguity. This increased difficulty is evident in
the results. This dataset is meant to evaluate out-of-domain generalizability,
so we do not provide a separate training set for fine-tuning. An example is
shown in Figure 4.

We compare PWL (using classical logic) with a number of
baselines: (1) UnifiedQA (Khashabi et al., 2020), a QA system based on large-scale
neural language models, (2) Boxer (Bos, 2015), a wide-coverage semantic parser,
combined with Vampire 4.5.1 (Kovács and
Voronkov, 2013), a theorem prover for
full first-order logic, (3) Boxer combined with E 2.6 (Schulz et al., 2019), another theorem prover for full first-order
logic, (4) the language module of PWL combined with Vampire, and (5) the language module of PWL combined with E. The
results are shown in Table 4,
along with a breakdown across multiple subsets of the dataset. UnifiedQA performs relatively well but fares more
poorly on questions with negation and subjective concept definitions (e.g.,
“Every river longer than 500km is major…What are the major
rivers?”). Humans are easily able to understand and utilize such
definitions, and the ability to do so is instrumental in learning about new
concepts or words in new domains. PWL is able to fare
better than UnifiedQA in examples with lexical
ambiguity, as a result of the language module’s ability to exploit
acquired knowledge to resolve ambiguities. We find that Boxer has significantly higher coverage than PWL (100% vs.
79.8%) but much lower precision. For instance, Boxer uses the semantic representation in the
Parallel Meaning Bank (Abzianidze et al., 2017), which has a simpler representation of superlatives, and is
thus unable to capture the correct semantics of superlatives in examples of this
dataset. We also find that for most examples, Boxer produces different semantics for the question vs. the context sentences,
oftentimes predicting the incorrect semantic role for the interrogative words,
which leads to the theorem provers being unable to find a proof for these extra
semantic roles. We also experimented with replacing our reasoning module with a
theorem prover and found that for almost all examples, the search of the theorem
prover would explode combinatorially. This was due to the fact that our semantic
representation relies heavily on *sets*, so a number of simple
set-theoretic axioms are required for the theorem provers, but this quickly
causes the deduction problem to become undecidable. Our reasoning module
instead performs abduction, and is able to create axioms to more quickly find an
initial proof, and then refine that proof using MH. Despite our attempt to
maximize the generalizability of the grammar in PWL,
there are a number of linguistic phenomena that we did not yet implement, such
as interrogative subordinate clauses, wh-movement, spelling or grammatical
mistakes, and so forth, and this led to the lower coverage on this dataset. Work
remains to be done to implement these missing production rules in order to
further increase the coverage of the parser.

| | all | superlative | subjective concept def. | objective concept def. | lexical ambiguity | negation | large context | arithmetic | counting | 0 subsets | 1 subset | 2 subsets | 3 subsets | 4 subsets |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| N | 600 | 210 | 150 | 170 | 180 | 102 | 100 | 20 | 30 | 85 | 213 | 187 | 85 | 30 |
| UnifiedQA | 33.8 | 29.5 | 7.3 | 33.5 | 32.8 | 14.7 | 43.0 | 10.0 | 20.0 | 41.2 | 47.9 | 27.8 | 8.2 | 23.3 |
| Boxer + E | 9.7 | 0.0 | 12.0 | 11.8 | 0.0 | 15.7 | 14.0 | 10.0 | 0.0 | 7.1 | 17.8 | 5.3 | 4.7 | 0.0 |
| Boxer + Vampire | 9.7 | 0.0 | 12.0 | 11.8 | 0.0 | 15.7 | 14.0 | 10.0 | 0.0 | 7.1 | 17.8 | 5.3 | 4.7 | 0.0 |
| PWL parser + E | 5.0 | 0.0 | 13.3 | 2.9 | 0.0 | 15.7 | 4.0 | 10.0 | 0.0 | 1.2 | 7.0 | 5.3 | 4.7 | 0.0 |
| PWL parser + Vampire | 9.0 | 0.0 | 13.3 | 11.2 | 0.0 | 15.7 | 4.0 | 10.0 | 0.0 | 12.9 | 13.6 | 5.3 | 4.7 | 0.0 |
| PWL | 43.1 | 40.5 | 33.3 | 33.5 | 34.4 | 23.5 | 45.0 | 10.0 | 0.0 | 43.5 | 62.9 | 39.0 | 17.6 | 0.0 |

Legend: **superlative** Subset of examples
that require reasoning over superlatives, e.g., “longest
river.”

**subjective concept def.** Subset with definitions of
“subjective” concepts, e.g., “Every river
longer than 500 km is major.”

**objective concept def.** Subset with definitions of
“objective” concepts, e.g., the population of a
location is the number of people living there.

**lexical ambiguity** Subset with lexical ambiguity, e.g.,
“has” means different things in “a state has a
city named” vs. “a state has an area of...”

**negation** Subset with examples that require reasoning
with classical negation (negation-as-failure is insufficient).

**large context** Subset of examples where there are at
least 100 sentences in the context.

**arithmetic** Subset with examples that require simple
arithmetic.

**counting** Subset with examples that require
counting.

***n* subset(s)** Examples that belong to exactly *n* of the above subsets (no example is a member
of more than 4 subsets).

## 5 Conclusions and Future Work

We introduced PWM, a fully symbolic Bayesian model of semantic parsing and reasoning, which we hope serves as a compelling first step in a research program toward more domain- and task-general NLU. We derived PWL, an efficient inference algorithm that reads sentences by parsing and abducing updates to its latent world model that capture the semantics of those sentences, and empirically demonstrated its ability to generalize to two out-of-domain question-answering tasks. To do so, we created a new question-answering dataset, FictionalGeoQA, designed specifically to evaluate reasoning ability while capturing more of the complexities of real language and being robust against heuristic strategies. PWL is able to read and understand sentences with richer semantics, such as definitions of new concepts. In contrast with past deductive reasoning approaches, PWL performs abduction, which is computationally easier. The highly underspecified nature of the problem of abduction is alleviated by the probabilistic nature of PWL, as it gives a principled way to find the most probable theories. We present an inference strategy where Metropolis-Hastings (MH) is performed on each sentence, in sequence, where the previous sample of the theory and proofs provides a warm-start for inference of the next sentence, reducing the number of MH iterations.

There are many avenues for future work: A simple prior was used for proofs *p*(*π*_{i}|*T*),
and an alternative is to use a compositional exchangeable prior such as adaptor
grammars (Johnson et al., 2006).

The first MH proposal in Table 2 is
simple but restrictive: The antecedent conjuncts and the consequent are restricted
to be atomic. MH would be able to explore a much larger and semantically richer set
of theories if the antecedent or consequent could contain more complex formulas,
including quantified formulas. In addition, the inference algorithm sometimes
becomes stuck in local maxima. One way to improve the efficiency of inference is to
add a new MH proposal that specifically proposes to split or merge types. For
example, if the theory has the axioms cat(*c*_{1}) and
dog(*c*_{1}), this proposal would split *c*_{1} into two concepts:
cat(*c*_{1}) and dog(*c*_{2}).
This kind of type-based Markov chain Monte Carlo is similar in principle to Liang et
al. (2010).

As mentioned earlier, a model of context is necessary in the language module to
properly handle cross-sentential anaphora and conversational contexts. Real language
very rarely consists of sentences that are independent of context. There are also
many research questions on the issue of *scalability*. Although PWL is able to scale to examples in FictionalGeoQA with more than 100 sentences, there are
two main bottlenecks currently preventing it from scaling to significantly larger
theories: (1) the maintenance of global consistency, and (2) the unfocused nature of
the current MH proposals. When checking for consistency of a new axiom, rather than
considering all other axioms/sets in the theory, it would be preferable to only
consider the portion of the theory relevant to the new axiom. Additionally, the
current MH proposals do not take into account the goal of reasoning. For example, if
the current task is to answer a question about geography, then MH proposals for
proofs unrelated to geography are wasteful, and would increase the number of MH
steps needed. A more clever goal-aware approach for selecting proofs to mutate would
help to alleviate this problem and improve scalability. PWM provides a path to incorporate information from additional modalities in a principled
fashion: for example, by adding a generative model of images, which would serve as a
separate “vision module.” In addition, even though PWL is fully symbolic, non-symbolic methods could be
used for expressive prior/proposal distributions or approximate inference. There are
many fascinating research paths to pursue from here.

## Acknowledgments

We thank the anonymous reviewers and the Action Editor for their invaluable feedback. We also thank Peter Clark, William W. Cohen, Rik van Noord, Johan Bos, and Emmanouil A. Platanios for their insightful and helpful discussion. This research was funded in part by the Air Force Office of Scientific Research under AFOSR grant FA95502010118.

## Notes

^{1}

Some deduction rules require additional parameters, and we refer the reader to our code for details on how these parameters are sampled.

^{2}

swap randomly selects an element in its input list to
swap with the first element. The probability of moving an element *c* to the front of the list is computed as follows:
Recursively inspect the atoms in the formula *f*(*c*) and count the number of
“matching” atoms: An atom *t*(*c*) or *c*(*t*) is considered
“matching” if it is provable in *T*. Next,
count the number of “mismatching” axioms: for each atom *t*(*c*) in the formula *f*(*c*), an axiom *t′*(*c*) is
“mismatching” if *t*≠*t′*. Similarly, for
each atom *c*(*t*) in the formula *f*(*c*), an axiom *c*(*t′*) is
“mismatching” if *t*≠*t′*. Let *n* be the number of “matching” atoms and *m* be the number of “mismatching” axioms;
then the probability of moving *c* to the front of the list
is proportional to $\exp\{n - 2m\}$.
This greatly increases the chance of finding a high-probability proof in the
first iteration of the loop on line 31, and since this function is also used
in an MH proposal, it dramatically improves the acceptance rate. This
reduces the number of MH iterations needed to sufficiently mix the Markov
chain.

^{3}

Also disproofs of conjunctions, if using classical logic.

^{3}

The grammar, morphology data, code, and seed training set are available in our GitHub repository.

^{4}

The dataset comes in two flavors: one that makes the closed-world assumption, and one that does not.

^{5}

Available at github.com/asaparov/fictionalgeoqa.

## References

## Author notes

Action Editor: Hinrich Schütze