Towards General Natural Language Understanding with Probabilistic Worldbuilding

We introduce the Probabilistic Worldbuilding Model (PWM), a new fully symbolic Bayesian model of semantic parsing and reasoning, as a first step in a research program toward more domain- and task-general NLU and AI. Humans create internal mental models of their observations that greatly aid in their ability to understand and reason about a large variety of problems. In PWM, the meanings of sentences, acquired facts about the world, and intermediate steps in reasoning are all expressed in a human-readable formal language, with the design goal of interpretability. PWM is Bayesian, designed specifically to be able to generalize to new domains and new tasks. We derive and implement an inference algorithm that reads sentences by parsing and abducing updates to its latent world model that capture the semantics of those sentences, and evaluate it on two out-of-domain question-answering datasets: (1) ProofWriter and (2) a new dataset we call FictionalGeoQA, designed to be more representative of real language but still simple enough to focus on evaluating reasoning ability, while being robust against heuristics. Our method outperforms baselines on both, thereby demonstrating its value as a proof-of-concept.


Introduction
Despite recent progress in AI and NLP producing algorithms that perform well on a number of NLP tasks, it is still unclear how to move forward and develop algorithms that understand language as well as humans do. In particular, large-scale language models such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), GPT-3 (Brown et al., 2020), XLNet (Yang et al., 2019), and others were trained on a very large amount of text and can then be applied to perform many different NLP tasks after some fine-tuning. In the case of GPT-3, some tasks require very few additional training examples to achieve state-of-the-art performance.
As a result of training on text from virtually every domain, these models are domain-general. This is in contrast with NLP algorithms that are largely trained on one or a small handful of domains, and as such, are not able to perform well on new domains outside of their training. Despite this focus on domain-generality, there are still a large number of tasks on which these large-scale language models perform poorly (Dunietz et al., 2020). Many limitations of today's state-of-the-art methods become evident when comparing with the human ability to understand language (Lake et al., 2016; Tamari et al., 2020; Bender and Koller, 2020; Gardner et al., 2019; Linzen, 2020). Many cognitive scientists posit that humans create rich mental models of the world from their observations which provide superior explainability, reasoning, and generalizability to new domains and tasks. How do we, as a field, move from today's state-of-the-art to more general intelligence? What are the next steps to develop algorithms that can generalize to new tasks at the same level as humans? The lack of interpretability in many of these models makes these questions impossible to answer precisely. One promising direction is to change the evaluation metric: Brown et al. (2020), Linzen (2020), and many others have suggested zero-shot or few-shot accuracy to measure the performance of algorithms (i.e., the algorithm is evaluated with a new dataset, wholly separate from its training; or in the case of few-shot learning, save for a few examples). While this shift is welcome, it alone will not solve the above issues.
We introduce the Probabilistic Worldbuilding Model (PWM), a probabilistic generative model of reasoning and semantic parsing. Like some past approaches, PWM explicitly builds an internal mental model, which we call the theory (Tamari et al., 2020; Hogan et al., 2021; Mitchell et al., 2018; Charniak and Goldman, 1993). The theory constitutes what the algorithm believes to be true. PWM is fully symbolic and Bayesian, using a single unified human-readable formal language to represent all meaning, and is therefore inherently interpretable. This is in contrast to systems that use subsymbolic representations of meaning for some or all of their components. Every random variable in PWM is well-defined with respect to other random variables and/or grounded primitives. Prior knowledge such as the rules of deductive inference, the structure of English grammar, and knowledge of basic physics and mathematics can be incorporated by modifying the prior distributions of the random variables in PWM. Incorporating prior knowledge can greatly reduce the amount of training data required to achieve sufficient generalizability, as we will demonstrate. Extensibility is key to future research that could enable more general NLU and AI, as it provides a clearer path forward for future exploration.

Figure 1: The generative process and inference in our model, with an example of a theory, generating a proof of a logical form which itself generates the sentence ''Bob is a mammal.'' During inference, only the sentences are observed, whereas the theory and proofs are latent. Given sentence y i , the language module outputs the logical form. The reasoning module then infers the proof π i of the logical form and updates the posterior of the theory T .
We present an implementation of inference under the proposed model, called Probabilistic Worldbuilding from Language (PWL). While PWM is an abstract mathematical description of the underlying distribution of axioms, proofs, logical forms, and sentences, PWL is the algorithm that reads sentences, computes logical form representations of their meaning, and updates the axioms and proofs in the theory accordingly. See Figure 1 for a high-level schematic diagram of PWM and PWL. PWM describes the process depicted by the red arrows, whereas PWL is the algorithm depicted by the green arrows. We emphasize that the reasoning module in PWL is not a theorem prover and its reasoning is not purely deductive. Instead, PWL solves the different problem of finding satisfying abductive proofs, which is computationally easier than deductive inference: Given a set of observations, work backwards to find a set of axioms that deductively explain the observations. It is these abduced axioms that constitute the internal ''mental model.'' Humans often rely on abductive reasoning, for example in commonsense reasoning (Bhagavatula et al., 2020; Furbach et al., 2015).
A core principle of our approach is to ensure generality by design. Simplifying assumptions often trade away generality for tractability, such as by restricting the representation of the meanings of sentences, or the number of steps during reasoning. PWM is designed to be domain- and task-general, and to this end, uses higher-order logic (i.e., lambda calculus) (Church, 1940) as the formal language, which we believe is sufficiently expressive to capture the meaning of declarative and interrogative sentences in natural language. Furthermore, PWM uses natural deduction for reasoning, which is complete: If a logical form φ is true, there is a proof of φ (Henkin, 1950).
In Section 3, we describe PWM and PWL more precisely. In Section 4, as a proof-of-concept of the value of further research, we run experiments on two question-answering datasets: ProofWriter (Tafjord et al., 2021) and a new dataset, called FictionalGeoQA, which we specifically created to evaluate the ability to reason over short paragraphs while being robust against simpler heuristic strategies. Unlike ProofWriter, the text in FictionalGeoQA was not template-generated and is more realistic, but is still simple enough to focus the evaluation on reasoning rather than parsing, with many sentences having semantics that go beyond the Horn clause fragment of first-order logic. PWL outperforms the baselines with respect to zero-shot accuracy (i.e., without looking at any training examples). Our code and data are freely available at github.com/asaparov/PWL and github.com/asaparov/fictionalgeoqa.
In summary, the primary contributions of this paper are the following:
• PWM, a new model for more general NLU, and PWL, the implementation that reads sentences, computes their logical forms, and updates its theory accordingly.
• Introducing FictionalGeoQA, a new question-answering dataset designed to evaluate the ability to reason over language.
• Experiments on ProofWriter and FictionalGeoQA demonstrating that PWL outperforms baseline methods on question-answering.

Related Work
Fully symbolic methods were commonplace in earlier AI research (Newell and Simon, 1976; Dreyfus, 1985). However, they were oftentimes brittle: A new observation would contradict the internal theory or violate an assumption, and it was not clear how to resolve the impasse in a principled manner and proceed. But they do have some key advantages: Symbolic approaches that use well-studied human-readable formal languages such as first-order logic, higher-order logic, and type theory enable humans to readily inspect and understand the internal processing of these algorithms, effecting a high degree of interpretability (Dowty, 1981; Gregory, 2015; Cooper et al., 2015). Symbolic systems can be made general by design, by using a sufficiently expressive formal language and ontology. Hybrid methods have been explored to alleviate the brittleness of formal systems while engendering their strengths, such as interpretability and generalizability; for example, the recent work in neuro-symbolic methods (Yi et al., 2020; Saha et al., 2020; Tafjord et al., 2021). Neural theorem provers are in this vein (Rocktäschel and Riedel, 2017). However, the proofs considered in these approaches are based on backward chaining (Russell and Norvig, 2010), which restricts the semantics to the Horn clause fragment of first-order logic. Sun et al. (2020), Ren et al. (2020), and Arakelyan et al. (2021) extend coverage to the existential positive fragment of first-order logic. In natural language, sentences express more complex semantics, such as negation, nested universal quantification, and higher-order structures. Our work explores the other side of the tradeoff between tractability and expressivity/generality. Theorem provers attempt to solve the problem of deduction: finding a proof of a given formula, given a set of axioms.
In contrast, the reasoning component of PWM is abductive, and the abduced axioms can be used in downstream tasks, such as question-answering, and to better read new sentences in the context of the world model, as we will demonstrate. We posit that abduction is sufficient for more general NLU (Hobbs, 2006; Hobbs et al., 1993). PWM combines Bayesian statistical machine learning with symbolic representations in order to handle uncertainty in a principled manner, ''smoothing out'' or ''softening'' the rigidity of a purely symbolic approach. In PWM, the internal theory is a random variable, so if a new observation is inconsistent with the theory, there may be other theories in the probability space that are consistent with the observation. The probabilistic approach provides a principled way to resolve these impasses. PWM is certainly not the first to combine symbolic and probabilistic methods. There is a rich history of inductive logic programming (ILP) (Muggleton, 1991; Cropper and Morel, 2021) and probabilistic ILP languages (Muggleton, 1996; Cussens, 2001; Sato et al., 2005; Bellodi and Riguzzi, 2015). These languages could be used to learn a ''theory'' from a collection of observations, but they are typically restricted to learning rules in the form of first-order Horn clauses, for tractability. In natural language, it is easy to express semantics beyond the Horn clause fragment of first-order logic.
Knowledge bases (KBs) and cognitive architectures (Kotseruba and Tsotsos, 2020; Hogan et al., 2021; Laird et al., 1987; Mitchell et al., 2018) have attempted to explicitly model domain-general knowledge in a form amenable to reasoning. Cognitive architectures aim to more closely replicate human cognition. Some approaches use probabilistic methods to handle uncertainty (Niepert et al., 2012; Niepert and Domingos, 2015; Jain et al., 2019). However, many of these approaches make strong simplifying assumptions that restrict the expressive power of the formal language that expresses facts in the KB. For example, many KBs can be characterized as graphs, where each entity corresponds to a vertex and every fact corresponds to a labeled edge. For instance, the belief plays_sport(s_williams, tennis) is representable as a directed edge connecting the vertex s_williams to the vertex tennis, with the edge label plays_sport. While this assumption greatly aids tractability and scalability, allowing many problems in reasoning to be solved by graph algorithms, it greatly hinders expressivity and generality, and there are many kinds of knowledge that simply cannot be expressed and represented in such KBs. PWM does not make such restrictions on logical forms in the theory, allowing for richer semantics, such as definitions, universally quantified statements, conditionals, etc.

Model
In this section, we provide a mathematical description of PWM. At a high level, the process for generating a sentence sampled from this probability distribution is:
1. Sample the theory T from a prior distribution p(T ). T is a collection of logical forms in higher-order logic that represent what PWL believes to be true.
2. For each observation i, sample a proof π i from p(π i | T ). The conclusion of the proof is the logical form x i , which represents the meaning of the i th sentence.
3. Sample the i th sentence y i from p(y i | x i ).
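The three-step generative process above can be sketched as a toy Python program. This is only an illustration under placeholder distributions: the axiom strings, function names, and canned verbalization are all hypothetical, whereas the real model samples the theory from a Dirichlet process prior over higher-order-logic formulas.

```python
import random

def sample_theory():
    # Placeholder prior p(T): a fixed toy set of axioms.
    return ["mammal(bob)", "forall x (mammal(x) -> animal(x))"]

def sample_proof(theory):
    # Placeholder p(pi_i | T): trivially "prove" a randomly chosen axiom.
    axiom = random.choice(theory)
    return {"steps": [("Ax", axiom)], "conclusion": axiom}

def sample_sentence(logical_form):
    # Placeholder p(y_i | x_i): a canned verbalization of the logical form.
    return f"It holds that {logical_form}."

random.seed(0)
theory = sample_theory()
proof = sample_proof(theory)          # the proof's conclusion is the logical form x_i
sentence = sample_sentence(proof["conclusion"])
```

During inference only `sentence` would be observed; `theory` and `proof` are the latent variables PWL must recover.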
Inference is effectively the inverse of this process, and is implemented by PWL. During inference, PWL is given a collection of observed sentences y 1 , . . . , y n and the goal is to discern the value of the latent variables: the logical form of each sentence x = {x 1 , . . . , x n }, the proofs of each logical form π = {π 1 , . . . , π n }, and the underlying theory T . Both the generative process and inference algorithm naturally divide into two modules:
• Language module: During inference, this module's purpose is to infer the logical form of each observed sentence. That is, given the input sentence y i , this module outputs the k most-probable values of the logical form x i (i.e., semantic parsing).
• Reasoning module: During inference, this module's purpose is to infer the underlying theory that logically entails the observed logical forms (and their proofs thereof). That is, given an input collection of logical forms x, this module outputs the posterior distribution of the underlying theory T and the proofs π of those logical forms.
Note that the y i need not necessarily be sentences, and PWM can easily be generalized to other kinds of data. For example, if a generative model of images is available for p(y i | x i ), then an equivalent ''vision module'' may be defined. This module may be used either in place of, or together with, the language module. In the above generative process, PWM assumes each sentence to be independent. A model of context is required to properly handle inter-sentential anaphora or conversational settings. This can be done by allowing the distribution of y i to depend on previous logical forms or sentences (i.e., by relaxing the i.i.d. assumption). For simplicity of this proof-of-concept, this is left to future work. There is a vast design space for symbolic representations of meaning. We are unable to comprehensively list all of our design choices, but we describe two important ones below.
Neo-Davidsonian semantics (Parsons, 1990) is used to represent meaning in all logical forms (both in the theory and during semantic parsing). As a concrete example, a straightforward way to represent the meaning of ''Jason traveled to New York'' could be with the logical form travel(jason, nyc). In neo-Davidsonian semantics, this would instead be represented with three distinct atoms: travel(c 1 ), arg1(c 1 ) = jason, and arg2(c 1 ) = nyc. Here, c 1 is a constant that represents the ''traveling event,'' whose first argument is the constant representing Jason, and whose second argument is the constant representing New York City. This representation allows the event to be more readily modified by other logical expressions, such as in ''Jason quickly traveled to NYC before nightfall.'' In addition, PWM defers named entity linking to the reasoning module (it is not done during parsing). That is, the semantic parser does not parse ''Jason'' directly into the constant jason. Rather, named entities are parsed into existentially quantified expressions, for example, ∃j(name(j) = ''Jason'' ∧ . . .). This simplifies the parser's task and allows reasoning to aid in entity linking. Table 1 details these design options.
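The neo-Davidsonian encoding above can be illustrated with a small sketch. The tuple encoding and the helper below are hypothetical (PWL's actual data structures differ); they show how the event constant c 1 ties the atoms together and how the named entity stays unlinked during parsing.

```python
# Toy encoding of "Jason traveled to New York" in neo-Davidsonian form.
event_atoms = [
    ("travel", "c1"),         # travel(c1): c1 is the traveling event
    ("arg1", "c1", "j"),      # arg1(c1) = j, the traveler
    ("arg2", "c1", "nyc"),    # arg2(c1) = nyc, the destination
    ("name", "j", "Jason"),   # name(j) = "Jason"; entity linking is deferred to reasoning
]

def atoms_about(constant, atoms):
    # All atoms whose first argument is the given constant; modifiers like
    # "quickly" or "before nightfall" would simply add more atoms over the
    # same event constant, without changing the existing ones.
    return [a for a in atoms if a[1] == constant]
```

Here `atoms_about("c1", event_atoms)` collects the three atoms describing the traveling event, which is exactly what makes the event easy to modify with further predicates.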

Generative Process for the Theory p(T)
The theory T is a collection of axioms a 1 , a 2 , . . . represented in higher-order logic. We choose a fairly simple prior p(T ) for rapid prototyping, but it is straightforward to substitute a more complex prior. Specifically, a 1 , a 2 , . . . are distributed according to a distribution G a , which is itself sampled from a Dirichlet process (DP) (Ferguson, 1973), an exchangeable non-parametric distribution:

G a ∼ DP(H a , α),    a 1 , a 2 , . . . ∼ G a i.i.d.,

where H a is the base distribution and α = 0.1 is the concentration parameter. An equivalent perspective of the DP that better illustrates how the samples are generated is the Chinese restaurant process (Aldous, 1985): The i th sample is drawn from H a with probability proportional to α, or it is set to a previous sample with probability proportional to the number of times that sample has been drawn. The base distribution H a recursively generates logical forms in higher-order logic. Because any formula can be written as a tree, formulas can be generated top-down, starting from the root. The type of each node (conjunction ∧, disjunction ∨, negation ¬, quantification ∀x, etc.) is sampled from a categorical distribution. If the type of the node is selected to be an atom (e.g., book(c 1 )), then its predicate is sampled from a non-parametric distribution of predicate symbols H p . The atom's argument(s) are each sampled as follows: If n V is the number of available variables (from earlier generated quantifiers), then each variable is sampled with probability 1/(n V + 1); otherwise, with probability 1/(n V + 1), a constant is sampled from a non-parametric distribution of constant symbols H c . For brevity, we refer the reader to our code for the specific forms of H p and H c .
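The Chinese restaurant process view of the DP can be sketched in a few lines. This is a minimal illustration, not PWL's implementation: `base_draw` stands in for the base distribution H a , which in PWM recursively generates logical forms rather than opaque strings.

```python
import random

def crp_sample(samples, alpha, base_draw):
    """One draw from a Dirichlet process via the Chinese restaurant process:
    with probability alpha / (alpha + len(samples)) draw fresh from the base
    distribution; otherwise repeat a past sample, with probability
    proportional to how often that sample has been drawn before."""
    total = alpha + len(samples)
    if random.random() < alpha / total:
        return base_draw()
    return random.choice(samples)  # uniform over past draws = size-proportional

random.seed(1)
fresh = iter(f"axiom_{i}" for i in range(100))  # placeholder for H_a
axioms = []
for _ in range(10):
    axioms.append(crp_sample(axioms, alpha=0.1, base_draw=lambda: next(fresh)))
```

With a small concentration parameter such as α = 0.1, the sampled axiom sequence is dominated by repeats of a few distinct axioms, which is one source of the prior's Occam's-razor behavior.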
Since PWM uses a neo-Davidsonian representation, another node type that H a can generate is an event argument (e.g., arg1(c 1 ) = jason). When this is selected, the event constant (c 1 in the example) is sampled in the same way an atom's argument is sampled, as described above: first by trying to sample a variable, and otherwise sampling a constant from H c . The right side of the equality (jason in the example) can either be a variable, constant, string, or number, so PWM first selects its type from a categorical distribution. If the type is chosen to be a number, string, or variable, its value is sampled uniformly. If the type is chosen to be a constant, it is sampled from H c .
Names of entities are treated specially in this prior: The number of names available to each entity c i is sampled i.i.d. from a very light-tailed distribution. This ensures that entities tend not to have too many names.
Sets are also treated specially in this prior: One kind of axiom that can be generated is one that declares the size of a set; for example, size(λx.planet(x)) = 8 denotes that the size of the set of planets is 8. In the prior, the size of each set is distributed according to a geometric distribution with parameter 10 −4 . Sets can have arity not equal to 1, in which case their elements are tuples.
Deterministic Constraints: We also impose hard constraints on the theory T . Most importantly, T is required to be globally consistent. While this is a conceptually simple requirement, it is computationally expensive (consistency is undecidable in general, even in first-order logic). PWL enforces this constraint by keeping track of the known sets in the theory (i.e., a set is known if its set size axiom is used in a proof, or if the set appears as a subset/superset in a universally-quantified axiom). For each set, PWL computes which elements are provably members of that set. If the number of provable members of a set is greater than its size, or if an element is both provably a member and not a member of a set, the theory is inconsistent. Relaxing this constraint would be valuable in future research, perhaps by only considering the relevant sets rather than all sets in the theory, or by deferring consistency checks altogether. We place a handful of other constraints on the theory T : The name of an entity must be a string (and not a number or a constant). All constants are distinct; that is, c i ≠ c j for all i ≠ j. This helps to alleviate identifiability issues, as otherwise, there would be a much larger number of logically equivalent theories. No event can be an argument of itself (e.g., there is no constant c i such that arg1(c i ) = c i ). If a theory T satisfies all constraints, we write ''T valid.'' These constraints do slightly complicate computation of the prior, since the generative process for T is conditioned on T being valid:

p(T | T valid) = p(T ) 1{T valid} / Σ T ′ p(T ′ ) 1{T ′ valid},     (5)

and the denominator is intractable to compute. However, we show in Section 3.1.3 that for inference, it suffices to be able to efficiently compute the ratio of prior probabilities,

p(T 1 | T 1 valid) / p(T 2 | T 2 valid) = p(T 1 ) / p(T 2 ),

since the intractable denominator cancels. Additionally, note that because the above constraints do not depend on the order of the axioms, constants, and so forth (i.e., the constraints themselves are exchangeable), the distribution of T conditioned on T being valid is exchangeable.
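The set-based consistency check described above can be sketched as follows. The function and argument names are hypothetical; in PWL, the provable members and non-members are derived from the proofs rather than passed in directly.

```python
def check_consistency(set_sizes, provable_members, provable_nonmembers):
    """Reject a theory if some set has more provable members than its
    declared size, or if an element is provably both a member and a
    non-member of the same set."""
    for s, size in set_sizes.items():
        members = provable_members.get(s, set())
        if len(members) > size:
            return False  # more provable members than the set-size axiom allows
        if members & provable_nonmembers.get(s, set()):
            return False  # some element is provably in and out of the set
    return True

# size(planet) = 8 with nine provable planets is inconsistent:
ok = check_consistency({"planet": 8}, {"planet": {f"p{i}" for i in range(9)}}, {})
```

This mirrors the two failure conditions in the text; anything not caught by these checks is treated as consistent, which is why the text notes that relaxing or strengthening this check is an avenue for future work.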

Properties of the Prior p(T ):
We emphasize that these distributions were chosen for simplicity and ease of implementation, and they worked well enough in experiments. However, there are likely many distributions that would work just as well. The parameters in the above distributions are not learned; they were set and fixed a priori. Nevertheless, this prior does exhibit useful properties for a domain- and task-general model of reasoning:
• Occam's razor: Smaller/simpler theories are given higher probability than larger and more complex theories.
• Consistency: Inconsistent theories are discouraged or impossible.
• Entities tend to have a unique name. Our prior above encodes one direction of this prior belief: Each entity is unlikely to have many names. However, the prior does not discourage one name from referring to multiple entities.
• Entities tend to have a unique type. Note, however, that this does not discourage types provable by subsumption. For example, if the theory has the axioms novel(c 1 ) and ∀x(novel(x) → book(x)), even though book(c 1 ) is provable, it is not an axiom in this example and the prior only applies to axioms.

Generative Process for Proofs p(π i | T )
PWM uses natural deduction, a well-studied proof calculus, for the proofs (Gentzen, 1935, 1969). Pfenning (2004) provides an accessible introduction. Figure 2 illustrates a simple example of a natural deduction proof. Each horizontal line is a deduction step, with the (zero or more) formulas above the line being its premises, and the one formula below the line being its conclusion. Each deduction step has a label to the right of the line. For example, the ''∧I'' step denotes conjunction introduction: given that A and B are true, this step concludes that A ∧ B is true, where A and B can be any formula. A natural deduction proof can rely on axioms (denoted by ''Ax''). We can write any natural deduction proof π i as a sequence of deduction steps π i = (π i,1 , . . . , π i,k ) by traversing the proof tree in prefix order. We define a simple generative process for π i :
1. First sample the length of the proof k from a Poisson distribution with parameter 20.
2. For each j = 1, . . . , k: Select a deduction rule from the proof calculus with a categorical distribution. If the Ax rule is selected, then simply take the next available axiom from the theory T = (a 1 , a 2 , . . .). If the deduction rule requires premises, then each premise is selected uniformly at random from π i,1 , . . . , π i,j−1 .
The above generative process may produce a forest rather than a single proof tree. Thus, π i is sampled conditioned on π i being a valid proof. Just as with p(T ) in equation 5, this conditioning causes p(π i | T ) to be intractable to compute. However, only the ratio of prior probabilities is needed for inference, and this ratio can be computed efficiently. Although PWL was initially implemented assuming classical logic, it is easy to adapt PWL to use other logics, such as intuitionistic logic. Intuitionistic logic is identical to classical logic except that the law of the excluded middle A ∨ ¬A is not a theorem (see Figure 3 for an example where the two logics disagree). The interpretable nature of the reasoning module makes it easy to adapt it to other kinds of logic or proof calculi. PWL supports both classical and intuitionistic logic.

Figure 3: An example from the Electricity1 section in the ProofWriter dataset. Its label is unknown. Under classical logic, the query is provably true from the information in the 1st, 3rd, and 4th sentences.
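The proof prior above can be sketched as a toy sampler. The rule set and tuple encoding are made up for illustration; as the text notes, the raw process can yield forests or dangling premises, which the real model conditions away by requiring a valid proof.

```python
import math
import random

def sample_poisson(lam):
    # Knuth's method for Poisson sampling; adequate for small rates like 20.
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= threshold:
            return k
        k += 1

def sample_proof_steps(theory, rules=("Ax", "AndI", "OrI")):
    # Sample a proof length k ~ Poisson(20), then one deduction rule per
    # step: "Ax" consumes the next available axiom, while other rules pick
    # each premise uniformly among earlier steps.
    k = sample_poisson(20)
    axioms = iter(theory)
    steps = []
    for j in range(k):
        rule = random.choice(rules)
        if rule == "Ax":
            steps.append(("Ax", next(axioms, None)))
        else:
            premises = [random.randrange(j)] if j > 0 else []
            steps.append((rule, premises))
    return steps

random.seed(2)
steps = sample_proof_steps(["a1", "a2", "a3"])
```

Because premises only ever point at earlier steps, the sampled sequence always corresponds to a prefix-order traversal, matching the encoding of π i as (π i,1 , . . . , π i,k ).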

Inference
Having described the generative process for the theory T and proofs π, we now describe inference. Given logical forms x, the goal is to compute the posterior distribution of T and π such that the conclusion of each proof π i is x i . That is, PWL aims to recover the latent theory and proofs that explain/entail the given observed logical forms. To this end, PWL uses Metropolis-Hastings (MH) (Hastings, 1970; Robert and Casella, 2004). PWL performs inference in a streaming fashion, starting with the case n = 1 to obtain MH samples from p(π 1 , T | x 1 ). Then, for every new logical form x n , PWL uses the last sample from p(π 1 , . . . , π n−1 , T | x 1 , . . . , x n−1 ) as a starting point and then obtains MH samples from p(π 1 , . . . , π n , T | x 1 , . . . , x n ). This warm-start initialization serves to dramatically reduce the number of iterations needed to mix the Markov chain. To obtain the MH samples, the proof of each new logical form π (0) n is initialized using Algorithm 1, whereas the proofs of previous logical forms are kept from the last MH sample. The axioms in these proofs constitute the theory sample T (0) . Then, for each iteration t = 1, . . . , N iter , MH proposes a mutation to one or more proofs in π (t) . The possible mutations are listed in Table 2. This may change axioms in T (t) . Let T ′ , π ′ be the newly proposed theory and proofs. Then, compute the acceptance probability

min{1, [p(T ′ , π ′ | x) g(T (t) , π (t) | T ′ , π ′ )] / [p(T (t) , π (t) | x) g(T ′ , π ′ | T (t) , π (t) )]},

where g(T ′ , π ′ | T (t) , π (t) ) is the probability of proposing the mutation from T (t) , π (t) to T ′ , π ′ , and g(T (t) , π (t) | T ′ , π ′ ) is the probability of the inverse of this mutation. Because this quantity depends only on ratios of probabilities, the intractable normalizers cancel and it can be computed efficiently (see the ratio computations in the preceding sections). Once this quantity is computed, sample from a Bernoulli with this quantity as its parameter. If it succeeds, MH accepts the proposed theory and proofs as the next sample: T (t+1) = T ′ and π (t+1) = π ′ . Otherwise, reject the proposal and keep the old sample: T (t+1) = T (t) and π (t+1) = π (t) .
If every possible theory and proof is reachable from the initial theory by a sequence of mutations, then with sufficiently many iterations, the samples T (t) and π (t) will be distributed according to the true posterior p(T, π | x). If only a subset of possible theories and proofs is reachable from the initial theory, the MH samples will be distributed according to the true posterior conditioned on that subset. This may suffice for many applications, particularly if the theories in the subset have desirable properties such as better tractability. But the subset cannot be too small, because then PWL would lose generality.
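The accept/reject step is the textbook Metropolis-Hastings rule. The minimal sketch below (not PWL's implementation) works entirely with log-ratios, which is precisely why the intractable normalizers of p(T ) and p(π i | T ) never need to be computed.

```python
import math
import random

def mh_accept(log_p_new, log_p_old, log_g_forward, log_g_backward):
    # Acceptance probability min{1, [p(new) g(old|new)] / [p(old) g(new|old)]},
    # computed in log space so that only probability *ratios* are needed.
    log_alpha = (log_p_new + log_g_backward) - (log_p_old + log_g_forward)
    return random.random() < math.exp(min(0.0, log_alpha))
```

A proposal that strictly increases the posterior under a symmetric proposal distribution is always accepted; one that lowers it by far too much is essentially never accepted.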
The function init_proof in Algorithm 1 recursively calls init_disproof. Due to space limitations, we refer the reader to our code for this function; it closely mirrors the structure of init_proof. The purpose of init_proof is to find some proof of a given higher-order formula, or return null if none exists. Its task is finding a satisfying abductive proof, which is easier than theorem proving, since it can create new axioms as needed. The returned proof need not be ''optimal'' because it serves as the initial state for MH, which will further refine the proof. The validity of the proofs is guaranteed by the fact that init_proof only returns valid proofs and the MH proposals preserve validity. The function swap randomly selects an element in its input list to swap with the first element. The probability of moving an element c to the front of the list is computed as follows: Recursively inspect the atoms in the formula f(c) and count the number of ''matching'' atoms: An atom t(c) or c(t) is considered ''matching'' if it is provable in T . Next, count the number of ''mismatching'' axioms: For each atom t(c) in the formula f(c), an axiom t′(c) is ''mismatching'' if t ≠ t′. Similarly, for each atom c(t) in the formula f(c), an axiom c(t′) is ''mismatching'' if t ≠ t′. Let n be the number of ''matching'' atoms and m the number of ''mismatching'' axioms; then the probability of moving c to the front of the list is proportional to exp{n − 2m}. This greatly increases the chance of finding a high-probability proof in the first iteration of the loop on line 31, and since this function is also used in an MH proposal, it dramatically improves the acceptance rate. This reduces the number of MH iterations needed to sufficiently mix the Markov chain.
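The exp{n − 2m} weighting used by swap can be sketched as a small normalized distribution over candidate constants. The function name and inputs are hypothetical; in PWL, the counts n and m come from inspecting the formula f(c) against the theory, not from precomputed dictionaries.

```python
import math

def swap_weighting(constants, matches, mismatches):
    """Probability of moving each candidate constant c to the front of the
    list, proportional to exp(n - 2m), where n counts 'matching' atoms
    provable in the theory and m counts 'mismatching' axioms for c."""
    weights = [math.exp(matches[c] - 2 * mismatches[c]) for c in constants]
    total = sum(weights)
    return {c: w / total for c, w in zip(constants, weights)}

# c1 has three provable matching atoms; c2 has one mismatching axiom:
probs = swap_weighting(["c1", "c2"], {"c1": 3, "c2": 0}, {"c1": 0, "c2": 1})
```

Constants whose formula agrees with the theory dominate the distribution, which is what biases init_proof toward high-probability proofs on its first pass.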
Algorithm 1: Pseudocode for proof initialization. If any new axiom violates the deterministic constraints in Section 3.1.1, the function returns null.

Language Module
For the language module, PWM uses the probabilistic model of Saparov et al. (2017). The generative nature of their semantic parsing model allows it to fit seamlessly into PWM and PWL.

1. (probability 1/N) Select a grounded atomic axiom (e.g., square(c1)) and propose to replace it with an instantiation of a universal quantification (e.g., ∀x(rectangle(x) ∧ rhombus(x) → square(x))), where the antecedent conjuncts are selected uniformly at random from the other grounded atomic axioms for the constant c1: rectangle(c1), rhombus(c1), etc.
2. (probability 1/N) The inverse of the above proposal: Select an instantiation of a universal quantification and propose to replace it with a grounded atomic axiom.
3. (probability 1/N) Select an axiom that declares the size of a set (e.g., of the form size(us_states) = 50) and propose to change the size of the set by sampling from the prior distribution, conditioned on the maximum and minimum consistent set sizes.
4. (probability 1/N) Select a node from a proof tree of type ∨I, →I, or ∃I.³ These nodes were created in Algorithm 1 on lines 7, 16, and 30, respectively, where for each node a single premise was selected out of a number of possible premises. This proposal naturally follows from the desire to explore other selections by re-sampling the proof: it simply calls init_proof again on the formula at this proof node.
5. Merge (probability α/N): Select a ''mergeable'' event; that is, there exist constants (c_i, c_j, c_k) such that arg1(c_i) = c_j, arg2(c_i) = c_k, and t(c_i) for some constant t are axioms, and there also exist constants (c_i′, c_j′, c_k′) with i′ > i such that arg1(c_i′) = c_j′, arg2(c_i′) = c_k′, and t(c_i′) are axioms. Propose to merge c_i′ with c_i by replacing all instances of c_i′ with c_i in the proof trees, and likewise c_j′ with c_j and c_k′ with c_k. This proposal is not strictly necessary, in that these changes are reachable with other proposals, but those proposals may have low probability, so this proposal can help to escape local maxima more easily.
6. Split (probability β/N): The inverse of the above proposal.

Table 2: A list of the Metropolis-Hastings proposals implemented in PWL thus far, each shown with its probability of being selected. Here, N is a normalization term: N = |A| + |U| + |C| + |P| + α|M| + β|S|, where A is the set of grounded atomic axioms in T (e.g., square(c1)), U is the set of universally-quantified axioms that can be eliminated by the second proposal, C is the set of axioms that declare the size of a set (e.g., size(A) = 4), P is the set of proof-tree nodes of type ∨I, →I, or ∃I³ in the proofs π, M is the set of ''mergeable'' events (described above), and S is the set of ''splittable'' events. In our experiments, α = 2 and β = 0.001.
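To make the proposal-selection distribution in Table 2 concrete, the following is a minimal Python sketch (not the actual PWL implementation): the theory is reduced to counts of candidate sites for each proposal kind, and a kind is drawn with probability proportional to its weight, with merge and split events weighted by α and β respectively.

```python
import random

# Hypothetical sketch: the six proposal kinds of Table 2, with weights
# proportional to the number of candidate sites for each kind. ALPHA and
# BETA are the merge/split weights used in the experiments.
ALPHA, BETA = 2.0, 0.001

def select_proposal_kind(counts, rng=random.random):
    """Pick a proposal kind with probability proportional to its weight.

    `counts` maps each kind ('atomic', 'universal', 'set_size',
    'proof_node', 'merge', 'split') to the number of candidate sites
    in the current theory. Returns the chosen kind and the normalizer N.
    """
    weights = {
        'atomic':     counts.get('atomic', 0),      # |A|
        'universal':  counts.get('universal', 0),   # |U|
        'set_size':   counts.get('set_size', 0),    # |C|
        'proof_node': counts.get('proof_node', 0),  # |P|
        'merge':      ALPHA * counts.get('merge', 0),  # alpha * |M|
        'split':      BETA * counts.get('split', 0),   # beta * |S|
    }
    normalizer = sum(weights.values())  # this is N in Table 2
    r = rng() * normalizer
    for kind, w in weights.items():
        r -= w
        if r <= 0:
            return kind, normalizer
    return kind, normalizer  # numerical edge case: return the last kind
```

A full implementation would then enumerate the actual axioms or proof nodes of the selected kind and pick one uniformly at random.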
The logical forms in their model are distributed according to a semantic prior, which we replace with our distribution of logical forms conditioned on the theory, p(x_i | T). Their parser is probabilistic and finds the k-best logical forms that maximize p(x_i | y_i, T) for a given input sentence. Combined with our reasoning module's ability to compute the probability of a logical form, the parser can resolve ambiguous interpretations of sentences by exploiting acquired knowledge. We will demonstrate the utility of this property in resolving lexical ambiguity. However, the semantic grammar in Saparov et al. (2017) was designed for a DATALOG representation of logical forms, so we designed and implemented a new grammar for our more domain-general formalism in higher-order logic. While their model induces preterminal production rules from data (e.g., N → ''cat''), we must manually specify the nonterminal production rules (e.g., NP → ADJP NP). This allows us to encode prior knowledge of the English language into PWM, dramatically improving its statistical efficiency and obviating the need for massive training sets to learn English syntax. It is nonetheless tedious to design these rules while maintaining domain-generality. Once specified, however, these rules can be re-used in new tasks and domains with minimal or no changes. We also improved their model to generalize over inflected forms of words: in the generative process, instead of generating sentence tokens directly (e.g., ''I am sleeping''), PWM generates word roots with flags indicating their inflection (e.g., ''I be[1ST,SG] sleep[PRS,PTCP]''). During parsing, this has the effect of performing morphological and semantic parsing jointly. We extracted the necessary comprehensive morphology information from Wiktionary (Wikimedia Foundation, 2020).

³ Also disproofs of conjunctions, if using classical logic.
We train this new grammar to learn the parameters that govern the conditional distributions and the preterminal production rules. To do so, we construct a small seed training set consisting of 55 labeled sentences, 47 nouns, 55 adjectives, and 20 verbs. 4 We wrote and labeled these sentences by hand, largely in the domain of astronomy, with the aim to cover a diverse range of English syntactic constructions. This small training set was sufficient thanks to the statistical efficiency of PWM.
While PWL uses the same parsing algorithm as Saparov et al. (2017), we provide an easier-to-understand presentation here. Given an input sentence y_i, the parser aims to find the logical form(s) x_i and derivation trees t_i that maximize the posterior probability p(x_i, t_i | y_i, T). This discrete optimization is performed using branch-and-bound (Land and Doig, 1960): The algorithm starts by considering the set of all derivation trees and partitions it into a number of subsets (the ''branch'' step). For each subset S, the parser computes an upper bound on the log probability of any derivation in S (the ''bound'' step). Having computed the bound for each subset, the parser puts them into a priority queue, prioritized by the bound. The parser then dequeues the subset with the highest bound and repeats this process, further subdividing this set, computing the bound for each subdivision, and adding them to the queue. Eventually, the parser will dequeue a subset containing a single derivation whose log probability is at least the highest priority remaining in the queue; this derivation is optimal. The algorithm can be continued to obtain the top-k derivations/logical forms. Because this algorithm operates over sets of logical forms (where each set is possibly infinite), we implemented a data structure to sparsely represent such sets of higher-order formulas, as well as algorithms to perform set operations, such as intersection and subtraction.
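The branch-and-bound loop described above can be sketched generically. The following is a hypothetical, simplified Python sketch, not the paper's implementation: `branch`, `bound`, and `is_singleton` are assumed callbacks, and the sparse derivation-set data structure is abstracted away.

```python
import heapq
import itertools

def branch_and_bound(root_set, branch, bound, is_singleton, k=1):
    """Generic sketch of best-first branch-and-bound search.

    - branch(S): partition the set S of derivations into subsets
    - bound(S): upper bound on the log probability of any derivation in S;
      assumed exact when S is a singleton
    - is_singleton(S): whether S contains exactly one derivation

    Returns up to k derivations in decreasing order of log probability.
    """
    tiebreak = itertools.count()  # avoids comparing sets on equal bounds
    results = []
    # heapq is a min-heap, so bounds are negated to pop the largest first
    queue = [(-bound(root_set), next(tiebreak), root_set)]
    while queue and len(results) < k:
        neg_b, _, S = heapq.heappop(queue)
        if is_singleton(S):
            # its exact score equals its bound, which is at least every
            # bound still in the queue, so this derivation is optimal
            results.append(next(iter(S)))
            continue
        for sub in branch(S):
            heapq.heappush(queue, (-bound(sub), next(tiebreak), sub))
    return results
```

The same skeleton yields the top-k results simply by continuing to dequeue after the first singleton is found, exactly as described in the text.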

ProofWriter
To demonstrate our implementation as a proof-of-concept, we evaluate it on two question-answering tasks. The first is the ProofWriter dataset (Tafjord et al., 2021), which is itself based on the earlier RuleTaker dataset. To evaluate and demonstrate the out-of-domain language understanding and reasoning ability of PWL, we use the Birds-Electricity ''open-world''⁵ portion of the dataset, as the authors evaluated their method on this portion zero-shot, just as we do (i.e., the algorithm did not see any example from this portion during training). This portion of the data is subdivided into 6 sections, each with varying degrees of difficulty. An example from this dataset is shown in Figure 3. For each example, PWL reads the context and abduces a theory. Next, it parses the query sentence y_{n+1} into a logical form x_{n+1} and estimates its unnormalized probability p(x_{n+1} | x_1, . . . , x_n), where x_1, . . . , x_n are the previously read logical forms (the context). Since this quantity is intractable to compute, PWL approximates it by sampling from the posterior p(T, π_1, . . . , π_{n+1} | x_1, . . . , x_{n+1}) and summing over distinct samples. Although this approximation seems crude, the sum is dominated by a small number of the most probable theories and proofs, and MH is an effective way to find them, as we observe in experiments. MH is run for 400 iterations, and at every 100th iteration, PWL re-initializes the Markov chain by performing 20 ''exploratory'' MH steps (i.e., consisting of only the third and fourth proposals in Table 2, accepting every proposal). This re-initialization is analogous to a random restart and can help to escape from local maxima. However, it may be promising to explore other approaches to compute this quantity, such as that of Luo et al. (2020). Once PWL has computed this probability for the query sentence, it does the same for the negation of the sentence.
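The sampling-based approximation and the periodic re-initialization described above can be sketched as follows. This is a hypothetical simplification: the `mh_sampler` interface (`step`, `exploratory_step`, `state_key`, `log_weight`) is invented for illustration and does not correspond to PWL's actual API.

```python
import math

def approximate_log_probability(mh_sampler, n_iterations=400,
                                restart_every=100, n_exploratory=20):
    """Run MH, keep only *distinct* sampled (theory, proofs) states, and
    sum their unnormalized probabilities via log-sum-exp.

    `mh_sampler` is assumed to expose:
      - step(): one ordinary MH transition, returning the current state
      - exploratory_step(): a re-sampling proposal that is always accepted
      - state_key(state): a hashable identifier for deduplication
      - log_weight(state): the state's unnormalized log probability
    """
    distinct = {}
    for i in range(1, n_iterations + 1):
        state = mh_sampler.step()
        distinct[mh_sampler.state_key(state)] = mh_sampler.log_weight(state)
        if i % restart_every == 0:
            # analogous to a random restart: helps escape local maxima
            for _ in range(n_exploratory):
                mh_sampler.exploratory_step()
    # log-sum-exp over the distinct samples' unnormalized log probabilities
    m = max(distinct.values())
    return m + math.log(sum(math.exp(w - m) for w in distinct.values()))
```

Deduplicating by state key is what makes this a sum over distinct theories and proofs rather than a Monte Carlo average; it works because the sum is dominated by a few high-probability states.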
These unnormalized probabilities are compared, and if they are within 2000 of each other in log probability, PWL returns the label unknown. If the first probability is sufficiently larger than the second, PWL returns true, and otherwise, returns false. The parameters in the prior were initially set by hand, choosing values that we thought were reasonable (e.g., the average length of a natural deduction proof for a sentence containing a simple subject noun phrase, object noun phrase, and transitive verb is around 20 steps, which is why the Poisson parameter for the proof length is set to 20). The values were tweaked as necessary by running the algorithm on toy examples during debugging. Note that the sentences ''Bill is a bird'' and ''Bill is not a bird'' can both be true if each ''Bill'' refers to a distinct entity. To avoid this, we chose an extreme value of the prior parameter such that the log prior probability of a theory in which two entities have the same name is 2000 less than that of a theory in which the name is unique. It is for this reason that 2000 was chosen as the threshold for determining whether a query is true/false vs. unknown. This prior worked well enough in our experiments, but the goal is to have a single prior work well with any task, so further work exploring which priors work better across a wider variety of tasks is welcome. We evaluated PWL using both classical and intuitionistic logic, even though the ground truth labels in the dataset were generated using intuitionistic logic. Table 3 lists the zero-shot accuracy of PWL, compared with baselines based on the T5 transformer (Raffel et al., 2020). We emphasize that PWL is not perfectly comparable to the baselines, because they aim to demonstrate that their method can learn to reason; we instead aim to demonstrate that PWL's ability to parse and reason end-to-end generalizes to an out-of-domain question-answering task.
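The decision rule described above reduces to a simple threshold comparison. A minimal sketch, with the 2000 threshold taken from the text and the function name invented for illustration:

```python
# Minimal sketch of the true/false/unknown decision rule. The inputs are
# the unnormalized log probabilities of the query sentence and of its
# negation. The threshold of 2000 mirrors the prior penalty on duplicate
# entity names described in the text.
LOG_THRESHOLD = 2000.0

def answer(log_p_query, log_p_negation, threshold=LOG_THRESHOLD):
    """Return 'true', 'false', or 'unknown' for a query sentence."""
    if abs(log_p_query - log_p_negation) <= threshold:
        return 'unknown'
    return 'true' if log_p_query > log_p_negation else 'false'
```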
The baseline is trained on other portions of the ProofWriter data, whereas PWL is trained only on its seed training set. PWL performed much better using intuitionistic logic than classical logic, as expected, since the ground truth labels were generated using intuitionistic semantics. However, most real-world reasoning tasks would take the law of the excluded middle to be true, and classical logic would serve as a better default. Although the task is relatively simple, it nevertheless demonstrates the proof-of-concept and the promise of further research.

FictionalGeoQA
The sentences in the ProofWriter experiment are template-generated and have simple semantics. For an evaluation more representative of real-world language, we introduce a new question-answering dataset called FictionalGeoQA.⁶ To create this dataset, we took questions from GeoQuery (Zelle and Mooney, 1996), and for each question, we wrote a paragraph context containing the information necessary to answer the question. We added distractor sentences to make the task more robust against heuristics. Whenever possible, the sentences in this paragraph were taken from Simple English Wikipedia. However, some facts, such as the lengths of rivers, are not expressed in sentences on Wikipedia (they typically appear in a table on the right side of the page), so we wrote those sentences by hand: We took questions from GeoQuery that expressed the desired fact in interrogative form (e.g., ''What is the length of <river name>?'') and converted them into declarative form (e.g., ''The length of <river name> is <length>.''). The resulting dataset contains 600 examples, where 67.4% of the sentences are from Simple English Wikipedia, and 90% of the examples contain at least one sentence not from Wikipedia. We replaced all place names with fictional ones to remove any confounding effects from pretraining. To keep the focus of the evaluation on reasoning ability, we chose to restrict the complexity of the language. In particular, each sentence is independent and can be understood in isolation (e.g., there is no cross-sentential anaphora). The sentences are nevertheless more complex than those in ProofWriter, exhibiting more of the complexities of real language, such as synonymy, lexical ambiguity (e.g., the semantics of ''has'' in ''a state has a city'' vs. ''a state has an area''; or whether ''largest state'' refers to area or population), and syntactic ambiguity. This increased difficulty is evident in the results.
This dataset is meant to evaluate out-of-domain generalizability, so we do not provide a separate training set for fine-tuning. An example is shown in Figure 4.
We compare PWL (using classical logic) with a number of baselines: (1) UnifiedQA (Khashabi et al., 2020), a QA system based on large-scale neural language models, (2) Boxer (Bos, 2015), a wide-coverage semantic parser, combined with Vampire 4.5.1 (Kovács and Voronkov, 2013), a theorem prover for full first-order logic, (3) Boxer combined with E 2.6 (Schulz et al., 2019), another theorem prover for full first-order logic, (4) the language module of PWL combined with Vampire, and (5) the language module of PWL combined with E. The results are shown in Table 4, along with a breakdown across multiple subsets of the dataset. UnifiedQA performs relatively well but fares more poorly on questions with negation and subjective concept definitions (e.g., ''Every river longer than 500km is major. . . What are the major rivers?''). Humans are easily able to understand and utilize such definitions, and the ability to do so is instrumental in learning about new concepts or words in new domains. PWL is able to fare better than UnifiedQA in examples with lexical ambiguity, as a result of the language module's ability to exploit acquired knowledge to resolve ambiguities. We find that Boxer has significantly higher coverage than PWL (100% vs 79.8%) but much lower precision. For instance, Boxer uses the semantic representation in the Parallel Meaning Bank (Abzianidze et al., 2017) which has a simpler representation of superlatives, and is thus unable to capture the correct semantics of superlatives in examples of this dataset. We also find that for most examples, Boxer produces different semantics for the question vs. the context sentences, oftentimes predicting the incorrect semantic role for the interrogative words, which leads to the theorem provers being unable to find a proof for these extra semantic roles. 
We also experimented with replacing our reasoning module with a theorem prover and found that for almost all examples, the search of the theorem prover would explode combinatorially. This is because our semantic representation relies heavily on sets, so the theorem provers require a number of simple set-theoretic axioms, which quickly render the deduction problem undecidable. Our reasoning module instead performs abduction, creating axioms to find an initial proof more quickly and then refining that proof using MH. Despite our attempt to maximize the generalizability of the grammar in PWL, there are a number of linguistic phenomena that we have not yet implemented, such as interrogative subordinate clauses, wh-movement, spelling or grammatical mistakes, and so forth, which led to the lower coverage on this dataset. Work remains to implement these missing production rules in order to further increase the coverage of the parser.

Conclusions and Future Work
We introduced PWM, a fully symbolic Bayesian model of semantic parsing and reasoning, which we hope serves as a compelling first step in a research program toward more domain- and task-general NLU. We derived PWL, an efficient inference algorithm that reads sentences by parsing and abducing updates to its latent world model that capture the semantics of those sentences, and we empirically demonstrated its ability to generalize to two out-of-domain question-answering tasks. To do so, we created a new question-answering dataset, FictionalGeoQA, designed specifically to evaluate reasoning ability while capturing more of the complexities of real language and being robust against heuristic strategies. PWL is able to read and understand sentences with richer semantics, such as definitions of new concepts. In contrast with past deductive reasoning approaches, PWL performs abduction, which is computationally easier. The highly underspecified nature of abduction is alleviated by the probabilistic nature of PWL, which gives a principled way to find the most probable theories. We presented an inference strategy in which Metropolis-Hastings (MH) is performed on each sentence in sequence, where the previous sample of the theory and proofs provides a warm start for inference on the next sentence, reducing the number of MH iterations required.
There are many avenues for future work: A simple prior was used for proofs p(π i |T ), and an alternative is to use a compositional exchangeable prior such as adaptor grammars (Johnson et al., 2006).
The first MH proposal in Table 2 is simple but restrictive: The antecedent conjuncts and the consequent are restricted to be atomic. MH would be able to explore a much larger and semantically richer set of theories if the antecedent or consequent could contain more complex formulas, including quantified formulas. In addition, the inference algorithm sometimes becomes stuck in local maxima. One way to improve the efficiency of inference is to add a new MH proposal that specifically proposes to split or merge types. For example, if the theory has the axioms cat(c1) and dog(c1), this proposal would split c1 into two concepts: cat(c1) and dog(c2). This kind of type-based Markov chain Monte Carlo is similar in principle to that of Liang et al. (2010).
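The proposed type-splitting move could look like the following hypothetical sketch, in which the theory's grounded atomic axioms are represented as (predicate, constant) pairs and `fresh_constant` is assumed to be a new constant not appearing elsewhere in the theory:

```python
# Hypothetical sketch of the split proposal suggested above: a constant
# that participates in two or more type predicates is split, keeping the
# first type on the original constant and reassigning the rest to a fresh
# constant. A real implementation would also rewrite the proof trees and
# score the move with an MH acceptance test.

def split_constant(axioms, constant, fresh_constant):
    """Return new axioms where `constant`'s second and later type
    predicates are reassigned to `fresh_constant`."""
    types = [p for (p, c) in axioms if c == constant]
    if len(types) < 2:
        return axioms  # nothing to split
    keep = {types[0]}
    new_axioms = []
    for (p, c) in axioms:
        if c == constant and p not in keep:
            new_axioms.append((p, fresh_constant))
        else:
            new_axioms.append((p, c))
    return new_axioms
```

For instance, splitting c1 in a theory containing cat(c1) and dog(c1) leaves cat on c1 and moves dog to the fresh constant, exactly as in the example above.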
As mentioned earlier, a model of context is necessary in the language module to properly handle cross-sentential anaphora and conversational contexts; real language very rarely consists of sentences that are independent of context. There are also many research questions on the issue of scalability. Although PWL is able to scale to examples in FictionalGeoQA with more than 100 sentences, there are two main bottlenecks currently preventing it from scaling to significantly larger theories: (1) the maintenance of global consistency, and (2) the unfocused nature of the current MH proposals. When checking the consistency of a new axiom, rather than considering all other axioms/sets in the theory, it would be preferable to consider only the portion of the theory relevant to the new axiom. Additionally, the current MH proposals do not take into account the goal of reasoning. For example, if the current task is to answer a question about geography, then MH proposals for proofs unrelated to geography are wasteful and would increase the number of MH steps needed. A more clever goal-aware approach for selecting proofs to mutate would help to alleviate this problem and improve scalability. PWM also provides a path to incorporate information from additional modalities in a principled fashion: for example, by adding a generative model of images, which would serve as a separate ''vision module.'' In addition, even though PWL is fully symbolic, non-symbolic methods could be used for expressive prior/proposal distributions or approximate inference. There are many fascinating research paths to pursue from here.