## Abstract

Language models trained on billions of tokens have recently led to unprecedented results on many NLP tasks. This success raises the question of whether, in principle, a system can ever “understand” raw text without access to some form of grounding. We formally investigate the abilities of ungrounded systems to acquire meaning. Our analysis focuses on the role of “assertions”: textual contexts that provide indirect clues about the underlying semantics. We study whether assertions enable a system to emulate representations preserving semantic relations like equivalence. We find that assertions enable semantic emulation of languages that satisfy a strong notion of semantic transparency. However, for classes of languages where the same expression can take different values in different contexts, we show that emulation can become uncomputable. Finally, we discuss differences between our formal model and natural language, exploring how our results generalize to a modal setting and other semantic relations. Together, our results suggest that assertions in code or language do not provide sufficient signal to fully emulate semantic representations. We formalize ways in which ungrounded language models appear to be fundamentally limited in their ability to “understand”.

## 1 Introduction

Recently, language models trained on huge datasets of raw text have pushed the limits of natural language processing (Devlin et al., 2019; Raffel et al., 2019; Brown et al., 2020, among others). Such systems transcend the *expert system* paradigm, where rules about language and meaning are hardcoded into a system, as well as the *supervised learning* paradigm, where a notion of meaning is provided through ground-truth labels. Rather, analysis of massive language models has revealed that, to some degree, knowledge of syntactic and semantic dependencies can emerge *without explicit supervision* (Rogers et al., 2020; Tenney et al., 2019). This knowledge can then be transferred to a variety of downstream NLP tasks.

Yet, today’s NLP systems built on large language models still fall short of human-level general understanding (Yogatama et al., 2019; Zhang et al., 2020). Brown et al. (2020) discuss the limitations of their GPT-3 language model compared with humans, suggesting that:

> Scaling up any LM-like model … may eventually run into (or could already be running into) the limits of the pretraining objective.

This possibility raises an interesting theoretical question. What are the fundamental limits of learning meaning from language modeling, even assuming a perfect learner with access to unlimited data? Recently, Bender and Koller (2020) argued that achieving true natural language understanding from text alone is impossible, and that, to really get at meaning, some type of semantic grounding is necessary.^{1} Their style of argumentation largely focused on developing thought experiments, rather than making formal arguments.

One thought experiment featuring prominently in Bender and Koller (2020) was the task of learning to understand a programming language’s semantics from raw code. Here, understanding was defined as fully emulating a compiler. This setup has clear parallels to learning to understand natural language, although the more well-defined nature of programming languages makes them easier to reason about. Bender and Koller (2020) argue that emulation is difficult in this setting, and perhaps impossible, because the source code alone contains no information about how it should be interpreted to create outputs. One counterpoint raised by the paper, as well as others (Michael, 2020; Potts, 2020), is the existence of unit tests, with *assertions* encoding examples of input/output pairs for blocks of code.^{2} For example, systematically observing blocks like x = 3; assert x == 3 could let a system bootstrap the semantics of variable assignment, because a programmer is likely to write assertions that will pass. These assertions constitute a form of implicit grounding embedded within language modeling by the pragmatic concerns of programmers, and they could potentially be leveraged to emulate a compiler.^{3} However, it is not immediately clear if unit tests provide “enough” supervision to do this, even with unlimited data.

Viewing the debate about the power of assertions as central to the larger philosophical question, we aim to clarify it in more formal terms. In this paper, we formally study whether observing a generalized notion of assertions can allow a system to “understand” strings. An assertion is a query about whether two strings evaluate to the same value within a fixed context. This is motivated by the role of assertions in unit tests, where asserting two expressions are equal suggests that they have the same value within the test.

While assertions are directly motivated by the compiler thought experiment, they also have analogs in natural language, where sentences make assertions about the world, and it is reasonable to expect some form of bias towards true statements (Potts, 2020). Indeed, this is one of Grice’s Maxims (Grice, 1975): a set of basic principles proposed to govern the pragmatics of natural language. For example, the truth conditions of *This cat is the cat that Mary owns* verify that two cats in the world identified in distinct ways are the same entity. In general, we might expect a sentence to appear with higher frequency if its truth conditions hold within its context, similar to an assertion in code, although of course there will also be other factors governing sentence frequency besides this. In this sense, the example sentence resembles the Python statement assert cat1 == cat2, where cat1 and cat2 are two Cat objects. See Section 6 for more discussion of how assertions and other formal concepts translate to natural language. We will generalize assertions to an abstract formal language context, allowing us to study how they can be used to emulate semantic relations.

Our findings are as follows. If every expression in a language has the same value in every valid context, then the language can be emulated using a finite number of assertion queries (Section 4). However, we construct a class of languages where expressions can take different values in different contexts, and where assertions do not enable emulation, i.e., infinite queries would be required (Section 5). Intuitively, this means that assertions do not provide enough signal for a Turing-complete emulator to fully “understand” languages from this class. We go on to discuss differences between our formal model and the less well-defined context of natural language (Section 6).

These results provide a formal way to characterize upper bounds on whether it is possible to emulate the semantics of a language from distributional properties of strings. Within our framework, in certain settings, we find that meaning cannot be learned from text alone. We strengthen claims made by Bender and Koller (2020) that assertions in code do not necessarily provide sufficient signal for a language model to emulate understanding. We do not make strong claims about how these results transfer to natural language, although we expect that the added complexity of natural language would make it, if anything, more difficult to “understand” than code.^{4}

## 2 Preliminaries

Let *L* ⊆ *Σ*^{★} denote a formal language over alphabet *Σ*. We will use *λ* to denote the empty string.

Let (*Σ*^{★})^{2} denote the Cartesian product of *Σ*^{★} with itself; that is, the set of all pairs of strings. Resembling Clark (2010), we refer to a tuple 〈*l*,*r*〉∈ (*Σ*^{★})^{2} as a syntactic *context*. We also use other symbols to refer to a context (e.g., *κ* = 〈*l*,*r*〉). We denote by *λ*^{2} the empty context 〈*λ*,*λ*〉.

### 2.1 Meaning

We will model formal languages not just as sets of strings, but as having an associated semantics.^{5} Specifically, we assume the existence of a *denotational semantics* over every substring of *L*, which we now elaborate on. Let *Y* be a countable set of referents. First, we will say that some *e* ∈ *Σ*^{★} is a valid *expression* within the context *κ* = 〈*l*,*r*〉 if there exists some contextual denotation $\llbracket e \mid \kappa \rrbracket_L \in Y$. Intuitively, this represents the value of *e* when it occurs in the larger context *ler* ∈ *L*. We will also use the notation $\llbracket e \mid l, r \rrbracket_L$ where convenient. We will reserve *∅* ∈ *Y* as a special null symbol, defining $\llbracket e \mid \kappa \rrbracket_L = \emptyset$ iff *e* is not a valid expression in the context *κ*.^{6}

Each context *κ* ∈ (*Σ*^{★})^{2} also has a *support*, or set of expressions that are valid within it:

$$\mathrm{supp}_L(\kappa) = \{\, e \in \Sigma^{\star} \mid \llbracket e \mid \kappa \rrbracket_L \neq \emptyset \,\}.$$

##### Example

Let *L* be a language of integers along with the + operator, for example, 2 + 2. *Y* is simply the integers. We take $\llbracket e \mid \kappa \rrbracket_L$ to map *e* to its standard arithmetic interpretation, for example, $\llbracket \texttt{2 + 6} \mid \lambda, \texttt{+ 4} \rrbracket_L = 8$. We take expressions that are not conventionally well-formed to be invalid: for example, $\llbracket \texttt{+} \mid \lambda, \texttt{+} \rrbracket_L = \emptyset$. Finally, let *κ* = 〈*λ*, +4〉. Then supp_{L}(*κ*) = *L*, since any valid expression can occur within *κ*.
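To make the example concrete, the following is a minimal Python sketch of one possible denotation function for this integer language. The names `in_language`, `denote`, and `in_support` are hypothetical, `None` stands in for the null symbol ∅, and treating *e* as valid whenever both *e* and *ler* are well-formed sums is a simplifying assumption rather than part of the formal definition.

```python
import re
from typing import Optional

# Strings of the toy language: integers joined by "+", e.g. "2+2".
_SUM = re.compile(r"\d+(\+\d+)*")


def in_language(s: str) -> bool:
    """True iff s (ignoring spaces) is a well-formed sum of integers."""
    return _SUM.fullmatch(s.replace(" ", "")) is not None


def denote(e: str, l: str = "", r: str = "") -> Optional[int]:
    """Denotation of e in the context <l, r>; None plays the role of the null symbol."""
    # Assumption: e is valid iff e itself and the full string l+e+r are both in the language.
    if not (in_language(e) and in_language(l + e + r)):
        return None
    return sum(int(tok) for tok in e.replace(" ", "").split("+"))


def in_support(e: str, l: str, r: str) -> bool:
    """True iff e belongs to the support of the context <l, r>."""
    return denote(e, l, r) is not None
```

Under these assumptions, `denote("2 + 6", "", " + 4")` evaluates to 8 and `denote("+", "", "+")` evaluates to `None`, mirroring the two denotations in the example.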

### 2.2 Strong Transparency

As defined above, we make very few assumptions about denotations. They are not necessarily compositional, and expressions may take different referents in different contexts. However, we saw in the integer expression language that the meanings of an expression did not depend on its context. We now define a property formalizing this idea.

(Strong transparency) *L* is strongly transparent iff, for all *e* ∈ *Σ*^{★}, *κ* ∈ (*Σ*^{★})^{2}, either $\llbracket e \mid \kappa \rrbracket_L = \llbracket e \mid \lambda^{2} \rrbracket_L \neq \emptyset$, or $\llbracket e \mid \kappa \rrbracket_L = \emptyset$.

Informally, strong transparency says each *e* has a well-defined denotation that exists independent of context, and that this simple denotation can be “plugged into” any context. Our previous example expression 2 + 6 is strongly transparent because it can be said to have a well-defined value 8 independent of its context. We could break strong transparency by adding bound variables to the language, for example, x = 2; x + 6 in Python. In this case, $\llbracket \texttt{x} \mid \kappa \rrbracket_L$ non-vacuously depends on *κ*.

Strong transparency resembles referential transparency (Whitehead and Russell, 1925–1927), but is a stronger condition, in that it does not allow the same name to *ever* refer to different values. For example, for a Python program, strong transparency does not allow assigning local variables within a function, even if the function output would remain completely specified by its inputs.
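The definition can be illustrated with a small Python sketch that searches a finite sample of expressions and contexts for counterexamples to strong transparency. Everything here is hypothetical: `transparency_violations`, `literal_denote`, and `binding_denote` are illustrative names, and the two toy semantics are assumptions chosen only to contrast a context-independent language with one that has variable binding. A finite search can refute strong transparency but cannot establish it.

```python
import re
from typing import Callable, Iterable, List, Optional, Tuple

Context = Tuple[str, str]
Denotation = Callable[[str, str, str], Optional[int]]


def transparency_violations(
    denote: Denotation, exprs: Iterable[str], contexts: Iterable[Context]
) -> List[Tuple[str, Context]]:
    """Return pairs (e, context) where e is valid but differs from its value in the empty context."""
    contexts = list(contexts)
    violations = []
    for e in exprs:
        base = denote(e, "", "")  # denotation in the empty context
        for l, r in contexts:
            value = denote(e, l, r)
            # Strong transparency requires: value is null, or value equals the (non-null) base.
            if value is not None and value != base:
                violations.append((e, (l, r)))
    return violations


def literal_denote(e: str, l: str, r: str) -> Optional[int]:
    """Toy semantics where integer literals mean the same thing in every context."""
    return int(e) if e.isdecimal() else None


def binding_denote(e: str, l: str, r: str) -> Optional[int]:
    """Toy semantics where `x` takes the value assigned by a preceding `x = N; `."""
    if e.isdecimal():
        return int(e)
    match = re.fullmatch(r"x = (\d+); ", l)
    if e == "x" and match:
        return int(match.group(1))
    return None
```

With the sample contexts `[("", ""), ("x = 2; ", " + 6"), ("x = 5; ", "")]` and expressions `["x", "7"]`, the literal semantics yields no violations, whereas the binding semantics reports `x` as a violation in both assignment contexts.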

### 2.3 Assertion Queries

We now define an oracle function providing assertion information about expressions in *L*, resembling assert e1 == e2 for two Python expressions e1, e2. A system is granted access to this function, and it can make *assertion queries* to it in order to learn about the semantics of *L*.^{7} An assertion query tells us whether two expressions *e*,*e′* are equivalent within the context *κ*.

(Assertion oracle) For *e*, *e′* ∈ *Σ*^{★} and *κ* ∈ (*Σ*^{★})^{2}, define the assertion oracle

$$\aleph_L(e, e' \mid \kappa) = \begin{cases} 1 & \text{if } \llbracket e \mid \kappa \rrbracket_L = \llbracket e' \mid \kappa \rrbracket_L \\ 0 & \text{otherwise.} \end{cases}$$

Recall that we defined $\llbracket e \mid \kappa \rrbracket_L = \emptyset$ if *e* is not valid in the context *κ*. In our example language of integer expressions, for all *κ*, *ℵ*_{L}(4,2 + 2∣*κ*) = 1, since 4 = 2 + 2. The computational power of this oracle depends on the complexity of the underlying semantics: For arbitrary semantics, it can become uncomputable. In this paper, though, we focus on classes of languages for which the denotation function and assertion oracle are computable.

The *ℵ*_{L} oracle is motivated by assertion statements in programming languages, which occur naturally in environments like unit tests. The distribution of strings in a corpus of code should capture some notion of this oracle, since a programmer is more likely to assert two expressions are equal if they are expected to have the same value. Our goal is to study the limits of understanding achievable from raw text, so we consider an “upper bound” setup by assuming a system has full access to *ℵ*_{L}. Can the system use this powerful oracle to emulate the underlying semantics?
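As a rough illustration, an assertion oracle can be sketched in Python as a wrapper around a denotation function. The names `make_oracle` and `sum_denote` are hypothetical, and the toy semantics (sums of integers, with `None` standing in for ∅) is an assumption made only for this example.

```python
from typing import Callable, List, Optional

Denotation = Callable[[str, str, str], Optional[int]]
Oracle = Callable[[str, str, str, str], int]


def make_oracle(denote: Denotation) -> Oracle:
    """Build an assertion oracle that compares two denotations within one context."""

    def asserteq(e1: str, e2: str, l: str, r: str) -> int:
        # Two invalid expressions both denote None, so they compare as equal.
        return int(denote(e1, l, r) == denote(e2, l, r))

    return asserteq


def sum_denote(e: str, l: str, r: str) -> Optional[int]:
    """Toy semantics: sums of integers; the context only affects validity."""

    def parse(s: str) -> Optional[List[int]]:
        toks = s.replace(" ", "").split("+")
        return [int(t) for t in toks] if all(t.isdecimal() for t in toks) else None

    terms = parse(e)
    if terms is None or parse(l + e + r) is None:
        return None
    return sum(terms)
```

Under these assumptions, `make_oracle(sum_denote)("4", "2 + 2", "", " + 4")` returns 1, while comparing `"4"` with `"2 + 3"` returns 0. A system with only oracle access sees these 0/1 answers and never the denotations themselves.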

### 2.4 Turing Machines

Our notion of language understanding will be based around the idea of emulation, which in turn requires a model of computational realizability. We will use Turing machines (Turing, 1936) as a model of universal computation. We write *μ*(*e*) for the output of Turing machine *μ* evaluated on input *e* ∈ *Σ*^{★}. We will also define an oracle Turing machine as a standard Turing machine that can compute a blackbox “oracle” function *f* as a subroutine. We imagine the machine has a special *query* instruction and tape. After writing *x* to the query tape and executing the query instruction, the query tape will contain *f*(*x*). We will write *μ*_{f}(*e*) for the Turing machine *μ* evaluated on input *e* with oracle access to *f*. In the case where *f* = *ℵ*_{L}, we will simply write *μ*_{L}(*e*). Whereas, in computability theory, oracle Turing machines are generally leveraged to make reductions from uncomputable problems, here we will use them to formalize the ability of an emulator to make assertion queries about *L*. This oracle provides additional power because these queries contain additional information beyond that encoded in the input expression.

## 3 Research Question: Do Assertions Enable Emulation?

There is a long history in AI of trying to define and measure understanding. Turing (1950) constitutes an early behaviorist perspective; more recent approaches tend to emphasize not just an external view of a system’s behavior, but also “how it is achieved” (Levesque, 2014). Understanding can be behaviorally diagnosed in neural models by evaluating them on benchmarks (Wang et al., 2018). An alternate approach is probing (Adi et al., 2017; Conneau et al., 2018; Hupkes and Zuidema, 2018; Hewitt and Liang, 2019; Belinkov and Glass, 2019), which investigates *how directly* a model’s representations encode semantic relations by measuring if they can be easily decoded from them. Similarly, we take the position that systems are capable of understanding if they *emulate* representations that are isomorphic to underlying meaning under important semantic relations like equivalence. We will formalize this in the research question stated below, which asks whether such emulation is possible using assertions.

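(*ℵ*-emulation) A class of languages ℒ over *Σ* is *ℵ*-emulatable if there exists an oracle Turing machine *μ* and standard Turing machine *δ* such that, for all *L* ∈ ℒ, *κ* ∈ (*Σ*^{★})^{2}, and *e*, *e′* ∈ supp_{L}(*κ*),

$$\llbracket e \mid \kappa \rrbracket_L = \llbracket e' \mid \kappa \rrbracket_L \iff \delta\big(\mu_L(e), \mu_L(e') \mid \kappa\big).$$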

*μ* can be thought of as an emulator that evaluates expressions, whereas *δ* receives two values and decides whether they are equal. Crucially, only *μ* has direct access to *ℵ*_{L}. *δ* can only use information from the oracle to the extent that it is encoded in the representations *μ*_{L}(*e*) and *μ*_{L}(*e′*).

The definition of *ℵ*-emulation formulates emulation as a decision problem, as is typical in theoretical computer science. Equivalently, *δ* can be replaced by a computable function *ρ* such that *ρ*(*μ*_{L}(*e*)∣*κ*) *evaluates* *μ*_{L}(*e*) in context *κ*, that is, its output string is isomorphic to $\llbracket e \mid \kappa \rrbracket_L$ under =. The functions *δ* and *ρ* are Turing-reducible to each other, implying that if one definition is satisfied, so is the other.

With our definition of emulation in place, we can formally state the research question:

For a class of languages ℒ, is ℒ *ℵ*-emulatable?

How does this question relate to understanding in large language models? We imagine that, with sufficiently large amounts of data, the frequencies of strings in *L* carry enough signal such that the language model objective “supervises” access to *ℵ*_{L}. Thus, *μ*_{L}(*e*) can be thought of as the language model representation of an expression *e*. We then hope to recover underlying semantic relations from the representations produced by the language model via some function *δ*. The class ℒ corresponds to a set of hypothesis languages over which the language model must search for the true *L*. We will see that whether emulation is possible will depend on the properties of ℒ.

Stepping back, our research question bears on the role of assertions raised by Bender and Koller (2020). Does observing assertions allow a Turing-complete system to emulate a compiler? In more general terms, are assertions powerful enough implicit grounding to achieve representations that encode the denotational semantics of a language?

## 4 Strong Transparency

We first consider the case where the language being learned is known to be strongly transparent. Let Transparent denote the class of strongly transparent languages. We will show that Transparent is *ℵ*-emulatable. The core idea of the proof is to construct a canonical form for each expression. The canonical form is the first expression in a lexicographic ordering that the assertion oracle deems equivalent to the target expression. For technical reasons, the emulator returns the index of this string under the lexicographic order.

### Theorem 1

Transparent *is ℵ-emulatable.*

As Python is Turing-complete, we write $\mu : \Sigma^{\star} \to \mathbb{N}$ as a Python function emulate in Figure 2. The function receives as input an expression expr and a callback function asserteq to an oracle computing *ℵ*_{L}. For each *e* ∈ *Σ*^{★}, there exists *e*^{★} ∈ *Σ*^{★} such that *ℵ*_{L}(*e*,*e*^{★}∣*λ*^{2}) = 1. In the “worst case”, this holds when *e*^{★} = *e* by reflexivity. By construction, all_strings reaches all strings in finite time. Therefore, the number of loop iterations before reaching *e*^{★} is finite. We can conclude that emulate halts on every *e* ∈ *Σ*^{★}, establishing that it is computable.

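It remains to show that the emulation is correct for every *κ* ∈ (*Σ*^{★})^{2}. We note that *δ* is simply the indicator function for equality over the natural numbers:

$$\delta(i, i' \mid \kappa) = \begin{cases} 1 & \text{if } i = i' \\ 0 & \text{otherwise.} \end{cases}$$

The function emulate outputs *i* ∈ ℕ, the index of the first string *e*^{★} such that $\llbracket e \mid \lambda^{2} \rrbracket_L = \llbracket e^{\star} \mid \lambda^{2} \rrbracket_L$. Now, let *e*, *e′* ∈ supp_{L}(*κ*) be different inputs to *μ*. Because the enumeration order of the for loop is fixed across computation of *μ*_{L}(*e*) and *μ*_{L}(*e′*):

$$\begin{aligned} \mu_L(e) = \mu_L(e') &\iff \llbracket e \mid \lambda^{2} \rrbracket_L = \llbracket e' \mid \lambda^{2} \rrbracket_L \\ &\iff \llbracket e \mid \kappa \rrbracket_L = \llbracket e' \mid \kappa \rrbracket_L, \end{aligned}$$

where the last step follows from strong transparency, since *e* and *e′* are both valid in *κ*. We conclude that the conditions for *ℵ*-emulation are fully satisfied.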
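The construction in the proof can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than a reproduction of the paper's figure: the alphabet `ALPHABET`, the length-then-lexicographic enumeration in `all_strings`, and the toy oracle `toy_asserteq` (integer sums with context-independent values) are all hypothetical choices, and `delta` drops the context argument because it is unused.

```python
from itertools import count, product
from typing import Callable, Iterator, Optional

AssertEq = Callable[[str, str, str, str], bool]

ALPHABET = "0123456789+"  # assumed alphabet for the toy integer language


def all_strings(alphabet: str = ALPHABET) -> Iterator[str]:
    """Enumerate all strings by length, then lexicographically, so each is reached in finite time."""
    for length in count(0):
        for chars in product(alphabet, repeat=length):
            yield "".join(chars)


def emulate(expr: str, asserteq: AssertEq) -> int:
    """Return the index of the first string the oracle deems equivalent to expr in the empty context."""
    for idx, cand in enumerate(all_strings()):
        if asserteq(expr, cand, "", ""):
            return idx
    raise AssertionError("unreachable: expr is always equivalent to itself")


def delta(i: int, j: int) -> bool:
    """Decide equivalence by comparing canonical indices."""
    return i == j


def toy_asserteq(e1: str, e2: str, l: str, r: str) -> bool:
    """Hypothetical oracle for integer sums; the context is ignored in this toy semantics."""

    def value(s: str) -> Optional[int]:
        toks = s.split("+")
        return sum(map(int, toks)) if all(t.isdecimal() for t in toks) else None

    return value(e1) == value(e2)
```

With this toy oracle, `emulate("2+2", toy_asserteq)` and `emulate("4", toy_asserteq)` return the same index, so `delta` judges the two expressions equivalent, while `emulate("5", toy_asserteq)` returns a different one. Because the returned index never exceeds the index of the input itself, its bit length grows at most linearly in the length of the input.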

Through a simple construction, we have shown it is possible to emulate meaning from assertion queries for languages with strongly transparent semantics. The number of bits in the emulated representation *μ*_{L}(*e*) is linear in the size of *e*. In the next section, we consider what happens without strong transparency, where, among other complexities, values can be bound to variables, complicating the construction used in Theorem 1.

## 5 General Case

Requiring strong transparency precludes a broad class of linguistic patterns allowing an expression to refer to different values in different contexts. For example, this includes assigning variable or function names in Python, or binding pronouns in natural language. These constructions can make emulation impossible to achieve from assertions. We will construct a class of languages based on Python where emulation is uncomputable.

Let $\mathrm{Leq} = \{ L_m \mid m \in \mathbb{N} \cup \{\infty\} \}$, where strings in *L*_{m} are defined according to Figure 3. For semantics, we first define $\llbracket \texttt{M} \mid \kappa \rrbracket_{L_m} = m$. For any other *ler* ∈ *L*_{m} that is a well-formed Python 3.8 expression, we define $\llbracket e \mid l, r \rrbracket_{L_m}$ as the value of *e* assigned by the Python interpreter in the context 〈*l*,*r*〉. For strings that are not valid Python expressions, define $\llbracket e \mid l, r \rrbracket_{L_m} = \emptyset$.

What does it take to emulate the expressions leq() and True in *L*_{m}? If we knew *m*, then we could emulate them by simply comparing *n* < *m*. However, it turns out that recovering *m* for any *L*_{m} ∈ Leq is not possible with a fixed number of assertion queries. Formalizing this, we will show that Leq is not *ℵ*-emulatable.^{8}

### Theorem 2

Leq *is not ℵ-emulatable.*

Without loss of generality, we focus on the contexts for leq()^{9} and True within **print**(⋅), each of which is parameterized by some value of *n*. Notationally, we identify each *L*_{m} with *m*, and each context with its parameter *n*. This enables shorthand like $\llbracket e \mid n \rrbracket_m$ for the denotation of the expression *e* in the context parameterized by *n* in *L*_{m}.

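When *m* = ∞, it holds for all *n* that $\aleph_{\infty}(\texttt{leq()}, \texttt{True} \mid n) = 1$. To satisfy emulation of *e* ∈ {leq(), True}, $\mu_{\infty}$ makes a finite number of assertion queries in contexts *n*_{1}, ⋯, *n*_{q}, which we assume without loss of generality are sorted in increasing order. We can adversarially construct $m' \neq \infty$ such that all these queries are the same, and thus $\mu_{\infty}(e) = \mu_{m'}(e)$ for both *e*. To implement this, we simply set *m′* = *n*_{q} + 1. Since $\mu_{\infty}(e) = \mu_{m'}(e)$, we conclude that, for all *n*,

$$\delta\big(\mu_{m'}(\texttt{leq()}), \mu_{m'}(\texttt{True}) \mid n\big) = \delta\big(\mu_{\infty}(\texttt{leq()}), \mu_{\infty}(\texttt{True}) \mid n\big).$$

On the other hand, consider *n* > *n*_{q}. In this case,

$$\llbracket \texttt{leq()} \mid n \rrbracket_{m'} = \text{False} \neq \text{True} = \llbracket \texttt{True} \mid n \rrbracket_{m'}, \qquad \llbracket \texttt{leq()} \mid n \rrbracket_{\infty} = \text{True} = \llbracket \texttt{True} \mid n \rrbracket_{\infty}.$$

Since *δ* returns the same answer for both languages but the correct answers differ, the conditions for *ℵ*-emulation cannot be satisfied for both $L_{m'}$ and $L_{\infty}$. This implies that Leq is not *ℵ*-emulatable.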
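The adversarial step of this argument can be illustrated with a short Python sketch. It assumes, following the remark that emulation would reduce to comparing *n* < *m*, that leq() evaluates to `n < M` in the context parameterized by *n*; the function names `leq_value`, `assertion`, and `adversarial_language` are hypothetical, and `math.inf` stands in for *m* = ∞.

```python
import math
from typing import List, Tuple, Union

Param = Union[int, float]  # m ranges over the naturals plus math.inf


def leq_value(n: int, m: Param) -> bool:
    """Assumed value of leq() in the context parameterized by n within L_m."""
    return n < m


def assertion(n: int, m: Param) -> int:
    """Oracle answer for the query (leq(), True | n) in L_m; True denotes True everywhere."""
    return int(leq_value(n, m))


def adversarial_language(queried: List[int]) -> Tuple[int, int]:
    """Given the finitely many queried contexts, return (m', n) where L_m' and L_inf diverge."""
    n_q = max(queried, default=-1)  # largest queried context, or -1 if there were no queries
    m_prime = n_q + 1
    witness = m_prime  # smallest context with n > n_q
    return m_prime, witness
```

For example, if an emulator queried the contexts `[0, 3, 10]`, then `adversarial_language` returns *m′* = 11. Every queried context receives the same oracle answer in *L*_{11} as in *L*_{∞}, yet at *n* = 11 the expressions leq() and True are equivalent in *L*_{∞} and not in *L*_{11}.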

### 5.1 Discussion

We briefly summarize this result in less formal terms. Leq contains languages *L*_{m} defined by Figure 3. Every program in each *L*_{m} is easily computable. With knowledge of the Python interpreter and *m*, any agent could execute all of these programs. This can be formalized by observing that, for a fixed *m*, the class {*L*_{m}} is *ℵ*-emulatable. Rather, what we have shown is that, with finite time, it is impossible for an ungrounded agent to emulate *L*_{m} *using assertion queries* when *m* is unknown in advance. In other words, without prior knowledge of *m*, no algorithm can use assertions to disambiguate which notion of = is used by *L*_{m} from the infinite other possibilities. In a rough sense, *m* can be thought of as a cryptographic key enabling linguistic understanding: agents that know *m* can directly emulate *L*_{m}, but agents without it cannot, at least using assertions.^{10}

Theorem 2 does not use the fact that *δ* must be computable, as opposed to an arbitrary function. Even if *δ* were an arbitrary function, it could not disambiguate the value of *m*, because the representations it receives are based on finitely many queries.

It is more precise to state Theorem 2 in a formal language, but a similar argument can be adapted to a natural language like English. An example is shown in Figure 4, where we define the meaning of a sentence as its truth conditions, and we imagine the class of candidate languages is formed by varying the unspecified *number*, which can potentially be $\infty$. Deciding if *n* is less than it has the same truth conditions as *Zero equals one* is equivalent to comparing leq() and True. A system must necessarily fail to emulate the semantics of these expressions in some context, for some secret number. The rest of the paper further explores the implications and limitations of applying our formal model to natural language.

## 6 Towards Natural Language

As discussed in Section 1, our results are inspired by the thought experiment of whether a language model can use raw code to learn a compiler. A goal of this, of course, is to examine whether understanding can be acquired from natural language text in a simplified setting. In principle, our formal results can bear on this broader question about natural language, although some differences emerge when extending the results to a less well-defined setting. In many cases, these differences appear to make the task of learning meaning harder, suggesting that our negative claim in a simpler setting (Theorem 2) may still hold as an impossibility result. We now discuss some points of difference between our formal model and natural language.

##### Truth Conditions

There are connections between our framework and the concepts of truth values and truth conditions in linguistic semantics. For a Boolean-valued expression *e*, a truth value corresponds to computing $\llbracket e \mid \kappa \rrbracket_L$ in a fixed context. On the other hand, truth conditions correspond roughly to a function computing $\llbracket e \mid \kappa \rrbracket_L$ for any *κ*. A crucial difference, though, is that these conditions cannot be *intensional* (Von Fintel and Heim, 2011), that is, they are not functions of the world state, but rather of the linguistic context only. In this sense, emulation corresponds to recovering the ability to resolve non-intensional truth conditions of sentences. This model is natural for formalizing a closed programming language environment, for example, with no environment variables or user input, since in this case the program state is specified completely by the linguistic context. On the other hand, English has common elements like *that* whose meaning can change depending on world state external to language. Perhaps allowing such elements would only make understanding more difficult; or, arguably, generally impossible, since there is no way for the model to observe the grounding world state using only an assertion oracle. We are inclined to believe that, since such changes would make understanding more difficult, Theorem 2 would still hold as an impossibility result. However, future work would be needed to make this idea precise.

##### Possible Worlds

In the last paragraph, we discussed how mutable world state is an additional complexity of natural language compared to our setup. Similarly, speakers of natural languages have imperfect information about the world around them, which can be captured by modeling the referent of an expression over a set of *possible* worlds, rather than within a specific evaluation context. In Appendix A, we explore to what degree this setting makes the task of learning to understand more difficult. In adapting our model to this context, the assertion oracle must become “modal” in the sense that it quantifies over sets of worlds. We explore two different models of modality for the oracle, corresponding to different physical interpretations. In one case, Theorems 1 and 2 apply analogously, while, in the other, emulation becomes an ill-defined problem.

##### Denotation vs. Intent

Bender and Koller (2020) distinguish between *standing meaning* and *communicative intent*, reflecting a distinction between denotational semantics and other pragmatic intentions that a speaker has in producing an utterance. In this paper, it is most straightforward to take $\llbracket e \mid \kappa \rrbracket_L$ to reflect standing meaning. In principle, we could imagine that it represents the speaker’s communicative intent, and that an omniscient oracle *ℵ*_{L} can reveal information about the speaker’s intents to the system. Even with this unrealistically powerful oracle, Theorem 2 says that the system cannot emulate the speaker’s intents.

##### Competence vs. Performance

Chomsky (1965) differentiates competence and performance in linguistic theory, where competence corresponds roughly to the correct algorithmic modeling of a linguistic process, and performance describes its implementation subject to resource constraints like memory. Arguably, agents might be said to understand language if they are competent in this sense, even if they sometimes make performance errors. In contrast, our definition of *ℵ*-emulation permits no performance errors. In future work, it would be interesting to adapt an approximate notion of emulation that tolerates performance errors in order to more closely target understanding in a sense reflecting competence.

##### Other Relations

Theorems 1 and 2 investigate whether *ℵ*_{L} can be used to emulate meaning representations that preserve an equivalence relation. While equivalence is an important part of semantics, other semantic relations like entailment are also necessary for language understanding. In Appendix B, we show that a generalization of Theorem 1 extends to *any* semantic relation. In other words, strong transparency also enables emulation of relations besides =.

##### Other Oracles

We believe assertions are a fairly general model of the types of semantics encoded in unsupervised learning resulting from a pragmatic bias for truth; however, it is possible other information is also represented, resulting from other pragmatic biases governing language usage and dataset creation. This additional information could be formalized as access to additional oracles. It would be exciting to formalize the power of multimodal setups by analyzing the interactions of oracles enabled by different input modalities.

## 7 Stepping Back

In this work, we formalized an argument that was raised by Bender and Koller (2020) as a thought experiment. Bender and Koller (2020) question whether unsupervised training objectives are the right goal to target for achieving natural language understanding. If meaning is defined as identifying which object in the real world, or which set of situations, a linguistic element refers to, then, in a direct sense, an ungrounded system cannot understand meaning. But Bender and Koller (2020) go farther than this, claiming that an ungrounded system cannot even *emulate* understanding because it is not clear how a system should learn to interpret strings, even if it can model their distribution. We formalize this idea of emulation as *ℵ*-emulation.

One counterargument mentioned by Bender and Koller (2020) is that indirect forms of grounding do exist in programming and natural language, which we formalize as assertions. The syntactic distributions of statements like assert allow us to indirectly observe semantic relations over the denotations. Assertions are one way that the distribution of strings in a corpus is not blind to their semantics. By studying them, we study whether this indirect grounding enables a computational system to emulate the underlying semantic relations.

##### Key Takeaways

While assertions allow a system to emulate semantic relations in simple cases where the semantics are strongly transparent, we find that linguistic constructs like variable binding bring this task in conflict with the fundamental laws of computability. In other words, under our formal model of meaning and emulation, it is not just intractable for an ungrounded system to emulate understanding of a formal language, but, in some cases, *impossible*. We provide constructive examples where understanding must necessarily break down. We present these results in a well-defined framework building off formal approaches in logic, linguistics, and computer science. While we do not prove anything about natural languages, we do show that ungrounded models must fail to emulate equivalence in a very simple setting. A similar result likely extends to natural language understanding as well, which among other things, requires modeling referential identity (e.g., for sentences like *Manny is the cat*). Further, we believe much of our framework can be readily adopted in other works formalizing understanding in Turing-complete systems.

##### Open Questions

In this work, we have focused on utterances, by default, as opposed to *dialogues*. An exciting extension would be to formalize a dialogue between two speakers, interrupted by the “octopus” of Bender and Koller (2020).^{11} Existing theories of discourse could potentially be synthesized with this framework. What linguistic properties besides strong transparency relate to emulatability? Can this framework be extended to formalize multimodal setups, where multiple oracles from different domains can potentially be combined to gain additional power? Finally, is there a natural way to relax our standard of emulation towards a probabilistic definition, and how would this change the results?

## Acknowledgments

We thank Mark-Jan Nederhof for his excellent suggestions. We also thank Dana Angluin, Matt Gardner, Eran Yahav, Zachary Tatlock, Kyle Richardson, Ruiqi Zhong, Samuel Bowman, Christopher Potts, Thomas Icard, and Zhaofeng Wu for their feedback on various versions of this work. Further thanks to our anonymous reviewers and researchers at the Allen Institute for AI and UW NLP. Finally, we appreciate the lively online discussion of the paper, which informed updates to the camera-ready version.

## A Multiple Worlds

Programs execute in well-defined environments with a clear state. Speakers of natural language, on the other hand, have imperfect information and beliefs about the world around them. Thus, it can be more natural to model grounding context for language as a set of *possible worlds*, rather than a single world state. We formalize this in two different ways (with two different physical interpretations) and explore how it affects our results.

Let *W* be the set of all possible worlds. We redefine denotations to be *intensionalized* (Von Fintel and Heim, 2011); that is, we write $\llbracket e \mid \kappa \rrbracket^w$ for the denotation of *e* in the context *κ*, evaluated in world *w* ∈ *W*. Assume for simplicity that *Y* = {0, 1, *∅*}. We will now introduce modal denotations and assertions using a generic *modal quantifier* ⊙, which reduces a sequence of worlds to a boolean value according to some intensional predicate. This quantifier controls how multiple possible worlds are collapsed to form denotations and query outputs.

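For all *e* ∈ *Σ*^{★}, *κ* ∈ (*Σ*^{★})^{2}, define

$$\odot\llbracket e \mid \kappa \rrbracket_L = \bigodot_{w \in W} \llbracket e \mid \kappa \rrbracket_L^w .$$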

We write $\aleph_L^w$ for the previously defined assertion oracle applied in a specific world *w*. We also extend it to quantify over multiple worlds:

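For all *e*, *e′* ∈ *Σ*^{★}, *κ* ∈ (*Σ*^{★})^{2}, define

$$\odot\aleph_L(e, e' \mid \kappa) = \bigodot_{w \in W} \aleph_L^w(e, e' \mid \kappa) .$$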

Specifically, we consider $\odot \in \{\square, \lozenge\}$, corresponding to universal and existential quantifiers over worlds. Thus, $\square$ can be thought of as ∀ over worlds, and $\lozenge$ can be thought of as ∃. For either quantifier, if $\llbracket e \mid \kappa \rrbracket_L^w = \emptyset$ for any world *w*, we define $\odot\llbracket e \mid \kappa \rrbracket_L = \emptyset$ as well. Each quantifier will have a different physical interpretation. With universal quantification, we will find that results analogous to ^{4} and ^{5} hold. With existential quantification, it turns out that the equivalence class of *μ* is underspecified. In other words, not only is it impossible to compute an emulator with a finite number of assertion queries, but, even with infinite assertions, there is no consistent way to emulate the underlying modal semantics.
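To make the two quantifiers concrete, here is a minimal Python sketch under stated assumptions: denotations take values in {0, 1, ∅} (with `None` standing in for ∅), the context *κ* is held fixed and omitted, and each quantifier is applied pointwise to the per-world values. The names `box`, `diamond`, `modal_denotation`, `modal_assertion`, and the toy table `TOY` are hypothetical and introduced only for illustration.

```python
from typing import Callable, Optional, Sequence

Value = Optional[int]  # 0 or 1; None stands in for the undefined value ∅
Denote = Callable[[str, str], Value]  # hypothetical: (expression, world) -> value


def box(values: Sequence[Value]) -> Value:
    """Universal reading: 1 only if every world yields 1; ∅ if any world is ∅."""
    if any(v is None for v in values):
        return None
    return int(all(values))


def diamond(values: Sequence[Value]) -> Value:
    """Existential reading: 1 if some world yields 1; ∅ if any world is ∅."""
    if any(v is None for v in values):
        return None
    return int(any(values))


def modal_denotation(quant, denote: Denote, e: str, worlds: Sequence[str]) -> Value:
    """Collapse the per-world denotations of e with the given quantifier."""
    return quant([denote(e, w) for w in worlds])


def modal_assertion(quant, denote: Denote, e1: str, e2: str,
                    worlds: Sequence[str]) -> Value:
    """Collapse the per-world answers to 'do e1 and e2 denote the same value?'."""
    answers = []
    for w in worlds:
        a, b = denote(e1, w), denote(e2, w)
        # Assumption: a per-world assertion is undefined if either side is.
        answers.append(None if a is None or b is None else int(a == b))
    return quant(answers)


# Toy example: "p" is true in both worlds, "q" only in the first.
TOY = {("p", "w1"): 1, ("p", "w2"): 1, ("q", "w1"): 1, ("q", "w2"): 0}
denote = lambda e, w: TOY.get((e, w))
worlds = ["w1", "w2"]
print(modal_assertion(box, denote, "p", "q", worlds))      # 0
print(modal_assertion(diamond, denote, "p", "q", worlds))  # 1
```

In this toy setup the universal reading rejects the equivalence because the two expressions disagree in one world, while the existential reading accepts it because they agree in at least one.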

### A.1 Universal Quantification

In the first case we let $\odot = \square$. Two expressions are viewed as having the same meaning if they are equivalent in every possible belief world. This is interpretable as observing text $L_\square$ written by a single author whose belief state is represented by multiple possible worlds. The author only asserts a statement is true if it is consistent across all worlds that they believe are possible.

In this setting, we will show that the modal assertion oracle uniquely specifies a modal denotation for each expression, up to isomorphism. In other words, as with the non-modal assertion oracle, each assertion query would let us decide some relation between two expressions. Thus, the same results for the non-modal setting discussed in the main body of the paper will also hold here.

Consider *e*, *e′* ∈ *Σ*^{★} and any context *κ* ∈ (*Σ*^{★})^{2} such that $\square\llbracket e \mid \kappa \rrbracket_L \neq \emptyset$ and $\square\llbracket e' \mid \kappa \rrbracket_L \neq \emptyset$. Then,

Crucial to this simple proof is the fact that ∧ is distributive over =. This is specific to the quantifier being $\square$. ^{3} implies that $\square\llbracket e \mid \kappa \rrbracket_L$ can be recovered from modal assertion queries analogously to the non-modal case. Thus, results analogous to ^{4} and ^{5} apply for emulating $\square\llbracket e \mid \kappa \rrbracket_L$ using queries to $\square\aleph_L$.

### A.2 Existential Quantification

In the second case we let ⊙ = *◊*. Two expressions are viewed as having the same meaning if they are equivalent in *some* world. This is interpretable as observing a large dataset of text *L*_{◊} generated by many authors, each with a different single belief world *w*. In the corpus, we imagine two expressions can be asserted to be equivalent in some context if *any* of the authors would consider them to be equal in that context.

In this case, assertions do not even fully specify equivalence between the modal denotations. This is a stronger sense in which meaning cannot be emulated from assertion queries. Emulation is not just impossible with finite assertions, but mathematically underspecified.

There exist *e*, *e′* ∈ *E*(*L*) and *κ* ∈ (*Σ*^{★})^{2} such that $\lozenge\llbracket e \mid \kappa \rrbracket_L \neq \emptyset$ and $\lozenge\llbracket e' \mid \kappa \rrbracket_L \neq \emptyset$, and such that $\lozenge\aleph_L(e, e' \mid \kappa) = 1$ is consistent with either $\lozenge\llbracket e \mid \kappa \rrbracket_L = \lozenge\llbracket e' \mid \kappa \rrbracket_L$ or $\lozenge\llbracket e \mid \kappa \rrbracket_L \neq \lozenge\llbracket e' \mid \kappa \rrbracket_L$.

We construct an example with expressions *e*_{1}, *e*_{2} in a single context *κ*. Fix *W* = {*w*_{1}, *w*_{2}}. Table 1 shows two versions of this modal setup. In both versions of the universe, $\lozenge\aleph_L(e_1, e_2 \mid \kappa) = 1$. However, on the left, $\lozenge\llbracket e_1 \mid \kappa \rrbracket_L = \lozenge\llbracket e_2 \mid \kappa \rrbracket_L$, while, on the right, the two modal denotations differ. So, with *◊*, modal assertions do not uniquely determine equivalence of modal denotations.

Table 1: Two versions of the modal setup over *W* = {*w*_{1}, *w*_{2}}. The first three value columns give the left version and the last three give the right version.

| | *e*_{1} (left) | *e*_{2} (left) | ℵ (left) | *e*_{1} (right) | *e*_{2} (right) | ℵ (right) |
|---|---|---|---|---|---|---|
| *w*_{1} | 0 | 0 | 1 | 0 | 0 | 1 |
| *w*_{2} | 0 | 0 | 1 | 0 | 1 | 0 |
| ◊ | 0 | 0 | 1 | 0 | 1 | 1 |

As an equivalence class for *μ* is not even well-defined by *◊ℵ*_{L}, we cannot hope to compute it from queries. This is an even stronger sense in which emulation is impossible using assertions. On some level, this may be a natural model for language modeling corpora, which aggregate text from potentially inconsistent sources.
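The counterexample can be checked mechanically. The sketch below encodes the per-world values of Table 1 and computes the existential rows, assuming that *◊* acts as a logical OR over worlds (implemented here with `max` on {0, 1}). The dictionary layout and the function names are illustrative choices, not part of the formal development.

```python
# Per-world denotations of e1 and e2 in each version of the universe (Table 1).
LEFT = {"w1": {"e1": 0, "e2": 0}, "w2": {"e1": 0, "e2": 0}}
RIGHT = {"w1": {"e1": 0, "e2": 0}, "w2": {"e1": 0, "e2": 1}}


def diamond_denotation(universe, expr):
    """Existential reading: 1 if expr denotes 1 in some world."""
    return max(world[expr] for world in universe.values())


def diamond_assertion(universe, a, b):
    """Existential reading: 1 if a and b denote the same value in some world."""
    return max(int(world[a] == world[b]) for world in universe.values())


for name, universe in (("left", LEFT), ("right", RIGHT)):
    same = diamond_denotation(universe, "e1") == diamond_denotation(universe, "e2")
    print(name, diamond_assertion(universe, "e1", "e2"), same)
# left 1 True
# right 1 False
```

Both universes return the same answer to the modal assertion query, yet they disagree on whether the modal denotations are equal, which is exactly the underspecification described above.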

In summary, if assertions uniquely determine equivalence between denotations in a strongly transparent language, then we can expect to emulate representations preserving equivalence using assertions. Otherwise, there are various levels of formal challenges to emulating equivalence.

## B Other Semantic Relations

Sections 4, 5, and 7 investigate whether *ℵ*_{L} can be used to emulate meaning representations that preserve semantic equivalence. While equivalence is an important part of semantics, other semantic relations are also necessary for language understanding. For example, the following feature prominently in theories of linguistic semantics:

**Entailment** In general terms, an entailment (Winter, 2016) relation $\to$ is a partial order over *Y*. Intuitively, if $y \to y'$, then *y* is a “special case” of *y′*. For example, one could construct *E*, a semantic analysis of English, where $\llbracket \textit{fat cat} \mid \textit{a}, \textit{sits} \rrbracket_E \to \llbracket \textit{cat} \mid \textit{a}, \textit{sits} \rrbracket_E$.

**Contrary negation** Negation is a complex topic in semantics. One sense of negation is if two meaning representations are “contrary” (Horn and Wansing, 2020), meaning both cannot be true at the same time.
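As a toy illustration of these two relations, the sketch below models each denotation as the set of individuals a phrase picks out, reads entailment as set inclusion, and reads contrariness as disjointness. This set-based model, the individuals named in it, and the helper names are all assumptions made for the example; the definitions above only require entailment to be a partial order over *Y*.

```python
from itertools import product

# Toy denotations (hypothetical): each phrase denotes a set of individuals.
DENOTATION = {
    "fat cat": frozenset({"manny"}),
    "cat": frozenset({"manny", "tom"}),
    "dog": frozenset({"rex"}),
}


def entails(y: frozenset, y2: frozenset) -> bool:
    """y is a 'special case' of y2, modeled here as set inclusion."""
    return y <= y2


def contrary(y: frozenset, y2: frozenset) -> bool:
    """Both cannot hold of the same individual, modeled here as disjointness."""
    return y.isdisjoint(y2)


def is_partial_order(rel, items) -> bool:
    """Check reflexivity, antisymmetry, and transitivity of rel on items."""
    items = list(items)
    reflexive = all(rel(x, x) for x in items)
    antisymmetric = all(x == y or not (rel(x, y) and rel(y, x))
                        for x, y in product(items, repeat=2))
    transitive = all(rel(x, z) or not (rel(x, y) and rel(y, z))
                     for x, y, z in product(items, repeat=3))
    return reflexive and antisymmetric and transitive


print(entails(DENOTATION["fat cat"], DENOTATION["cat"]))  # True
print(entails(DENOTATION["cat"], DENOTATION["fat cat"]))  # False
print(contrary(DENOTATION["cat"], DENOTATION["dog"]))     # True
print(is_partial_order(entails, DENOTATION.values()))     # True
```

Note that entailment in this sketch is not symmetric, which is why it cannot be treated as an equivalence relation.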

Does ^{5} generalize to other relations besides =? To answer this, we first extend assertions and emulation to apply to a generic relation ∘ : *M*^{2}. The proof for ^{4} does not fully translate to this new setting, but we will show via a new argument that emulation is still possible.

For *e*, *e′* ∈ *Σ*^{★} and *κ* ∈ (*Σ*^{★})^{2}, define the assertion oracle

A class of languages ℒ over *Σ* is *ℵ*-emulatable w.r.t. ∘ if there exists an oracle Turing machine *μ* and standard Turing machine *δ* such that, for all *L* ∈ ℒ, *κ* ∈ (*Σ*^{★})^{2}, and *e*, *e′* ∈ supp_{L}(*κ*),

We are now ready to prove the extended form of ^{4}. The main idea of the proof will be to memoize the value of the relation ∘ between $\llbracket e \mid \kappa \rrbracket_L$ and the values of all expressions smaller than *e*. This guarantees that *δ* will be able to “look up” the correct output.

transparent is *ℵ*-emulatable w.r.t. ∘.

Similarly to ^{4}, we present the proof constructively as a Python program to compute *μ*. We then show how to define *δ* appropriately, completing the proof.

*μ*_{L}(*e*) ∈ *M*. In Python, *μ*_{L}(*e*) is a dictionary; we interpret it as a function $\mu_L(e) : \Sigma^\star \times \Sigma^\star \to \{0, 1, \emptyset\}$, where *∅* represents values that are not set. We define *δ* as follows:

*μ*_{L}(*e*)(*e*, *e′*) ≠ *∅* or *μ*_{L}(*e′*)(*e*, *e′*) ≠ *∅*. In Figure 6, cand either reaches *e* before *e′*, or *e′* before *e*. By symmetry, assume it reaches *e* before *e′*. Then *μ*_{L}(*e′*)(*e*, *e′*) ≠ *∅*, so

Therefore emulate satisfies ^{3}.

We needed to change the proof of ^{5} compared to ^{4} because ∘ is not an equivalence relation. In ^{4}, the final steps relied on reflexivity, transitivity, and symmetry: the three properties that constitute equivalence. The new proof enlarges the size of the emulated representations. Rather than representing each *e* with a number, *μ*_{L}(*e*) becomes a large dictionary of strings. This represents an increase in space complexity from linear to exponential in the size of *e*.
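The memoization idea can be sketched in Python as follows. This is a hedged reconstruction of the general strategy described in the proof, not the program of Figure 6: the oracle signature, the choice to query with empty contexts (reasonable only when denotations do not depend on context), the helper `owner`, and the toy oracle `toy_leq` are all assumptions made for illustration.

```python
from itertools import count, product
from typing import Callable, Dict, Iterator, Tuple

# Hypothetical oracle: (e, e2, left context, right context) -> does e ∘ e2 hold?
Oracle = Callable[[str, str, str, str], bool]
Repr = Dict[Tuple[str, str], bool]


def all_strings(alphabet: str = "ab") -> Iterator[str]:
    """Enumerate every string over `alphabet` in length-lexicographic order."""
    for n in count(0):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)


def emulate(expr: str, assert_rel: Oracle, alphabet: str = "ab") -> Repr:
    """Memoize the relation between expr and every string enumerated up to it.

    Assumes expr only uses characters from `alphabet`; otherwise the
    enumeration never reaches it and the loop does not terminate.
    """
    table: Repr = {}
    for cand in all_strings(alphabet):
        table[(cand, expr)] = assert_rel(cand, expr, "", "")
        table[(expr, cand)] = assert_rel(expr, cand, "", "")
        if cand == expr:
            return table


def owner(table: Repr) -> str:
    """Recover the expression a table was built for (its only diagonal key)."""
    return next(a for (a, b) in table if a == b)


def delta(m: Repr, m2: Repr) -> bool:
    """Look up whether owner(m) ∘ owner(m2) holds in whichever table stores it."""
    key = (owner(m), owner(m2))
    return m[key] if key in m else m2[key]


# Toy relation (an assumption): e ∘ e2 iff e has at most as many 'a's as e2.
toy_leq: Oracle = lambda e, e2, left, right: e.count("a") <= e2.count("a")

print(delta(emulate("ab", toy_leq), emulate("b", toy_leq)))  # False
print(delta(emulate("b", toy_leq), emulate("ab", toy_leq)))  # True
```

Whichever of the two expressions is enumerated later stores the pair in its table, so the lookup in `delta` always succeeds. The table for an expression has one pair of entries per string enumerated before it, which is where the exponential growth in the size of the expression comes from.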

## C Old Emulation Definition

*general denotation* as

*f*, *g* are isomorphic (with respect to =) over a set *X* iff, for all *x*, *x′* ∈ *X*,

*f* ≅_{=} *g* in this case. We will refer to a set of contexts *S* ⊆ (*Σ*^{★})^{2} as a *syntactic role*. Each syntactic role has a set of expressions supp_{L}^{−1}(*S*) whose *support* is that role:

(Old *ℵ*-emulation) $\mu : \Sigma^\star \to M$ emulates $\llbracket \cdot \rrbracket_L$ w.r.t. = iff:

1. $\mu \cong_{=} \llbracket \cdot \rrbracket_L$ over supp_{L}^{−1}(*S*), for all *S* ⊆ (*Σ*^{★})^{2}
2. There exists a Turing machine that computes whether *m* = *m′* for each *m*, *m′* ∈ *M*
3. There exists a Turing machine with oracle access to *ℵ*_{L} that computes *μ*

*ℵ*-emulatable iff, for all *L* ∈ ℒ, there exists an oracle Turing machine *μ* and normal Turing machine *δ* such that, for all *S* ⊆ (*Σ*^{★})^{2} and *e*, *e′* ∈ supp_{L}^{−1}(*S*),

^{3}, but we will make two slight changes. First, we will change the quantifier order, such that a single *μ* must work for every *L* ∈ ℒ. Then, we will grant *δ* access to a context *κ*, and rephrase the equation to hold over all *κ* ∈ (*Σ*^{★})^{2} and *e*, *e′* ∈ supp_{L}(*κ*):

^{3}. This version more faithfully reflects the intuitive notion of emulation. The old version required *μ*_{L}(*e*) to determine how *e* should evaluate in every possible context. Emulation would not be possible in some cases even with perfect knowledge of *L*. Now, it must just be possible in any context *κ* to compute $\llbracket e \mid \kappa \rrbracket_L$ from *κ* and *μ*_{L}(*e*), which is a weaker standard. Under the new definition, it is *always* possible to emulate a class of languages with one element, assuming $\llbracket e \mid \kappa \rrbracket_L$ is computable. An additional improvement is that emulation now applies to all expressions that share a context, whereas before it only targeted expressions with the same support.

## Notes

^{2}

Unit tests are blocks of code in a software project that are designed to test whether the core code is behaving correctly.

^{3}

Contexts like assertions can be seen as an argument in favor of the distributional hypothesis (Harris, 1954).

^{4}

Appendix C documents and motivates conceptual changes since the original arXiv version of the paper.

^{5}

We slightly abuse notation by using *L* to refer to both a set of strings, and a set of strings paired with a denotation function, which could be written more verbosely as $\langle L, \llbracket \cdot \rrbracket_L \rangle$.

^{6}

Our simple model of denotations does not reflect the full range of semantic theories that have been proposed for natural language. In particular, our denotations $\llbracket e \mid \kappa \rrbracket_L$ depend only on the linguistic context *κ* rather than any external world state. This differs substantially from how truth conditions are traditionally conceptualized in formal semantics (Heim and Kratzer, 1998). For example, in our framework, the referent of English $\llbracket \textit{the dog} \mid \kappa \rrbracket_L$ must be fixed with no regard for the extralinguistic context. Section 6 further contrasts our setup with the richer semantics of natural language.

^{7}

This resembles the role of queries in classical grammar induction works (e.g., Angluin, 1987).

^{8}

Another example of a non-*ℵ*-emulatable language takes M to be a finite list of integers and replaces *n* < M with *n* **in** M.

^{9}

The only “valid” context for leq() is within **print**(⋅). The denotation of leq() when it occurs next to **def** is *∅*.

^{10}

Alternatively, we can take a more complexity-theoretic perspective by measuring the number of queries needed to emulate up to a bounded context size. Fix a maximum *n*. Then we can use binary search with $O(\log n)$ queries to find the value of *m*. Since the number of context bits is $O(\log n)$, the number of queries is $O(|\kappa|)$, beating the $O(|\Sigma|^{|\kappa|})$ query complexity achievable by brute force. This perspective somewhat resembles Pratt-Hartmann and Third (2006) and other work in semantic complexity theory on the computational complexity of evaluating fragments of natural language.
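The binary search mentioned in this note can be sketched as follows, assuming each query tells us whether a chosen integer is strictly less than the hidden value *m* and that *m* lies between 0 and the fixed maximum. The function name `find_threshold` and the query interface are hypothetical.

```python
from typing import Callable, Tuple


def find_threshold(less_than_m: Callable[[int], bool], max_n: int) -> Tuple[int, int]:
    """Return (m, number of queries used), assuming 0 <= m <= max_n."""
    lo, hi, queries = 0, max_n, 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if less_than_m(mid):  # mid < m, so m lies strictly above mid
            lo = mid + 1
        else:                 # mid >= m, so m is at most mid
            hi = mid
    return lo, queries


# Hidden value 37 with a maximum of 1000: about log2(1001) queries suffice.
m_found, used = find_threshold(lambda n: n < 37, max_n=1000)
print(m_found, used)
```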

^{11}

The octopus thought experiment imagines that a deep-sea octopus *O* observes a dialogue between two humans by intercepting an underwater cable. Could *O* learn to emulate the role of one of the speakers without exposure to life on land?