Many current NLP systems are built from language models trained to optimize unsupervised objectives on large amounts of raw text. Under what conditions might such a procedure acquire meaning? Our systematic experiments with synthetic data reveal that, with languages where all expressions have context-independent denotations (i.e., languages with strong transparency), both autoregressive and masked language models successfully learn to emulate semantic relations between expressions. However, when denotations are changed to be context-dependent with the language otherwise unmodified, this ability degrades. Turning to natural language, our experiments with a specific phenomenon—referential opacity—add to the growing body of evidence that current language models do not represent natural language semantics well. We show this failure relates to the context-dependent nature of natural language form-meaning mappings.
Despite language models’ (LMs) centrality to recent progress on NLP benchmarks, a formal characterization of what can be learned from unsupervised training on large text corpora, and of what modern language models actually do learn, remains elusive. Empirically, Tenney et al. (2019), Kovaleva et al. (2019), Wu et al. (2021), among others, all discovered that pretrained LMs possess unsatisfactory semantic representations. Traylor et al. (2021) found co-variation between form and meaning to be insufficient for an LM to represent lexical semantics. Li et al. (2021), on the other hand, identified evidence of LMs representing dynamic semantics (Kamp, 1981; Heim, 1982; Groenendijk and Stokhof, 1991).
From first principles, Bender and Koller (2020) argued that it is a priori impossible for an ungrounded system that has access only to linguistic forms to learn the mapping between those forms and their grounded denotations. They claimed, as a thought experiment, that a learner that has access to all Java code (i.e., form) on GitHub can never learn execution (i.e., meaning). They nevertheless acknowledged that the existence of unit tests, which assert the expected output given input to blocks of code, could constitute a weak form of grounding which potentially enables the learning of meaning.
Formalizing this idea, Merrill et al. (2021) theoretically proved the possibility of learning (or more technically, emulating) semantic relations between expressions in a certain class of formal languages—those that are strongly transparent whose expressions have context-independent denotations—using an assertion oracle, analogous to the assertions in unit tests. In addition, with an example, they showed the existence of non-emulatable languages even with an assertion oracle.
Yet, the practical implications of these theoretical results have not been explored. While assertions enable the emulation of strongly transparent languages, it is unclear if existing LM architectures and objectives achieve emulation given training data with assertions. Furthermore, we do not know if natural language (NL) is similarly non-emulatable as Merrill et al.’s (2021) constructed example, especially since non-transparency does not always imply non-emulatability. We thus pose two research questions:
RQ1. Can current LM architectures and pretraining objectives emulate the meaning of strongly transparent languages?
RQ2. Can modern LMs fully emulate the meaning of natural language, which is non-transparent?
We answer RQ1 in the positive (§3): On a strongly transparent propositional logic language, autoregressive and masked language models pretrained on only expressions (form), à la GPT-2 (Radford et al., 2019) and RoBERTa (Liu et al., 2019c), can consistently compare and evaluate their values (meaning). We find that necessary grounding of the pretraining data distribution is crucial to this ability. We also investigate the role of transparency for emulatability in a controlled setting as an intermediate study before analyzing non-transparent natural language. We ablate strong transparency from the logic language while keeping other factors unchanged. We observe a substantial drop in the LMs’ ability to emulate meaning, highlighting the importance of transparency for emulatability.
We then turn to natural language (§4). Referential opacity is an extensively studied phenomenon in semantics and philosophy (Quine, 1956; Kripke, 1972, among others) but has not been examined in modern NLP. We prove that this phenomenon entails non-transparency and analyze how well existing LMs represent it. Our analyses based on probing and sentence similarity point to a lack of its representation in the largest GPT-2 and BERT (Devlin et al., 2019) models (RQ2). Theoretically, this is a natural language parallel to the emulation difficulty for our non-transparent formal language, and further reinforces the connection between transparency and meaning emulatability. Practically, through the lens of strong transparency, our results supplement prior studies that identified pretrained LMs’ insufficient semantic representations (Tenney et al., 2019; Yu and Ettinger, 2020, 2021; Wu et al., 2021, among others).
We follow Merrill et al.’s (2021) operationalization of the learning of meaning by emulation and their definition of strong transparency. We summarize their nomenclature and theoretical results in this section and provide some examples. We refer readers to Merrill et al. (2021) for more details.
At a high level, we take an inferential (Speaks, 2021, §2.2.3) view of meaning. An LM is taken to understand a language L if it can resolve semantic relations (e.g., equivalence) between expressions in L. This is achieved through two procedures: μL maps expressions into representations based on training data from L, and δ uses the representations of two expressions to resolve a semantic relation between them.
We consider a language L ⊆ Σ* over an alphabet Σ and denote (Σ*)² = Σ* × Σ*. We term members of L sentences. We consider an expression e ∈ Σ* with associated left and right context κ = ⟨l, r⟩ ∈ (Σ*)². ler ∈ L is a sentence. We denote the empty string with λ and the empty context with λ2.
For example, the sentence belongs to Lt because it can be generated by this CFG using the steps illustrated in Figure 1. In this sentence, the expression F has context .
We consider the denotation of an expression e, ⟦e|κ⟧L, to be its meaning in the context κ. We write ⟦e|κ⟧L = ∅ if e is invalid in κ.
The meaning of a propositional logic expression can be the value derived from its conventional semantics, i.e., either T or F. For instance, , and . For natural language, extensionally, the meaning of a sentence is its truth value, also either T or F (Frege, 1892); intensionally, the meaning is its truth condition, which could be viewed as a set of possible worlds where the sentence is true (Carnap, 1947). For a summary of the extension and intension of other expressions in NL, see Kearns (2011, §1.3). As an example in English, extensionally, ⟦An author of this paper believes that Corgis are the cutest dogs.|λ2⟧ = T.
2.3 Assertion Oracle
LM pretraining corpora could provide ℵ-like signals. For instance, pretraining sequences of the form e=e’ are a natural analog to an ℵ query. We adopt this view to pretrain our propositional logic language in §3. In English and many other natural languages, copulas are a straightforward counterpart: “Corgis are the cutest dogs.” is equivalent to “Corgis=the cutest dogs.” This can be further reduced to all propositions: “Corgis run.” is equivalent to ℵ(Corgis run., T) under the extensional framework.
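A minimal sketch of what such an oracle could look like, assuming (on our reading of Merrill et al., 2021) that ℵ takes two expressions and a context and reports only whether their denotations coincide. The `make_oracle` helper, the `denote` callback, and the toy lookup table are hypothetical names introduced for illustration.

```python
from typing import Callable, Hashable, Tuple

# (left, right); ("", "") plays the role of the empty context.
Context = Tuple[str, str]


def make_oracle(denote: Callable[[str, Context], Hashable]):
    """Build an assertion oracle from a denotation function (hypothetical helper)."""

    def oracle(e1: str, e2: str, context: Context = ("", "")) -> bool:
        # Assumption: the oracle compares denotations but never reveals them.
        return denote(e1, context) == denote(e2, context)

    return oracle


# Toy, context-independent denotations for a few propositional logic expressions.
TOY_DENOTATIONS = {"(T∧F)": False, "(F∨F)": False, "(¬F)": True}


def toy_denote(expression: str, context: Context) -> bool:
    # The context is ignored, as in a strongly transparent language.
    return TOY_DENOTATIONS[expression]


oracle = make_oracle(toy_denote)
```

A learner with access to `oracle` can find out that (T∧F) and (F∨F) are equivalent without ever being told that either denotes F, which is the sense in which the grounding is only indirect.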
2.4 ℵ-emulation: Learning Meaning
Back to Corgis, an English learner μ can observe the equivalence of e = “Corgis” and e′ = “the cutest dogs” in many different contexts κ and develop their representations. We say that natural language is emulated if there exists δ that can decide the equivalence between such expressions from the representations alone.
The standard pretraining-probing setup is an intuitive instantiation of μL and δ. A model μL can query ℵL while pretraining on language L, which can then produce a representation μL(e) for any expression e. An equivalence probe δ can take the (frozen) representation of two expressions and decide their equivalence in some context. Importantly, because δ is frozen, it cannot make any more queries to ℵL. We adopt this paradigm for analysis in §3 and §4 and elaborate below.
2.5 Strong Transparency
A language L is strongly transparent if all of its expressions have context-independent denotations. That is, for all expressions e ∈ Σ* and contexts κ, either ⟦e|κ⟧L = ⟦e|λ2⟧L ≠ ∅ or ⟦e|κ⟧L = ∅.
Under conventional propositional logic semantics, Lt (Def. 1) is strongly transparent because the value of every expression is determined by itself and unaffected by its context. Natural language, on the other hand, is non-transparent. We prove in §4 that the NL phenomenon of referential opacity violates strong transparency.
Merrill et al. (2021) theoretically proved that all strongly transparent languages are ℵ-emulatable. In other words, it is possible to learn to emulate the meaning of these languages with only assertion oracle access. The converse is not necessarily true, and hence there may be a weaker condition than strong transparency that also entails ℵ-emulatability.
In what follows, we study how their theoretical results are realized empirically. We examine in §3 if LM architectures and objectives can emulate the meaning of a strongly transparent language. In §4, we return to natural language, which is non-transparent, so Merrill et al.’s (2021) results do not predict its meaning emulatability.
3 How Well Do Language Models Fare?
While strongly transparent languages are in theory ℵ-emulatable, it is unknown if existing LM architectures, coupled with their pretraining objectives, are able to successfully achieve ℵ-emulation, or more intuitively, to learn their meaning.
To test this, we synthetically create a strongly transparent language based on propositional logic. We pretrain LMs with the same architecture and similar data scale as GPT-2 and RoBERTa on a generated pretraining corpus. We then train an equivalence probe to study if the pretrained representations enable ℵ-emulation. The probe is trained with a sentence pair binary classification objective and tested on unseen sentences sampled from the same grammar. Alternatively, we also try to directly evaluate the value of unseen sentences, without probe training. To isolate the effect of strong transparency, we also minimally perturb this language to be non-transparent and study how this affects emulatability.
We use a PCFG to construct our propositional logic dataset because its recursive nature and context-freeness bear some resemblance to natural language, and because it is convenient for sampling. The rules are specified in Eq. 1 and the probabilities are hand-designed. The denotation of an expression can be computed according to the conventional semantics of propositional logic, which, as argued in §2.5, makes Lt transparent. Figure 1 shows an example. See §A for more details.
Our CFG rules prevent the atomic sentences T and F from occurring in the corpus (and (T) and (F) too) and only allow compositional sentences. This ensures the absence of pretraining sequences like sentence=T and guarantees that there is no direct grounding to denotations during pretraining, but only indirect grounding via ℵ. This makes the task more difficult than the ℵ-emulation setup but more realistically transferable to natural language (§5).
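The sampling and evaluation steps can be sketched as follows. This is an illustration under stated assumptions, not the generation code used for the dataset: the fully parenthesized grammar in the comment is inferred from the example expressions in the text, and the rule probabilities, the depth cap, and the function names `sample_expr` and `evaluate` are hypothetical.

```python
import random

# Assumed grammar (inferred from the examples in the text, not copied from Eq. 1):
#   S -> (e∧e) | (e∨e) | (¬e)
#   e -> (e∧e) | (e∨e) | (¬e) | T | F
# Probabilities below are illustrative only.


def sample_expr(rng, depth=0, max_depth=4, p_leaf=0.4, top=True):
    """Sample one expression; the top level is never a bare literal."""
    if not top and (depth >= max_depth or rng.random() < p_leaf):
        return rng.choice(["T", "F"])
    op = rng.choice(["∧", "∨", "¬"])
    if op == "¬":
        return "(¬" + sample_expr(rng, depth + 1, max_depth, p_leaf, False) + ")"
    left = sample_expr(rng, depth + 1, max_depth, p_leaf, False)
    right = sample_expr(rng, depth + 1, max_depth, p_leaf, False)
    return "(" + left + op + right + ")"


def _parse(s, i):
    """Recursive-descent evaluation; returns (value, next_index)."""
    if s[i] == "T":
        return True, i + 1
    if s[i] == "F":
        return False, i + 1
    if s[i] != "(":
        raise ValueError(f"unexpected symbol at position {i}")
    i += 1
    if s[i] == "¬":
        value, i = _parse(s, i + 1)
        result = not value
    else:
        left, i = _parse(s, i)
        op = s[i]
        right, i = _parse(s, i + 1)
        if op == "∧":
            result = left and right
        elif op == "∨":
            result = left or right
        else:
            raise ValueError(f"unknown operator {op!r}")
    if s[i] != ")":
        raise ValueError(f"expected ')' at position {i}")
    return result, i + 1


def evaluate(s):
    """Denotation under conventional propositional logic semantics."""
    value, end = _parse(s, 0)
    if end != len(s):
        raise ValueError("trailing symbols after expression")
    return value
```

Because `evaluate` never consults anything outside the expression it is given, the denotation of every subexpression is independent of its context, which is the property that makes the language strongly transparent.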
The dataset has 819.2M pretraining sequences and 1M/10K/10K probe training/validation/test sentence pairs. All splits have disjoint sentences. The average sentence length is around 48.6. §I contains more details including tokenization.
We pretrain from scratch an autoregressive LM (ALM) and a masked LM (MLM), respectively simulating GPT-2-small and RoBERTa-base with their original architecture, objective, and, to the extent possible, hyperparameters. They have near-identical model size hyperparameters, leading to 86.8M ALM parameters and 87.0M for MLM. We sample sentence pairs (a, b) with the same denotation and format the pretraining sequences in the form of a=b, such as (T∧F)= (F∨F), simulating ℵ-access (but restricting queries to be sentences, a more challenging setup: see Eq. 2). §3.3 will discuss a necessary form of data augmentation. We train for 100K steps, 20% of RoBERTa-base’s training duration and hence data size, which we found sufficient for convergence on our data. §B summarizes hyperparameters.
3.3 Analysis: Probing Lt
Probing is a commonly adopted method to quantify the extent to which a representation encodes a particular type of linguistic information (Alain and Bengio, 2017; Liu et al., 2019a; Hewitt and Manning, 2019, among others). The representation is frozen, on top of which a lightweight classifier is trained to predict the information of interest. As shown in §2.4, this paradigm conveniently corresponds to the formalization in Merrill et al. (2021), and hence we use it to investigate whether or not pretrained representations encode sufficient semantic information for equivalence decisions.
We probe semantic equivalence from the pretrained models for pairs of unseen sentences. We embed each sentence separately through the pretrained model, taking the last token representation for ALM and the average for MLM. Voita et al. (2019) and Haviv et al. (2022) have shown that the positional information is diluted at the top transformer layers of MLMs, but it is crucial for the truth value in our language. We, therefore, take a weighted sum (a.k.a. scalar mix) of all layers to compensate for this in MLM. We also found that these simple methods for sentence representations sometimes do not perform well. We hence additionally consider a variant where the probe is an attention-weighted mixture of all token positions. We refer to these two representations as –Attn and +Attn, respectively. See §B for more on their details. We train a bilinear classifier probe on top of the sentence representations (Li et al., 2021) and evaluate it with accuracy on a held-out test set. For each setting, we train the same probe with five different random seeds and report their mean and standard deviation. We report hyperparameters in §B.
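The two ingredients named above, a scalar mix over layers and a bilinear classifier over a pair of sentence vectors, can be sketched in a few lines. This is a simplified illustration, not the probe described in §B: it omits training, the attention-weighted variant, and any global scale on the scalar mix, and all function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def scalar_mix(layer_vectors, layer_logits):
    """Weighted sum of per-layer sentence vectors with softmax-normalized weights."""
    weights = softmax(layer_logits)
    dim = len(layer_vectors[0])
    return [
        sum(w * vec[d] for w, vec in zip(weights, layer_vectors))
        for d in range(dim)
    ]


def bilinear_logit(x, W, y, bias=0.0):
    """Score x^T W y + bias for a pair of frozen sentence representations."""
    return sum(
        x[i] * W[i][j] * y[j] for i in range(len(x)) for j in range(len(y))
    ) + bias


def predict_equivalent(x, W, y, bias=0.0):
    """Binary decision; only W, bias, and the mix logits would be trained."""
    return bilinear_logit(x, W, y, bias) > 0
```

Only `W`, `bias`, and `layer_logits` carry trainable parameters here; the sentence vectors come from the frozen LM, which is what lets the probe's accuracy be attributed to the pretrained representation.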
Past work has cast doubt on whether probes faithfully reflect the representation’s encoding of the information of interest, or if they directly learn the task (Hewitt and Liang, 2019). This is an especially important issue here as our +Attn sentence representation injects additional trainable parameters compared to a simple (bi)linear classifier. To answer this question in our setting, we follow previous studies (Conneau et al., 2018; Tenney et al., 2019; Wu et al., 2021, among others) and train a randomly initialized and similarly frozen control model with the same architecture: If LMs emulate meaning similarly to Merrill et al.’s (2021) algorithm, we would expect the pretrained model to yield higher probing accuracy than the random model.
The Lt rows in the top two sections of Table 1 summarize the results. With a simple sentence representation (–Attn), the pretrained ALM achieves near-perfect probing accuracy for Lt, though MLM performs at chance level. An attention-based sentence representation enables 63.8% accuracy for MLM and improves ALM’s performance to 100%. Importantly, in this variant, the random baselines still perform at chance level, demonstrating that the additional parameters do not lead to an overly powerful probe. We discuss the accuracy differences between ALM and MLM in §5. These results demonstrate that pretraining enables meaning emulation, though the meaning representation can be more deeply encoded than what can be extracted with a (bi)linear probe. We note that it is expected that the performance of pretrained models does not reach 100%. While Merrill et al. (2021) showed its theoretical possibility, their setup assumes active learning with unlimited access to ℵ and allows the “probe” δ to be an arbitrarily powerful function, among other differences.
Table 1 (cell values not recovered in this copy). Column groups: ALM (à la GPT-2) and MLM (à la RoBERTa), each with Random and Trained sub-columns.
We found that independently sampling pretraining sequences results in unsuccessful emulation with probing performance at random. Instead, it is crucial to ground = with reflexivity and symmetry. We achieve this by augmenting the pretraining data: if a=b is a pretraining sequence, we ensure a=a, b=b (reflexivity), and b=a (symmetry) are too. This imposes a constraint on the pretraining data distribution that eases the learning of =’s meaning. Table 2 shows that both properties are important. We consider the implication in §5.
| | –Reflexivity | +Reflexivity |
| –Symmetry | a=b | a=b, a=a, b=b |
| +Symmetry | a=b, b=a | a=b, b=a, a=a, b=b |
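The four augmentation settings in the table above can be sketched as a single function. This is one possible way to enforce the constraint, assuming sentence pairs are already sampled with equal denotation; the function name `augment` and the de-duplication step are our own choices for illustration.

```python
def augment(pairs, reflexivity=True, symmetry=True):
    """Turn equal-denotation pairs (a, b) into pretraining sequences 'a=b'.

    With both flags on, every a=b is accompanied by b=a, a=a, and b=b.
    Duplicates are dropped while preserving first-seen order (an assumption).
    """
    sequences = []
    seen = set()
    for a, b in pairs:
        candidates = [(a, b)]
        if symmetry:
            candidates.append((b, a))
        if reflexivity:
            candidates.extend([(a, a), (b, b)])
        for left, right in candidates:
            sequence = f"{left}={right}"
            if sequence not in seen:
                seen.add(sequence)
                sequences.append(sequence)
    return sequences
```

For example, `augment([("(T∧F)", "(F∨F)")])` yields the four sequences of the +Symmetry/+Reflexivity cell, while turning both flags off reproduces the independently sampled baseline.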
3.4 Analysis: Direct Evaluation on Lt
The process of training a probe introduces additional complexity, such as ± Attn, that potentially complicates our analysis. Therefore, we also test a stronger condition where there is no additional classifier: Can the pretrained models evaluate expressions, without any further training (e.g., a probe)? For MLM, it is most straightforward to compare if the model assigns a higher probability to T or F in sentence=[MASK]. However, this is a sequence that never occurs in the pretraining corpus since a standalone T or F is not part of our language (Eq. 1). Therefore, we use five templates on the right-hand side that are minimal in our language: (T∧[MASK]), (F∨[MASK]), ([MASK]∧T), ([MASK]∨F), (¬[MASK]). For the first four templates, we expect the masked position to be filled with the truth value of the proposition, and the negated value for the last one. For ALM, we compare if the model assigns a higher probability to the sequence where [MASK] is filled in with T vs. F.
The bottom section of Table 1 shows the mean and standard deviation of the evaluation accuracy across our five templates. Without training, a random model always has 50.0% accuracy in expectation. Both ALM and MLM achieve a high evaluation accuracy, above 95%, corroborating the LMs’ capability to represent the meaning of Lt.
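The template construction can be sketched as follows, without any model in the loop. The sketch assumes the last template is the negation template, as the expected negated fill implies; the helper names are hypothetical, and scoring the candidates with an actual LM is left out.

```python
# Right-hand-side templates; {m} marks the position to be filled.
TEMPLATES = ["(T∧{m})", "(F∨{m})", "({m}∧T)", "({m}∨F)", "(¬{m})"]


def expected_fill(template, sentence_value):
    """Literal that makes 'sentence=template' a true equation."""
    negated = template.startswith("(¬")
    value = (not sentence_value) if negated else sentence_value
    return "T" if value else "F"


def mlm_queries(sentence, sentence_value, mask="[MASK]"):
    """(masked sequence, expected literal) pairs for a masked LM."""
    return [
        (f"{sentence}={t.format(m=mask)}", expected_fill(t, sentence_value))
        for t in TEMPLATES
    ]


def alm_candidates(sentence, template):
    """The two full sequences whose probabilities an autoregressive LM compares."""
    return {fill: f"{sentence}={template.format(m=fill)}" for fill in ("T", "F")}
```

The first four templates are identities on their free argument (T∧x, F∨x, x∧T, and x∨F all equal x), so the correct fill is the sentence's own value; only the negation template requires the opposite literal.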
These results respond to the argument in Bender and Koller (2020):
We let GPT-2 complete the simple arithmetic problem Three plus five equals. The five responses below [...] show that this problem is beyond the current capability of GPT-2, and, we would argue, any pure LM.
We showed that form-only supervision does allow such evaluation on a strongly transparent language, at least when the supervising data distribution satisfies symmetry and reflexivity.
Building towards non-transparent natural language, it is important to understand strong transparency’s effect on emulatability. We design a minimally perturbed version of Lt that is non-transparent, Ln. The syntax stays the same, but we change the semantics such that ¬ has a side effect: When followed by T or F, it inverts the meaning of these literals that occur in certain other environments. Specifically, each (¬T) node changes the meaning of all the literals T in its c-commanded subtree (i.e., the e subtree headed by the (¬T) node’s sibling, if there is one; Reinhart, 1976) to F. An additional (¬T) does not invert back. Similarly, (¬F) changes the meaning of the literal F to T. For example, in the sentence in Figure 1, the [...] node, so its meaning is changed to F. On the other hand, the (¬ [...] T because they do not constitute a c-command relation. This alternation is inspired by binding theory in generative grammar (Chomsky, 1981, 1983), where the (¬T) node is the binder that c-commands the bindee. Since the meaning of T and F now depends on the existence of a binder, Ln is non-transparent.
We conduct the same pretraining/probing/direct evaluation procedure on Ln. Table 1 reports the results. Non-transparency decreases ALM’s probing accuracy with both –Attn and +Attn, though not to random level. The variance across different probe training seeds also increases compared to Lt, indicating that the pretrained representation is less robust. Directly evaluating ALM with Ln similarly leads to both decreased average accuracy and increased variance. MLM, on the other hand, achieves random probing and evaluation accuracy. Overall, the lack of strong transparency reduces models’ meaning emulation ability, though not always to chance performance.
4 What About Natural Language?
While existing LM architectures and objectives are able to emulate the meaning of synthetic languages, it is unclear how these observations transfer to natural language (NL). Merrill et al. (2021) hinted that, since NL is non-transparent and likely more complex than their constructed non-emulatable language, it is probable that a pretraining procedure, even with ℵ-access, cannot emulate its meaning either. This, however, remained an untested hypothesis.
We formalize this intuition and prove that a specific NL phenomenon, referential opacity, makes NL non-transparent. This phenomenon has been widely studied in semantics (Quine, 1956; Kripke, 1972, among others), yet it has received little attention in modern NLP. We fill this gap from the perspective of strong transparency and study the representation of this phenomenon in modern LMs with a probing-based and a sentence similarity-based analysis.
4.1 Referential Opacity
To illustrate referential opacity, we use the classic example in semantics:
(1) a. Lois Lane believes Superman is a hero.
    b. Lois Lane believes Clark Kent is a hero.
Note that (a) and (b) have different truth conditions: Their truth values differ if Lois Lane does not know Superman and Clark Kent are the same person. Formally, ⟦Lois Lane believes Superman is a hero.|λ2⟧ ≠ ⟦Lois Lane believes Clark Kent is a hero.|λ2⟧. On the other hand, ⟦Superman|λ2⟧ = ⟦Clark Kent|λ2⟧. In other words, two expressions that have the same denotation, when embedded in the same context, yield sentences with different truth conditions. Such contexts are called referentially opaque, and, in this case, they are induced by a propositional attitude verb “believes” whose meaning depends on the cognitive state of its subject (Anderson and Owens, 1990).
Now we formalize referential opacity:
In natural language, an expression e is contextually valid in κ = ⟨l, r⟩ if none of ⟦l|λ, er⟧,⟦e|l, r⟧,⟦r|le, λ⟧ is ∅.
A context κ = ⟨l, r⟩ in natural language is referentially opaque if there exist expressions e1, e2, both contextually valid in κ, such that ⟦e1|λ2⟧ = ⟦e2|λ2⟧ and ⟦le1r|λ2⟧ ≠ ⟦le2r|λ2⟧.
Now, we prove that the existence of referentially opaque contexts implies non-transparency. We assume compositionality, for which we provide a working definition: ⟦ler|λ2⟧ = f(⟦l|λ, er⟧,⟦e|l, r⟧,⟦r|le, λ⟧) for some meaning composition function f. Intuitively, the proof shows that if all expressions have fixed meaning (i.e., are strongly transparent), referential opacity would not arise.
A compositional language with referentially opaque contexts is not strongly transparent.
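One way the argument can go, sketched from the definitions above (a reconstruction for illustration, not necessarily the authors' own proof): suppose, for contradiction, that the language is strongly transparent and that some context with left part l and right part r is referentially opaque, witnessed by e1 and e2. Contextual validity rules out ∅ for every component, so strong transparency forces each component's denotation to equal its denotation in the empty context. In the block below, \lambda^2 stands for the empty context written λ2 in the running text.

```latex
\begin{align*}
[\![ l \mid \lambda, e_1 r ]\!] &= [\![ l \mid \lambda^2 ]\!] = [\![ l \mid \lambda, e_2 r ]\!], \\
[\![ e_1 \mid l, r ]\!] &= [\![ e_1 \mid \lambda^2 ]\!] = [\![ e_2 \mid \lambda^2 ]\!] = [\![ e_2 \mid l, r ]\!], \\
[\![ r \mid l e_1, \lambda ]\!] &= [\![ r \mid \lambda^2 ]\!] = [\![ r \mid l e_2, \lambda ]\!], \\
[\![ l e_1 r \mid \lambda^2 ]\!]
  &= f\big([\![ l \mid \lambda, e_1 r ]\!],\, [\![ e_1 \mid l, r ]\!],\, [\![ r \mid l e_1, \lambda ]\!]\big) \\
  &= f\big([\![ l \mid \lambda, e_2 r ]\!],\, [\![ e_2 \mid l, r ]\!],\, [\![ r \mid l e_2, \lambda ]\!]\big)
   = [\![ l e_2 r \mid \lambda^2 ]\!].
\end{align*}
```

The middle equality in the second line uses the co-reference condition ⟦e1|λ2⟧ = ⟦e2|λ2⟧, and the last chain uses compositionality. The conclusion contradicts ⟦le1r|λ2⟧ ≠ ⟦le2r|λ2⟧, so a compositional language with a referentially opaque context cannot be strongly transparent.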
Therefore, as a non-transparent example in NL, we study whether referential opacity is reflected in the representation of current LMs.
We cast referential opacity as a sentence pair binary classification problem. We generate sentence pairs like Ex. 1 as our dataset. Ex. 1 consists of two parts that correspond to the two conditions in Def. 4: two co-referring expressions (⟦e1|λ2⟧ = ⟦e2|λ2⟧), and a referentially opaque context that embeds the entity (⟦le1r|λ2⟧ ≠ ⟦le2r|λ2⟧). Next, we separately introduce how we generate them. Our final dataset consists of 45K/6K/6K training/development/testing sentence pairs for GPT-2 and 97K/12K/12K for BERT. §C provides more details, including more fine-grained dataset statistics for different experimental settings below.
The co-referring expressions in Ex. 1 are proper names, “Superman” and “Clark Kent.” Not only is this hard to collect data for, but, due to the rigidity of proper names (Kripke, 1972), it is also theoretically more challenging to analyze as the classic intensionality framework is more difficult to apply (Von Fintel and Heim, 2011). We hence consider co-referring expressions that are one proper name and one definite description, such as “Yuri Gagarin” and “the first person in space,” which can be more straightforwardly accounted for with intensionality (Heim and Kratzer, 1998, §12; Von Fintel and Heim, 2011). We use the LAMA dataset (Petroni et al., 2019), specifically the T-REx split (Elsahar et al., 2018) following recent factual probing work (Jiang et al., 2020; Shin et al., 2020; Zhong et al., 2021), to obtain a list of such entities. To make sure the model representation captures the coreference, we follow Petroni et al. (2019) and use LAMA to prompt the LM with these equivalences and only keep entities that are correctly predicted.
We construct referentially opaque and referentially transparent contexts to embed these co-referring expressions. We only consider referential opacity involving propositional attitude verbs, where the context is referentially opaque iff its main verb conveys propositional attitude. There are other types of referential opacity, such as counterfactuals (Von Fintel and Heim, 2011; Kearns, 2011, §7) and substitutions that shift the syntactic status of constituents (e.g., Fine, 1990), that we omit in this work for simplicity, though they could be targets of future studies. We manually design two classes of templates, depending on the verb’s argument structure. The first has an embedded clause, e.g.,
(2) a. She wants to meet Yuri Gagarin.
    b. She wants to meet the first person in space.
The second contains only the main clause, such as
(3) a. He speaks Lao.
    b. He speaks the official language of Laos.
The two sentences in a pair only differ by the entity reference: one is a name and one is a definite description. A sentence pair is non-equivalent iff it has a referentially opaque context, or within our scope of study, iff its main verb is a propositional attitude verb. We gather the list of verbs from past linguistic studies and verify with native speaker judgment (see §C).
We consider GPT-2-XL and BERT-large-cased, the largest variants in these two families, as representative autoregressive and masked LMs. They have 1.5B and 340M parameters, respectively. We obtain sentence representations in the same way as in §3, except without attention-weighting and simply using the [CLS] embedding for BERT.
4.4 Analysis: Probing
We use the same bilinear probe as in §3 as a binary classifier over sentence pairs, determining the equivalence, or the referential transparency, of each pair. However, because of the lexicalized nature of referential opacity, the probe could easily overfit and recognize not their equivalence but the existence of a propositional attitude verb.
To overcome this, we introduce attractors (Linzen et al., 2016; Gulordava et al., 2018; Pandia and Ettinger, 2021, among others). We always conjoin a clause with a propositional attitude verb and one with a non-attitude verb, disallowing the aforementioned heuristics. The equivalence label now depends on whether the entity alternation occurs under the non-attitude verb, which would result in an equivalent sentence pair, or the attitude verb, which would lead to non-equivalence. For example:
(4) a. He speaks Lao and she wants to meet Yuri Gagarin.
    b. He speaks the official language of Laos and she wants to meet Yuri Gagarin.
(5) a. He speaks Lao and she wants to meet Yuri Gagarin.
    b. He speaks Lao and she wants to meet the first person in space.
Despite both examples having the same verbs, the sentence pair in Ex. 4 is equivalent, but Ex. 5 is not. We are not using attractors for out-of-domain evaluation; instead, the training and test sets are i.i.d., but we break down the test set performance by categories.
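The construction of attractor pairs can be sketched as below. This is a simplified illustration, not the generation code behind the dataset in §C: the function name `attractor_pairs`, the clause templates with a single `{}` slot, and the fixed clause order (non-attitude clause first) are assumptions.

```python
def attractor_pairs(non_attitude, attitude):
    """Build one equivalent and one non-equivalent coordinated sentence pair.

    Each argument is (clause_template, name, description), where the template
    has one '{}' slot for the entity reference.
    """
    na_template, na_name, na_description = non_attitude
    at_template, at_name, at_description = attitude

    def join(first, second):
        return f"{first} and {second}."

    base = join(na_template.format(na_name), at_template.format(at_name))
    # Alternation under the non-attitude verb: transparent context, equivalent.
    equivalent = (
        base,
        join(na_template.format(na_description), at_template.format(at_name)),
        True,
    )
    # Alternation under the attitude verb: opaque context, non-equivalent.
    non_equivalent = (
        base,
        join(na_template.format(na_name), at_template.format(at_description)),
        False,
    )
    return [equivalent, non_equivalent]


pairs = attractor_pairs(
    ("He speaks {}", "Lao", "the official language of Laos"),
    ("she wants to meet {}", "Yuri Gagarin", "the first person in space"),
)
```

Both pairs contain the same two verbs, so a probe that merely detects the presence of an attitude verb cannot separate the labels; it has to track which clause the alternation falls under.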
We train a probe on GPT-2-XL and BERT-large over 10 random seeds. Details are in §D. Table 3 reports the results. As expected, both models overfit with the attractor-less simple sentences, achieving perfect accuracy. With attractors in coordinated sentences, however, both models obtain near-random performance overall. Because the training and test sets are i.i.d., this means that semantic equivalence based on referential opacity cannot be probed in our setup from these two models, suggesting an inadequate representation of this phenomenon. Interestingly, both models tend to predict equivalence more than non-equivalence (more prominent with GPT-2 than BERT), likely due to the nuanced nature of this task: Without training, a human would likely judge equivalence on referentially opaque sentence pairs too. See §E for a set of experiments that show that LMs can potentially learn to capture referential opacity with semantic supervision following pretraining.
Table 3 (cell values not recovered in this copy). Columns: GPT-2 and BERT.
4.5 Analysis: Sentence Similarity
As in §3.4, the simplicity of a training-free analysis can be desirable. To this end, we directly measure the cosine similarity between the two sentence representations in a pair. While this semantic similarity would be high for both groups of sentences by our construction, equivalent sentence pairs should have more similar representations than those that are not. While factors other than semantics, such as syntax, also affect sentence representations, we strictly control them in our synthetic data generation to be identical between referentially transparent and opaque sentences. We do not consider attractor sentences (§4.4) in this analysis.
For significance testing, we employ an exact permutation test (Fisher, 1935) and a bootstrap test (Efron and Tibshirani, 1993) with 1,000 iterations, performed across verbs, where the test statistic is the difference between the averaged cosine similarity of the two groups. Both tests are two-sided with the null hypothesis being that the model representation does not distinguish between the two classes of verbs. For GPT-2-XL, the permutation test gives p = 0.64 and bootstrap gives p = 0.66, barring us from rejecting the null hypothesis. For BERT-large, they give p = 0.45 and p = 0.57 respectively, where we again observe no significant difference between the two classes. Nonetheless, we note that the inability to reject the null hypothesis does not entail it is true.
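The similarity measure and the permutation test can be sketched with the standard library alone. This is an illustration under assumptions rather than the analysis code: it takes one averaged cosine similarity per verb as input, enumerates every regrouping of verbs (feasible only for small verb lists), and leaves out the bootstrap test, whose exact variant is not specified in the text. The function names are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two sentence vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean(values):
    return sum(values) / len(values)


def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation test on the difference of group means.

    group_a / group_b: per-verb averaged similarities for the two verb classes.
    """
    pooled = list(group_a) + list(group_b)
    size_a = len(group_a)
    observed = abs(mean(group_a) - mean(group_b))
    hits = 0
    total = 0
    for indices in combinations(range(len(pooled)), size_a):
        chosen = set(indices)
        a = [pooled[i] for i in indices]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance so that ties with the observed statistic count as hits.
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            hits += 1
    return hits / total
```

A large p-value from `permutation_p_value` means that relabeling the verbs at random produces a gap at least as large as the observed one most of the time, which is the situation reported here for both models.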
Reimers and Gurevych (2019) noted that computing sentence pair cosine similarity using BERT’s [CLS] token, as we did, does not correlate well with textual similarity benchmarks. This phenomenon is commonly attributed to the anisotropic nature of pretrained representations (Ethayarajh, 2019). This does not undermine the validity of our method, which instead relies on the correlation between the cosine similarity and the model’s representation of semantic closeness. We ensure this correlation by controlling for all factors other than semantics (syntax, lexical choices, entities, etc.). Nevertheless, we also postprocess BERT’s [CLS] representation using BERT-flow (Li et al., 2020) which has been shown to increase the correlation with textual similarity benchmarks. We obtain a similar result: Bootstrap gives p = 0.49. While the two-sided permutation test gives p = 0.03 with potential significance, the one-sided version gives p = 0.99; in other words, the calibrated space represents opaque sentence pairs to be more similar than transparent ones, contrary to our expectation that equivalent sentence pairs should be closer in the representation space than non-equivalent ones when all other factors are controlled.
The results from these two sets of analyses in §4.4 and §4.5 are consistent and show no evidence of modern LMs representing referential opacity, demonstrating that they cannot fully emulate the meaning of NL. Our finding adds to recent observations that pretrained LMs do not represent semantic phenomena well (Tenney et al., 2019; Kovaleva et al., 2019; Wu et al., 2021, among others). Theoretically, it also strengthens the connection between strong transparency and meaning emulatability with NL-based empirical evidence.
Through analyses based on probing and direct evaluation, we have seen that existing LM architectures and objectives can learn to emulate the meaning of a strongly transparent language Lt when the training data reflects equivalence relations. While non-transparency (Ln) causes this ability to decrease, the trained models still outperform a random model in certain setups. We believe this result hints at the strength of current LM architectures and objectives. There seems to be a limit to this strength, though: in natural language, neither GPT-2 nor BERT represents the non-transparent phenomenon of referential opacity well.
Our results shed light on the relationship between the strong transparency of a language and whether its semantics can be emulated. We observed co-variation between the two: When slightly perturbed to be non-transparent, our logic language becomes harder to emulate; and there is no evidence for LMs representing the semantics of a non-transparent NL phenomenon. Nevertheless, the above-random emulation performance with Ln suggests that there could be language properties that potentially better predict emulatability, leaving room for future theoretical endeavors.
We also found that, in our setup and with a similar size and training procedure (§3.2), ALM is more suitable than MLM for representing the meaning of our propositional logic languages. ALM achieves better probing accuracy than MLM under both methods of obtaining sentence representations that we explored. Moreover, MLM completely fails to emulate meaning when facing non-transparency, whereas ALM does not. Ultimately, though, we hope to understand whether this difference transfers to natural language. Our NL investigation reveals that both ALM (GPT-2) and MLM (BERT) achieve chance-level probing performance on the one phenomenon that we inspected, likely due to its difficulty. It would be interesting for future efforts to further examine their differences, if any, in learning and representing the meaning of other NL phenomena.
Our results also lead to the question: Why can LMs achieve above-random results on Ln but not referential opacity? While it is entirely possible that the latter is simply more difficult than our synthetic non-transparency, there are other factors at play. First of all, natural language is much more variable than our synthetic language: Utterances can be untruthful (though they are in general governed by Gricean quality; Grice, 1975), subjective (such as our earlier claim about Corgis’ cuteness, §2.3), intensional (see Merrill et al., 2021 for a discussion), etc. But putting these variations aside, we saw from §3 that even the synthetic language requires an explicit grounding of = to enable emulation, and this is missing from NL pretraining. It is certainly not the case that, for every expression such as “Corgis are the cutest dogs.” that exists in the pretraining corpus, the variations “The cutest dogs are Corgis.”, “Corgis are Corgis.”, “The cutest dogs are the cutest dogs.” are also guaranteed to appear. So perhaps there needs to be a more foundational change in our pretraining objective. As Brown et al. (2020) foretold, “A more fundamental limitation of [...] scaling up any LM-like model [...] is that it may eventually run into (or could already be running into) the limits of the pretraining objective.” Our results point to one such possibility: We believe research into a more explicit representation of semantic relations in future pretraining processes, such as based on paraphrases, could be fruitful.
What we did not investigate, though, is whether partial equivalence grounding enables emulation: what if, for example, only 1% of the pretraining data has this form of grounding, while the rest does not? After all, the above format already exists for certain sentences in NL. This, too, could be an exciting future research question.
6 Related Work
Bender and Koller (2020) initiated the discussion on the possibility of a learner acquiring meaning from training on linguistic forms alone. From first principles, they argued for its impossibility. Empirically, Traylor et al. (2021) also found that LMs cannot represent lexical-level symbols well when the pretraining data is distributionally constrained to supply relevant signals. Merrill et al. (2021), on the other hand, proved theoretically that it is possible to emulate the meaning of strongly transparent languages with assertion oracle access. We showed in this work that, empirically, LMs also attain this capability. The work of Patel and Pavlick (2022) is also conceptually similar to ours, discovering that the internal representation of LMs is to a large extent isomorphic to the conceptual spaces of directions and colors. They adopted in-context learning (Brown et al., 2020; among others) to elicit the isomorphism, while we used the more traditional probing paradigm.
Another line of work has inspected the extent to which pretrained LMs encode various types of semantic information. Some have examined the representation of lexical semantics: Garí Soler and Apidianaki (2021) found that BERT representations reflect polysemy levels, and Vulić et al. (2020) showed that they also capture abundant type-level lexical knowledge. On the other hand, Ettinger (2020) and Ravichander et al. (2020) have discovered that pretrained LMs do not satisfactorily encode negation and hypernymy, respectively. Moving beyond the lexical level, Wu et al. (2021) demonstrated that pretrained BERT and RoBERTa models less readily surface semantic dependency information than syntactic dependencies, while Li et al. (2021) identified evidence of dynamic semantics representation in these models.
We have empirically shown that pretrained language models are able to emulate the meaning of a strongly transparent language through pretraining on an assertion-inspired format, but this ability deteriorates when the language is minimally perturbed to be no longer strongly transparent. Furthermore, we found no representation of referential opacity, which is significant for being a non-transparent natural language phenomenon, in pretrained LMs.
We thank the TACL reviewers and action editor for helpful feedback on this work. We thank Kyle Richardson, Jesse Dodge, and other members of AI2 for insightful discussions. This work was funded in part by NSF award 1922658. WM was supported by an NSF graduate research fellowship.
A Propositional Logic Dataset Details
We hand-designed the PCFG probabilities in Eq. 1. To expand an e, the two binary rules each have 0.06 probability under Lt. The ¬ rule and the expansions to T and F divide the remaining probability mass, with T and F having the same probability, each half that of the ¬ rule. As S does not expand to T or F, its other three rules split the probability mass proportionally. We consider each symbol of the language as a separate token for tokenization. We enforce a maximum length of 248 tokens. We sample all sentences without replacement. The average Lt sentence length is ≈48.6 tokens. Sampling Ln results in slightly longer sentences, so we decrease the binary rule probabilities to 0.03 each, but the specification is otherwise the same. The resulting Ln sentences have ≈51.7 tokens on average. We sample 819.2M pretraining sentences and 1M/10K/10K probe training/validation/test sentences. Then, for each split, we sample sentence pairs, with the same number as the number of sentences in that split.
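The probabilities described above can be worked out and used in a small sampler, sketched below. Several things here are assumptions rather than details from the text: the unary rule is taken to be negation, expressions are taken to be fully parenthesized, the rule names and helper functions are hypothetical, and the length limit is enforced by rejection. The sketch samples bare expressions only and is not expected to reproduce the reported average lengths.

```python
import random

# Rule probabilities for expanding e under Lt, following the description above.
BINARY = 0.06                 # each of the two binary rules
REMAINING = 1.0 - 2 * BINARY  # mass shared by the unary rule, T, and F
NEG = REMAINING / 2           # unary rule (assumed to be negation)
LITERAL = NEG / 2             # T and F each get half of the unary rule's mass

E_RULES = {"and": BINARY, "or": BINARY, "not": NEG, "T": LITERAL, "F": LITERAL}

# S cannot expand to T or F, so its three rules are renormalized proportionally.
S_MASS = 2 * BINARY + NEG
S_RULES = {"and": BINARY / S_MASS, "or": BINARY / S_MASS, "not": NEG / S_MASS}


def expand(rules, rng):
    """Recursively expand one nonterminal into a list of tokens."""
    rule = rng.choices(list(rules), weights=list(rules.values()))[0]
    if rule in ("T", "F"):
        return [rule]
    if rule == "not":
        return ["(", "¬"] + expand(E_RULES, rng) + [")"]
    operator = "∧" if rule == "and" else "∨"
    return ["("] + expand(E_RULES, rng) + [operator] + expand(E_RULES, rng) + [")"]


def sample_expression(rng, max_len=248):
    """Sample from S, rejecting anything over max_len tokens (an assumption)."""
    while True:
        tokens = expand(S_RULES, rng)
        if len(tokens) <= max_len:
            return tokens
```

Under these assumptions the unary rule receives 0.44 and T and F receive 0.22 each, so the five e rules sum to 1.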
B Propositional Logic Training Details
For pretraining, we mostly follow the original hyperparameters for GPT-2-small and RoBERTa-base. We train with batches of 8,192 sequences for 100k steps, equivalent to 1 epoch over our pretraining data. We use the AdamW optimizer (Loshchilov and Hutter, 2019) with ε set to 10⁻⁸ for ALM and 10⁻⁶ for MLM, and β₂ set to 0.95 for ALM and 0.98 for MLM. We set the learning rate to 6 × 10⁻⁴, warmed up over 10k steps, with a weight decay of 0.1.
For probing, +Attn trains a query vector that interacts with the key representation of each token, obtained with a trained key matrix transformation, and the resulting attention weights are used to average the token embeddings. We train all probes for 3 epochs with batch size 8 and 1,000 warmup steps, and select the checkpoint with the best validation accuracy. We use AdamW with a learning rate of 10⁻⁵, except for Ln–Attn ALM, which benefits from a different learning rate of 10⁻³. We clip gradients to unit norm.
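The +Attn pooling step described above can be sketched in plain Python as follows. The function name is hypothetical, and the use of an unscaled dot product followed by a softmax is an assumption, since the text does not specify how the attention weights are normalized. In an actual probe, the query vector and key matrix would be trained parameters.

```python
import math


def attention_pool(token_embeddings, key_matrix, query):
    """Average token embeddings with weights softmax(query . (key_matrix @ h_i))."""
    # Key representation of each token: k_i = key_matrix @ h_i.
    keys = [
        [sum(w * h for w, h in zip(row, embedding)) for row in key_matrix]
        for embedding in token_embeddings
    ]
    # One attention score per token: query . k_i.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Numerically stable softmax over tokens.
    top = max(scores)
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted average of the original token embeddings.
    dim = len(token_embeddings[0])
    pooled = [
        sum(w * embedding[d] for w, embedding in zip(weights, token_embeddings))
        for d in range(dim)
    ]
    return pooled, weights
```

With a zero query vector, all scores are equal and the pooling reduces to a plain average over tokens.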
C Referential Opacity Dataset Details
We detail the generation of our referential opacity dataset, separately discussing its two aspects (§4.2).
C.1 Generating Co-referring Expressions
For fact probing on LAMA, we use the prompt in the form “The official language of Laos is known as __” which we found appropriate for the entity types in T-REx. If the LM correctly predicts “Lao”, we consider this equivalence, or fact, captured by the model. As LAMA was designed to have 1-token answers with BERT’s tokenization, we let BERT fill in the blank. This is not a guarantee for GPT-2’s tokenization, so we run decoding for the same number of steps as the true answer’s length with beam size 5 and no sampling. To further ensure that the predictions are reliable and not due to noise, we only keep entity categories with overall prediction accuracy >25%. The resulting categories are “P37 official language”, “P364 original language of film or TV show”, “P140 religion”, “P103 native language”, and “P36 capital”. This procedure results in 1,606 facts for GPT-2 and 2,962 facts for BERT.
C.2 Generating Contexts
We generate two types of contexts (§4.2). The first type contains an embedded clause, for which we construct templates for each entity category in §C.1. For language entities, for example, one template is “[pronoun] [verb] to speak [entity].” A sentence pair is formed by filling in [entity] with a definite description vs. a proper name for a fact. We only consider the pronouns “She” and “He” in this work. We consider 6 referentially transparent verbs (“starts”, “begins”, “ceases”, “stops”, “managed”, “failed”) and 6 referentially opaque verbs (“wants”, “intends”, “hopes”, “begs”, “preferred”, “suggested”). The second type of context contains only the main clause. We use the referentially opaque template “[pronoun] dislikes [entity].” and an entity category-specific referentially transparent template such as “[pronoun] speaks [entity].” In total, we have 64,672 sentence pairs for GPT-2 and 121,768 for BERT.
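The embedded-clause template above can be filled in as sketched below. This is an illustration only: the function name is hypothetical, and the example fact (the proper name "Lao" paired with the definite description "the official language of Laos") is a stand-in built from the prompt in §C.1 rather than an entry taken from the released dataset.

```python
PRONOUNS = ["She", "He"]
VERBS = {
    "transparent": ["starts", "begins", "ceases", "stops", "managed", "failed"],
    "opaque": ["wants", "intends", "hopes", "begs", "preferred", "suggested"],
}
# Template for language entities, as given in the text.
TEMPLATE = "{pronoun} {verb} to speak {entity}."


def make_sentence_pairs(proper_name, definite_description):
    """Build (name sentence, description sentence, verb class) triples for one fact."""
    pairs = []
    for verb_class, verbs in VERBS.items():
        for verb in verbs:
            for pronoun in PRONOUNS:
                with_name = TEMPLATE.format(
                    pronoun=pronoun, verb=verb, entity=proper_name
                )
                with_description = TEMPLATE.format(
                    pronoun=pronoun, verb=verb, entity=definite_description
                )
                pairs.append((with_name, with_description, verb_class))
    return pairs


pairs = make_sentence_pairs("Lao", "the official language of Laos")
```

For a single fact and this one template, the sketch yields 24 pairs: 2 pronouns times 12 verbs, half with transparent verbs and half with opaque ones.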
For our probing analysis, we also included attractors with coordinated sentences (§4.4). As there are a quadratic number of possible coordinations, we subsampled 59,548 such sentences for GPT-2 and 119,540 for BERT, similar to the number of attractor-less sentences. We split all sentence pairs 8/1/1 for training/validation/testing.
For our similarity analysis, for a cleaner significance test, we only consider sentence pairs with an embedded clause. This leaves 58,776 sentence pairs for GPT-2 and 111,312 for BERT.
D Referential Opacity Training Details
The probe is trained similarly to §B, except for 1 epoch with batch size 256 and learning rate 10⁻⁵.
E Can Language Models Learn to Represent Referential Opacity With Appropriate Supervision?
We showed in §4 that we do not observe evidence of pretrained language models representing the phenomenon of referential opacity. A natural question, then, is whether language models can learn to represent it. Following a setup similar to that of Lyu et al. (2022) and Liu et al. (2019b), we finetune the entire model on a portion of our training set for 1 epoch and conduct the same probing procedure on the resulting model. All training is done with the coordinated data introduced in §4.4. Finetuning uses the same hyperparameters as in §D. As in §4.4, we report the mean and standard deviation across 10 random seeds for each setting.
We plot the probing accuracy against the number of finetuning examples in Figure 2. Both GPT-2 and BERT continue to be unable to perform above-random with up to 10,000 finetuning examples, further demonstrating their inadequate semantic representation of referential opacity. Nevertheless, with enough finetuning examples, both models eventually achieve near-100% probing accuracy. It is, therefore, possible that they can learn to represent referential opacity with sufficient semantic supervision, though we note a caveat: while we introduced coordinated data to prevent an obvious shortcut that the model could take (§4.4), it does not eliminate all possible shortcuts. It could be the case that the additional capacity afforded by finetuning enables the model to exploit a more sophisticated shortcut (unknown to us) instead of truly capturing this phenomenon.
This inferentialist perspective can be contrasted with denotationalism, which says that “understanding” is the task of mapping an expression to a logical representation of its meaning (Speaks, 2021, §2.2.3). Inferentialism implicitly underlies natural language inference-based evaluation of NLP models (e.g., Bowman et al., 2015).
We overload L to represent both the surface form and a mapping between form and denotation.
Assuming that propositions are more frequently true than false, which tends to be the case pragmatically (Grice, 1975).
Consider, for example, a finite non-transparent language whose denotation space can be learned by enumeration.
There are aspects of natural language that a PCFG does not capture, such as recursion constraints (Karlsson, 2010) and non-context-free phenomena (Shieber, 1985). Nevertheless, the goal of this research question is not to maximally simulate NL, but rather investigate the distributional learnability of compositional semantics. Future work could investigate the effect of moving away from a strict PCFG.
We do not follow BERT because next sentence prediction is not applicable here, but they are otherwise similar.
The lack of a next sentence prediction task (Fn. 6) leads to no supervision for a [CLS] token.
Formally, μ’s output contains all layer representations.
The ALM Trained +Attn setting on Ln has a degenerate seed that led to around 50% accuracy, hence the large variance. It is possible that additional configuration-specific hyperparameter tuning, which we did not perform, could reduce this instability.
With additional linear layers, it could go up to 83.4%±2.0 while the random model still performs at chance level. We did not include this in Table 1 for consistency with other settings.
Reflexivity states that a = a, and symmetry a = b ⇒ b = a. Equality further requires transitivity: a = b ∧ b = c ⇒ a = c, but it is not tested in our probing setup and we found it unimportant for probing accuracy in preliminary experiments.
This is a straightforward way to introduce a ¬ with side effect to a hierarchical structure. An alternative is to rely on a linear structure and invert all literals linearly following ¬. Nevertheless, our version leverages the hierarchical reasoning that the model originally needs to possess to evaluate an expression, while the alternative requires a new, linear type of reasoning. So that change would be less minimal.
Deictic expressions are another example, though they have been extensively studied under coreference resolution.
In this section we consider the language L to be English, or any NL that exhibits this phenomenon, and ⟦·|·⟧ to be intensions (§2.2). We drop the subscript L for brevity.
It is possible to argue that ⟦Superman|λ2⟧ ≠ ⟦Clark Kent|λ2⟧ if we consider their intension to be different. Nevertheless, we adopt the view of Heim and Kratzer (1998, §12.3) to not introduce intensionality by default (i.e., with κ = λ2), but rather to evoke it by context: “The usual denotations are extensions. But for nonextensional contexts, Intensional Functional Application allows a switch to intensions. The switch is triggered by particular lexical items [...]”.
This is a technical detail needed for proving Theorem 1.
Though see Shabasson (2018) for a theorization.
Consider, for example, if this person is Yuri’s neighbor and wants to meet him for dinner, but, being an avid flat-earther, is not fond of space traveling and is unaware that he has been to space. She would say she wants to meet Yuri Gagarin but has no interest in meeting the first person in space.
Another option is to have disjoint training and testing verbs. This did not work in preliminary experiments because verbs that induce referential opacity are semantically closer, as they always convey propositional attitude. So the model could use this similarity in the word embedding space to extrapolate.
There might still be other more complex heuristics, but even so, the probe still fails. Hence we do not need additional attractors to rule out all possible heuristics.
Though, with training, it is relatively straightforward to perform this task for a human, so it is reasonable to test the ability in LMs.
Especially since our setting is more challenging than Merrill et al.’s (2021) algorithm, without their unlimited ℵ-access, active learning, arbitrarily powerful δ, etc. Plus, we restrict ℵ queries to be sentences and disallow comparing a sentence with T or F using ℵ.
This work was done when Zhaofeng Wu was at AI2. Our code and trained models are released at https://github.com/ZhaofengWu/transparency.
Action Editor: Marco Baroni