Abstract
Performance on a dataset is often regarded as the key criterion for assessing NLP models. I argue for a broader perspective, which emphasizes scientific explanation. I draw on a long tradition in the philosophy of science, and on the Bayesian approach to assessing scientific theories, to argue for a plurality of criteria for assessing NLP models. To illustrate these ideas, I compare some recent models of language production with each other. I conclude by asking what it would mean for institutional policies if the NLP community took these ideas on board.
1 Introduction
Much recent work in Natural Language Processing (NLP) has focused on the performance of NLP models, as measured by various intrinsic or extrinsic evaluations, neglecting other vitally important dimensions of model quality. I argue that this practice risks building models that are ad hoc, implausible, and unwieldy, that are not connected to existing insights, and that may not generalize to other domains, genres, and applications. I argue that an antidote to these risks can be found in a widely accepted view of scientific explanation, and can be harnessed through a Bayesian perspective on assessing scientific hypotheses and theories.
I will try to be generic, covering NLP theories and models of every kind, using the word “model” throughout; following the terminology of Sun (2008), this will include both process models (which aim to mimic mental processes) and the more ubiquitous product models (which focus on the relation between inputs and outputs, remaining agnostic about how these outputs are produced). I start by focusing on models that are constructed with the aim of expressing, testing, and ultimately enhancing humankind’s grasp of human language and language use (i.e., “NLP-as-Science”). In Section 4, I will argue that this reasoning extends to many practically oriented types of NLP, too.
One might think that a lot of work in NLP already focuses on explanation, because it aims for explainability (e.g., Lyu, Apidianaki, and Callison-Burch 2023; see also Ghassemi, Oakden-Rayner, and Beam 2021). This argument, however, confuses (1) explaining a natural phenomenon (e.g., an aspect of language use) and (2) explaining a model (i.e., a piece of software). The difference is starkest when the model doesn’t match the phenomena well. Suppose a model classifies student essays into good (pass) and bad (fail). Suppose the model has terrible performance but admirable explainability, for instance via computer-generated “rationales” that highlight essay fragments that were particularly important for each classification decision (as in Lei, Barzilay, and Jaakkola 2016, for example). These rationales can be useful for a stakeholder who wonders whether to trust the model’s decisions, or for a developer wanting to improve it. Yet they cannot tell us what makes an essay good or bad, because (we assumed) the model does not know the difference.
2 Dimensions of Explanatory Value
Explanation lies at the heart of the scientific enterprise (Overton 2013; Woodward and Ross 2021; Hepburn and Andersen 2021). But what is scientific explanation? Most theorists believe that scientific explanation should involve a reduction of something not yet known (e.g., previously unobserved facts) to something known, such as an existing law or model. A famous version is the Deductive-Nomological Theory (DNT). The DNT (also known as the Deductive-Nomological Model), which originated in the 1940s and was further elaborated in Hempel and Oppenheim (1965) and Hempel (1965), asserted that an explanation of data D (e.g., two billiard balls colliding) should take the form of a model M (e.g., Newton’s laws of motion) that meets two requirements: first, D should follow logically from M, and second, M should be true.
These ideas may be similar to everyday notions of explanation, yet it is worth noting that Hempel’s perspective on explanation is more scientific than psychological: It does not require that an explanation should give humans the feeling that they understand the data; it does not even require that M should be intelligible to us at all.
Later offshoots of DNT, also discussed in Hempel (1965), have replaced the notions of truth and logical consequence (i.e., the “follow logically from” relation above) by their probabilistic counterparts. In particular, Hempel’s Inductive-Statistical Theory (IST) requires that, firstly, D should be probable given M (compared to other models) and, secondly, M itself should be probable. This idea is most at home in a Bayesian conception of probability,1 which allows probabilities to be based on any kind of information, instead of only frequencies. In what follows, I will flesh out these ideas a bit more, focusing on NLP models. I will argue that IST gives rise to three dimensions of model quality, which I will call performance, indirect support, and parsimony.
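Stated compactly, and anticipating the notation of the Bayesian discussion below, the two IST requirements amount to the following (this rendering is mine, not Hempel’s):

```latex
% The two IST requirements, restated probabilistically (notation mine, not Hempel's):
\[
\text{(i)}\;\; P(D \mid M) \text{ should be high, relative to rival models;}
\qquad
\text{(ii)}\;\; P(M) \text{ should be high.}
\]
```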
1. Performance. The first IST requirement, that data D should be probable given a model M, is typically assessed through the performance of the model M on D, by means of metrics such as BLEU or MoverScore (Sai, Mohankumar, and Khapra 2022; Celikyilmaz, Clark, and Gao 2020), which allow us to compare a model’s predictions with a gold standard. From our present perspective, evaluations based on human judgments (van der Lee et al. 2019) and task performance are likewise varieties of performance. If model A has better performance than model B then, other things being equal, A has greater explanatory value (in the sense of Hempel’s IST) than B, because its predictions are more reliable, at least on data drawn from the same distribution as the test data.
However, performance is best seen as a broader concept, which includes a range of ways to test a model; I will use the term direct support to refer to these collectively. One type of direct support is replication. Replications can vary considerably. For example, they may or may not be conducted in a different environment, by other researchers, and with a different data sample. Replications are important because they can reveal problems with the original experiment (Belz 2022).
Often, our ultimate aim with a model is not to shed light on just one set of data, but on a much wider class of phenomena (see also Section 4). It is therefore often necessary to test a model on data drawn from different distributions. Suppose, for example, a caption generation model in the tradition of Hodosh, Young, and Hockenmaier (2013) and Agrawal et al. (2019) is initially tested on a set of outdoor photos, showing good performance. To explore the reach of the model, one could test the model on a corpus of indoor photos, with captions in a different language, and where the captions serve a different practical purpose than in the original experiment; in the terminology of Hupkes et al. (2022), these experiments would count as cross-domain, cross-lingual, and cross-task generalizations. To complicate the picture, generalizations can change the model. For example, when a neural “foundation model” is fine-tuned to perform a new task, this can be considered a generalization of the model as well, and this perspective has given rise to new metrics in transfer learning and domain adaptation, which aim to assess how easy it is to transfer knowledge learned from one task to another (e.g., Tsuong et al. 2020; Tan, Li, and Huang 2021).
2. Indirect Support / Theory Support. The ideas discussed so far are not always adhered to, but they are widely accepted in principle. The same cannot be said of the second IST requirement, namely that the model should itself be plausible. It is a well-attested finding that human problem solvers often forget to consider the a priori probability of an idea when new evidence is considered (e.g., Bar-Hillel 1980). Bayesians have argued that when a model is assessed, the same should apply: One should consider the probability of the model not just in light of the data, but in light of everything else we know or believe (Jaynes 2003; Clayton 2022). The fallacy of “base-rate neglect” (failing to take the a priori probability of a model into account) arises, for example, when Null Hypothesis Significance Testing is performed in such a way that P(Data|Hypothesis) is computed without also taking into account the (im)plausibility of the Hypothesis, leading to incorrect estimates of P(Hypothesis|Data).2 I will call the extent to which a model is supported by our existing knowledge indirect support, or theory support (because existing theories are often involved). Many scientists accept that indirect support is important, although what role it should play can be a matter of fierce debate.3
Indirect support can come from many sources. For example, if a model M based on BERT (Devlin et al. 2019) is good at predicting what type of expression human speakers use, and if linguistic theory suggests that the wider linguistic context affects this choice, then the fact that BERT was constructed to handle context dependencies is indirect support for M as a model of the phenomenon in question.
Because language use is a human activity, indirect support in NLP can sometimes involve considerations of cognitive plausibility. Suppose, for example, two Natural Language Generation (NLG) models have similar performance, but one of them is more in line with existing insights into the human language production architecture (e.g., as per Vigliocco and Hartsuiker 2002). That model’s alignment with psychological theory then lends it indirect support if the aim of the work is to shed light on human language processing; if the aim is otherwise, the alignment is irrelevant. In the case study of Section 3.1, we shall encounter a situation where the support for a set of NLP models hinges on changing insights into the limits of human rationality.
3. Parsimony. An idea that comes up frequently when scientific theories are compared is parsimony: Other things being equal, a simpler model is a better model. Parsimony is closely related to Occam’s Razor and to the idea that models should be as elegant as possible (see, e.g., Greene [2000] for a defense).
To complicate matters, parsimony can concern different aspects of a model; accordingly, it has been motivated in different ways. (See Fitzpatrick [2022] for a survey.) Accounts that focus on the logical strength of the postulates used by the model, for example, have been defended on the grounds that, because models that are more parsimonious in this sense are logically weaker than less parsimonious ones, they have a greater probability of being true: for if M1⊧M2 (e.g., because M1 adds a postulate to M2), then P(M1) ≤ P(M2). While this is a sensible idea, NLP models tend not to be formulated as conjunctions of postulates. Rule-based models of syntax are a case in point: rather than counting postulates, some syntacticians sign up to a version of parsimony that is based, essentially, on measuring the size of a grammar (Akmajian and Heny 1975; Brody 1995).4
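The reasoning behind this version of parsimony can be spelled out in one line (a standard probabilistic observation, not specific to NLP):

```latex
% If M1 entails M2 (M1 says everything M2 says, and more), then M1 holds in only
% a subset of the situations in which M2 holds, so by monotonicity of probability
% the logically stronger (less parsimonious) model is never more probable:
\[
M_1 \models M_2
\;\Longrightarrow\;
\{\, \omega : M_1 \text{ holds in } \omega \,\} \subseteq \{\, \omega : M_2 \text{ holds in } \omega \,\}
\;\Longrightarrow\;
P(M_1) \le P(M_2).
\]
```

In words: adding a postulate can only shrink the set of situations in which a model is true, and hence can only lower its prior probability.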
Parsimony of neural models could focus on various aspects of the model, including its overall architecture, the number of hidden layers, the number of parameters set and learned, the number of training epochs, and the amount of training data. Implicitly, the idea of parsimony is acknowledged in neural practices such as the induction of “causal” models (Geiger et al. 2021), knowledge distillation (Sanh et al. 2019), and pruning (Tessier 2021), where the aim is to get rid of parameters or layers that do not add to a neural model’s performance. Invoking parsimony as a factor in choosing between models is less common.5 Discussions in other sciences suggest that doing so can be risky, particularly when a complex model has better performance than its simpler competitors. Hossenfelder, for instance, has argued that parsimony has played too much of a role in discussions of string theory (Hossenfelder 2018): why, after all, should nature care about simplicity? Nonetheless, the idea that parsimony enhances the value of a model is widely accepted in physics too, because unless we insist on some form of parsimony, a model could be called highly explanatory even if it were nothing more than a huge collection of isolated facts; this would be counter-intuitive, because a model is of little scientific interest unless it allows us to compress data in some way or other.
A Bayesian perspective can help to elucidate how these dimensions relate to each other. Let D be the data on which a model M is tested, and X “everything else we know” (a common Bayesian construct). As scientists—and as engineers too, as I will argue in Section 4—we want to know the plausibility of a model M (in theory: every possible model M) in light of both X and D; that is, we want to know P(M|D, X). How do we get there? Starting with the two requirements inherent in IST, the performance of a probabilistic model M may be based on an assessment of P(D|M, X), the probability the data D would have if M and X were true. As we have seen, performance may also involve direct support (i.e., replications and generalizations), which is essentially Bayesian update, where n subsequent datasets D1, ..., Dn are brought to bear on M, yielding P(M|D1, ..., Dn, X). Indirect support for M is P(M|X), the plausibility of M in light of everything we know (before considering D1, ..., Dn). Now Bayes’ Theorem tells us that P(M|D, X) = P(D|M, X) * P(M|X) / P(D|X). Since P(D|X) is the same for every model, it follows that P(M|D, X) (i.e., the value we’re interested in) depends on performance (i.e., P(D|M, X)) and indirect support (i.e., P(M|X)) alone. It is P(M|X) that is often overlooked in NLP. Parsimony, finally, can be seen as the probability P(M) of a model before any data or other information about the world are considered. It is a component of P(M|X), but it can be examined separately. Models can be compared in terms of Solomonoff’s Prior (see Solomonoff 1964; Hutter, Legg, and Vitanyi 2007; and Li and Vitanyi 2008, Chapters 4 and 5), for example; the idea is that, once all models are encoded in the same way, the parsimony of a model is a function of the length of its encoding.6 In other words, our three dimensions can be seen as a closely knit family with roots in Bayesian as well as classical philosophy of science.
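For readability, here is the same decomposition in display form, with the dimensions labeled (the labels are mine; the formulas merely restate what the paragraph above says):

```latex
\[
\underbrace{P(M \mid D, X)}_{\text{what we want to know}}
\;=\;
\frac{\overbrace{P(D \mid M, X)}^{\text{performance}}
      \;\cdot\;
      \overbrace{P(M \mid X)}^{\text{indirect support}}}
     {P(D \mid X)}
\qquad\text{and, cumulatively,}\qquad
P(M \mid D_1, \ldots, D_n, X) \;\propto\; P(D_1, \ldots, D_n \mid M, X)\, P(M \mid X),
\]
```

where the cumulative form captures direct support (replications and generalizations D1, ..., Dn brought to bear on M), the denominator P(D|X) is the same for every model, and parsimony can be read as the prior P(M) that feeds into P(M|X).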
3 Case Study: Two Types of Referring Expression Generation
To illustrate both the usefulness and the pitfalls of assessing NLP models in terms of our three dimensions, I examine two types of referring expression generation (REG). I choose REG because referring is an essential part of human communication that has been studied from many different angles, using very different types of models. The performance of REG models has been tested extensively, and the outcomes of these tests will inform our discussion of the explanatory value of these models. We discuss the two types of REG one by one, then we reflect on some lessons learned (Section 3.3).
3.1 Generating Referring Expressions in a Visual Domain
REG has been studied intensively in NLG (Dale 1989; Dale and Reiter 1995; Krahmer and van Deemter 2012; Yu et al. 2016; Luo and Shakhnarovich 2017) and elsewhere (van Deemter 2016). A dominant research question in this area is: given a visual scene composed of objects, and without any linguistic context, what properties do human speakers use when they refer to one of the objects in the scene? For example, when they call an object “the ball that is red”, they express the properties ball and red. Here we concentrate on models that emerged from controlled experiments involving artificial scenes whose objects have well-understood properties (shapes, colors, sizes, etc.) that can be manipulated precisely by the experimenter and presented to participants on a computer screen. Such experiments trade away some of the complexity of real-world scenes to allow a maximum of experimental control.
We compare two models. One is an application (which I will call RSA-REG) of Frank and Goodman’s highly general Rational Speech Act (RSA) model (Frank and Goodman 2016, 2012).7 RSA is a formalization of the Gricean idea that communication is always optimally rational; RSA-REG interprets this as meaning that a speaker model should emphasize efficiency: The probability that a property is chosen for inclusion in a referring expression is proportional to its discriminatory power (i.e., to the proportion of scene objects to which the property does not apply). As a result, RSA-REG tends to favor referring expressions that are efficient, and hence “rational”.
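To make the property-choice rule concrete, here is a minimal sketch that follows the description above literally (inclusion probability proportional to discriminatory power). It is an illustration only, not Frank and Goodman’s published equations, and the example scene is invented.

```python
# Minimal illustration of the RSA-REG idea described above: the probability of
# including a property is proportional to its discriminatory power, i.e., the
# proportion of scene objects to which the property does not apply.
# This is a sketch for illustration, not Frank and Goodman's published model.

def discriminatory_power(prop, scene):
    """Proportion of scene objects that lack the property."""
    return sum(prop not in obj for obj in scene) / len(scene)

def property_choice_distribution(target, scene):
    """P(property) over the target's properties, proportional to discriminatory power."""
    powers = {p: discriminatory_power(p, scene) for p in target}
    total = sum(powers.values())
    return {p: (power / total if total else 0.0) for p, power in powers.items()}

# Invented example: refer to a red ball among a red cube and a blue ball.
scene = [{"ball", "red"}, {"cube", "red"}, {"ball", "blue"}]
target = scene[0]
print(property_choice_distribution(target, scene))
# 'ball' and 'red' each rule out one of the three scene objects, so each gets probability 0.5
```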
The other model grew out of a research tradition associated with the notion of Bounded Rationality, which is skeptical about the idea that speakers routinely compute discriminatory power for all the properties they consider for inclusion in their referring expressions. A well-known version of this experimentally well-supported idea (see, e.g., Belke and Meyer 2002) is the Incremental Algorithm of Dale and Reiter (1995), which assumes that some properties are intrinsically more “preferred” than others, and hence used more liberally. New findings (e.g., Koolen et al. 2011; Van Gompel et al. 2019) led to a new model in this tradition, called PRobabilistic Over-specification (PRO). PRO combines elements of the Incremental Algorithm with discriminatory power and a separate mechanism for over-specification. We compare PRO with RSA-REG.
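For contrast, the following schematic shows, in a few lines, the kind of ingredients attributed to PRO above: a preference order over attributes, attention to discriminatory power, and a separate over-specification step. It is emphatically not the published PRO pseudo-code (which, as noted below, is considerably longer); the attribute names and the over-specification probability are invented.

```python
# Schematic of the ingredients attributed to PRO in the text: a preference order,
# discriminatory power, and a separate over-specification mechanism.
# NOT the published PRO pseudo-code; names and numbers are invented.
import random

PREFERENCE_ORDER = ["colour", "type", "size"]   # hypothetical preference ranking

def pro_like_sketch(target, distractors, p_overspecify=0.3):
    chosen, remaining = [], list(distractors)
    for attr in PREFERENCE_ORDER:                # incremental, preference-driven scan
        value = target[attr]
        rules_out = [d for d in remaining if d.get(attr) != value]
        if rules_out:                            # the property helps discriminate: include it
            chosen.append((attr, value))
            remaining = [d for d in remaining if d.get(attr) == value]
        elif random.random() < p_overspecify:    # otherwise, over-specify with some probability
            chosen.append((attr, value))
    return chosen

# Invented example: a large red ball among a small red ball and a large blue cube.
target = {"colour": "red", "type": "ball", "size": "large"}
distractors = [{"colour": "red", "type": "ball", "size": "small"},
               {"colour": "blue", "type": "cube", "size": "large"}]
print(pro_like_sketch(target, distractors))
```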
Comparison 1: Performance. Van Gompel et al. (2019) reported an experiment in which the PRO model outperformed the other algorithms in terms of the human-likeness of its output. As for direct support, algorithms in the bounded rationality tradition have often been tested (Gatt and Belz 2010), but direct support for RSA-REG does not yet reach the level enjoyed by the other models.
Comparison 2: Indirect Support. At first sight, there is much indirect support for RSA, given the intuitive plausibility of describing behavior as rational. On the other hand, behavioral economists have shown that human decision makers are subject to time and memory limitations that necessitate shortcuts (Elster 1983; Simon 1991; Gigerenzer and Selten 2002; Gershman, Horvitz, and Tenenbaum 2015), and to many other deviations from rationality (Kahneman and Tversky 2013). REG evaluation experiments are broadly in line with these ideas (van Deemter 2016; Van Gompel et al. 2019); consequently, it may be argued that PRO matches theoretical results better than RSA-REG, and hence has better indirect support. This debate, however, is still ongoing.
Comparison 3: Parsimony. RSA-REG can be summarized in just two simple equations; by contrast, PRO’s pseudo-code needs about a page. Since the two models are otherwise similar, it seems fair to say that RSA-REG is more parsimonious than PRO.
3.2 Generating Referring Expressions in Context
REG-in-Context is another well-studied area of NLG. It focuses on co-reference in discourse, and it often starts from texts in which all referring expressions (REs) have been blanked out. The task is to predict, for each of these blanks, what RE should fill it. Other than the identity of the referent, the main information for the model to consider consists of the sentences surrounding the RE, because this context guides the choice between pronouns, proper names, and descriptions. The entities mentioned in the text play a role similar to that of the objects displayed on a computer screen in the previous section.
A long tradition of linguistic research has led to theories such as accessibility theory (Ariel 1990), the givenness hierarchy (Gundel, Hedberg, and Zacharski 1993), and Centering Theory (Brennan 1995). These theories emphasize the effect of the recency of the antecedent (e.g., in terms of the number of intervening words), its animacy (animate/non-animate), and the syntactic structure of the sentences (e.g., does the RE occur in the same syntactic position as the antecedent?). Computational accounts can be classified in terms of whether they use (1) handwritten rules, (2) hand-coded features and Machine Learning, or (3) an End2End neural architecture.
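As an illustration of what “hand-coded features” means here, the sketch below lists features of the kind these theories suggest; the particular feature names are my own shorthand and do not reproduce the feature sets of any specific model discussed below.

```python
# Illustrative, hypothetical feature set for REG-in-Context, inspired by the
# factors mentioned above (recency, animacy, syntactic structure). The names
# are my shorthand, not the exact features of RREG-S/L or ML-S/L.

def re_features(mention, antecedent):
    """Map a mention and its antecedent (plain dicts) to features that may
    govern the choice between a pronoun, a proper name, and a description."""
    return {
        "recency_in_words": mention["position"] - antecedent["position"],   # distance to antecedent
        "animate_referent": antecedent["animate"],                          # animacy
        "same_grammatical_role": mention["role"] == antecedent["role"],     # syntactic parallelism
        "antecedent_was_subject": antecedent["role"] == "subject",
    }

# A rule set, a classic ML classifier, or a neural network then maps such
# features (or raw text, in the end-to-end case) to an RE form.
features = re_features(
    mention={"position": 57, "role": "subject"},
    antecedent={"position": 41, "role": "subject", "animate": True},
)
print(features)
```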
A wide range of models were recently compared in terms of their performance on this task (Same, Chen, and Van Deemter 2022). Models included (1) two rule-based ones, RREG-S (small) and RREG-L (large); (2) two models based on traditional Machine Learning (ML), called ML-S (small) and ML-L (large); and (3) three neural models, namely two from Cunha et al. (2020) and one from Cao and Cheung (2019).
Comparison 1: Performance. Performance figures for these models have been reported before: the feature-based models were tested in the GREC evaluation campaign (Belz et al. 2009) on a corpus of Wikipedia texts, and the neural models on Ferreira et al.’s (2018) version of the WebNLG corpus. As for direct support, Same, Chen, and Van Deemter (2022) tested each of these models on WSJ, the Wall Street Journal portion of the OntoNotes corpus, arguing that the texts in WebNLG (Gardent et al. 2017) were too short to tell us much about referring expressions in context. In this generalization experiment, ML-L outperformed all other models, thereby diminishing the credentials of neural models of REG-in-Context while boosting those of feature-based models.
Comparison 2: Indirect Support. Indirect support varied widely across models, with larger models receiving the most support from the linguistics literature. RREG-L, for instance, rests on notions such as local focus (Brennan 1995) and syntactic parallelism (Henschel, Cheng, and Poesio 2000); ML-L makes use of the grammatical role of the RE.
Comparison 3: Parsimony. We have seen that parsimony can focus on different aspects of a model. The present set of models are alike in most respects but, as observed in Same, Chen, and Van Deemter (2022), the two rule-based models only have the current and previous sentence available to them as input; the two ML-based models look at the current and all previous sentences; the three neural models have the entire text available to them. The two “large” models, RREG-L and ML-L, contain more features than their smaller counterparts and are consequently less parsimonious.
3.3 Lessons from This Case Study
Our case study shows that the dimensions of model quality proposed in Section 2 can help one think clearly about NLP models and compare them fairly with each other. We saw how indirect support may or may not go hand in hand with superior performance. And although our first case study suggested a trade-off between parsimony and performance, in which an improvement in the latter was “bought” by sacrificing the former, the second study shows this is not always the case. It seems to me that attention to these dimensions is particularly beneficial in a case like that of indirect support in Section 3.1, where it was debatable which model is most in line with current theories of rationality; such debates are important to have, and a focus on our three dimensions would stimulate them. Some challenges have come to the fore as well:
Performance. Direct support is harder to judge for younger models (such as RSA-REG in Section 3.1), because they have been subjected to less scrutiny than older ones. Furthermore, when a model is subjected to new tests, for instance during generalization, it may be modified in the process. For example, although the PRO model (Section 3.1) incorporates many aspects of its predecessors, it is a new model nonetheless. So ultimately, perhaps the research community’s focus of assessment should not be an individual model but the wider research programme of which it is a part, an idea that goes back to Lakatos and Musgrave (1970).
Indirect Support. Indirect support for neural models can be debatable. These models are difficult to link with theoretical insights (Kambhampati 2021), at least unless they are combined with probing (for probing in REG, see Chen, Same, and van Deemter 2021). On the other hand, it is sometimes argued that neural models are more cognitively plausible because they are inspired by our knowledge of the brain. Rather than blithely rejecting or accepting this argument, we should treat it as an “indirect support” consideration whose validity deserves to be investigated rigorously, which is not often done yet (though see Ritter et al. 2017 and Momennejad 2022).
Parsimony. Comparisons of parsimony across different types of models can be problematic. For example, whereas traditional models tend to address one NLP task, neural “foundation” models such as BERT are adaptable to a wide variety of tasks, which makes a direct comparison with “single-task” models arguably unfair.
4 Conclusion
There is more to a model than performance; interpretability, novelty, and applicability, for example, are widely recognized as important considerations. I have argued that another set of dimensions is likewise important: they emerge naturally from Hempel’s theory of scientific explanation, and they are aligned with Bayesian thinking about assessing our beliefs and theories. I have also argued that these ideas do not necessarily correspond to explanation in the psychological sense (e.g., as in Lombrozo 2006).
I have focused on models built for enhancing our grasp of language. It may be thought that my arguments are irrelevant to applied NLP models where, allegedly, performance is everything. However, the science and engineering aspects of NLP are thoroughly intertwined (as Ekbia [2008] argued about Artificial Intelligence in general), and some models that started out to solve a practical task were later studied as putative theories of human behavior (as I argued about REG models in van Deemter 2016).
To see how intertwined the two types of NLP are, let’s return to an example from Section 2. In order to couple a given collection of outdoor photos with useful text captions, a company decides to construct a model of human-produced image captions. Performance on the company’s photo collection is clearly paramount. Certain kinds of parsimony may be relevant (e.g., where they impact the time required for model training), and replication of evaluation experiments will reduce the probability of error; but otherwise, direct support is irrelevant, and so is indirect support.
But if a company spends precious resources to construct a model, it may hope that the model keeps performing well even as new photos are added to, or removed from, the collection. If the company is farsighted, it may therefore design its model not solely for the original dataset but, ideally, for all possible collections of the same kind (e.g., comprising both indoor and outdoor photos). Instead of focusing on one dataset, the company will thus target a far wider class of phenomena. If this happens, its work will start to resemble that of a scientist, in which case the argument that I put forward in favor of the three dimensions of model quality applies to it as well.
Researchers in NLP and other areas of Artificial Intelligence should learn to care about the dimensions discussed, without trying to reduce them to a monolith. We should learn to say, and elaborate on, things like, “Model A has similar performance to B. Being newer than B, A has lower levels of direct support. However, A is more parsimonious and appears to have better indirect support than B.” How the different dimensions should be weighed depends on the context in which the model is assessed. But when faced with a model, readers should at least be cognizant of its qualities in terms of the dimensions discussed, because this will help them decide how to use it, how much trust to place in it, both now and in the future, and how to investigate it further.
Different policy mechanisms may help our community to achieve this, similar to (or even as a component of) the model cards that are starting to be used to enhance the documentation of NLP models in terms of their intended use, evaluation details, and so on (Mitchell et al. 2019). Funding agencies should ask proposers to comment on the direct support, parsimony, and indirect support (alias theory support) of the models they propose to develop; this would be analogous to asking proposers to discuss the economic impact of their plans, as is often done. Analogous to the currently non-committal limitations paragraphs solicited by some conferences, journals such as Computational Linguistics should encourage authors to discuss the above quality dimensions for any models proposed in their articles (see the Appendix); which of these dimensions are most relevant in a given case will be for readers to decide.
5 Appendix: Reporting on the Explanatory Value of a Model
When discussing a model in light of the dimensions of explanatory value discussed in this article, the following guidelines may be helpful. The three dimensions are preceded by some preliminary information about the type and scope of the model, and the aims behind it. Where appropriate, the model should be compared with other models in the public domain.
Preliminary information about the model. Please specify what task(s) the model performs. Specify the type of model (neural, rule-based, classic machine learning, etc.); if the model is neural, please specify which kind, making it clear whether the model is pre-trained, fine-tuned, or otherwise. Please specify the scope of the model in terms of its intended domain(s) and text genre(s). Then specify the aims of the model, for example, whether it aims to make predictions, to elucidate the process of human language processing, or otherwise.
Performance. Please start by summarizing the most relevant evaluation results:
Please say briefly in what way(s) the model was evaluated, and summarize the results. If any replications were done, then discuss these as well, indicating the type of replication (see, e.g., Belz 2022); please highlight any differences between the different evaluation experiments in terms of their experimental setup, and in terms of the results obtained.
If the model is a generalization, please specify what type of generalization (e.g., cross-domain, cross-lingual, cross-task), and what the relationship is between the original model and your generalization. If the model was modified, how was this modification performed?
Indirect Support. Should the model be regarded as inherently plausible (or inherently implausible) in light of common sense, theoretical insights, or a body of previously reported results? If so, please explain, and make sure to add references to the literature where these will help readers to assess your claims.
Parsimony. Please comment on how you would rate the parsimony of the model, addressing its simplicity or elegance (or the lack thereof). For instance, for a rule-based model you could consider the complexity of the rule set; for classic machine learning, you could consider the amount of training data and the number of features; for a neural model, you could consider the amount of training data, the number of hidden layers, the number of parameters, the number of training epochs, and so on.
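For authors who prefer a structured checklist, the guidelines above could be condensed into a machine-readable skeleton along the following (purely hypothetical) lines, in the spirit of the model cards mentioned in Section 4; the field names are mine.

```python
# Hypothetical skeleton for reporting the explanatory value of a model,
# condensing the guidelines above; field names are invented for illustration.
model_report = {
    "preliminaries": {
        "tasks": [],            # task(s) the model performs
        "model_type": "",       # neural / rule-based / classic ML; pre-trained or fine-tuned
        "scope": "",            # intended domain(s) and text genre(s)
        "aims": "",             # prediction, elucidating human processing, ...
    },
    "performance": {
        "evaluations": [],      # metrics, human judgments, task performance
        "replications": [],     # type of replication; differences in setup and results
        "generalizations": [],  # cross-domain / cross-lingual / cross-task; model changes
    },
    "indirect_support": {
        "theoretical_grounding": "",  # why the model is (im)plausible a priori
        "references": [],
    },
    "parsimony": {
        "notes": "",            # rules, features, parameters, training data, epochs, ...
    },
}
```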
Acknowledgments
I thank the reviewers and editor of Computational Linguistics, Guanyi Chen, Robert Dale, Mehdi Dastani, Yupei Du, Rembert Duine, Juup Essers, Albert Gatt, Roger van Gompel, Emiel Krahmer, Chenghua Lin, Pablo Mosteiro, Dong Nguyen, Paul Piwek, Ehud Reiter, Susanne Riehemann, Fahime Same, Arno Siebes, and Anna Wegmann for discussions. I thank the audiences of the International Workshop on AI and Cognition (Örebro, Sweden 2020) and of the First Conference on Natural Language Generation and Intelligent Writing (Suzhou, China 2021).
Notes
As is common in the Bayesian literature, I will use “probability” and “plausibility” interchangeably.
To illustrate this point, Clayton discusses cases in which unthinking researchers reported confirmation of some highly implausible hypotheses regarding paranormal behavior (Clayton 2022, chapter 6).
For instance, action-at-a-distance (where objects influence each other without “touching” each other) was viewed with suspicion for some time, but has become an accepted part of physics (see, e.g., French 2005).
Akmajian and Heny (1975) use this example: X liked you is assigned the underlying form X did like you, because this allows one to generate tag questions (X liked you, didn’t he?), negated sentences (X did not like you), and emphatic sentences (X did like you) using one and the same mechanism, thereby minimizing the complexity of the grammar, as measured by the number of rules in it.
But see Bender et al. (2021), who criticize large models for their considerable power consumption.
The mechanisms of Degen et al. (2020) could lend RSA-REG better performance, but until a performance assessment of the resulting model is available, RSA-REG will serve our illustrative purposes.