Neural Network Acceptability Judgments

In this work, we explore the ability of artificial neural networks to judge the grammatical acceptability of a sentence. Machine learning research of this kind is well placed to answer important open questions about the role of prior linguistic bias in language acquisition by providing a test for the Poverty of the Stimulus Argument. In service of this goal, we introduce the Corpus of Linguistic Acceptability (CoLA), a set of 10,657 English sentences labeled as grammatical or ungrammatical by expert linguists. We train several recurrent neural networks to do binary acceptability classification. These models set a baseline for the task. An error analysis testing the models on specific grammatical phenomena reveals that they learn some systematic grammatical generalizations, like subject-verb-object word order, without any grammatical supervision. We find that neural sequence models show promise on the acceptability classification task. However, human-like performance across a wide range of grammatical constructions remains far off.


Introduction
Native English speakers consistently report a sharp contrast in acceptability between pairs of sentences like (1), irrespective of their grammatical training.
(1) a. What did Betsy paint a picture of? b. *What was a picture of painted by Betsy?
Such acceptability judgments are the primary source of empirical data in much of theoretical linguistics, and the goal of a generative grammar is to generate all and only those sentences which native speakers find acceptable (Chomsky, 1957;Schütze, 1996). Despite this centrality, there has been relatively little work on acceptability classification in computational linguistics. The recent explosion of progress in deep learning inspires us to revisit this task. The task has important implications for theoretical linguistics as a test of the Poverty of the Stimulus Argument, as well as in natural language processing as a way to probe the grammatical knowledge of neural sequence models.
The primary contribution of this paper is to introduce a new dataset and several novel neural network baselines with the aim of facilitating machine learning research on acceptability. To begin we define and motivate a version of the acceptability classification task that is suitable for sentence-level machine learning experiments (Section 2). We address the lack of readily available acceptability judgment data by introducing the Corpus of Linguistic Acceptability (CoLA), a collection of sentences labeled for acceptability from the linguistics literature, which at 10,657 examples is by far the largest of its kind.
We train several semi-supervised neural sequence models to do acceptability classification on CoLA and compare their performance with unsupervised models from Lau et al. (2016). Our best model outperforms unsupervised baselines, but falls short of human performance on CoLA by a wide margin. We conduct an error analysis to test our models' performance on specific linguistic phenomena, and find that some models systematically distinguish certain kinds of minimal pairs of sentences differing in gross word order and argument structure. Our experiments show that recurrent neural networks can beat strong baselines on the acceptability classification task, but there remains considerable room for improvement.

Table 1: Sentence types included in and excluded from CoLA.

Included
- Morphological violation: (a) *Maryann should leaving.
- Syntactic violation: (b) *What did Bill buy potatoes and?
- Semantic violation: (c) *Kim persuaded it to rain.

Excluded
- Pragmatic anomalies: (d) *Bill fell off the ladder in an hour.
- Unavailable meanings: (e) *He_i loves John_i. (intended: John loves himself.)
- Prescriptive rules: (f) Prepositions are good to end sentences with.
- Nonce words: (g) *This train is arrivable.

Resources
CoLA can be downloaded from the CoLA website, which also hosts a demo of our best model. Our code is available as well. There are also two competition sites for evaluating acceptability classifiers on CoLA's in-domain and out-of-domain test sets.

Eliciting acceptability judgments from native speakers has been the predominant methodology for research in generative linguistics over the last sixty years (Chomsky, 1957; Schütze, 1996). Linguists generally provide example sentences annotated with binary acceptability judgments from themselves or several native speakers.

The Acceptability Classification Task
Following common practice in linguistics, we define acceptability classification as a binary classification task. An acceptability classifier, then, is a function that maps strings into the set {0, 1}, where '0' is interpreted as unacceptable and '1' as acceptable. This definition also includes generative grammars of the type described by Chomsky (1957) above. CoLA consists entirely of examples from the linguistics literature. Linguists generally present examples to motivate specific arguments, and these sentences are generally chosen to each isolate a particular grammatical construction while minimizing potential distractions. In other words, ungrammatical examples in linguistics publications, like those in CoLA, tend to be unacceptable for a single identifiable reason.

Defining (Un)acceptability
Not all linguistics examples are suitable for acceptability classification. While all acceptable sentences can be included, we exclude four types of unacceptable sentences from the task (examples in Table 1):

Pragmatic Anomalies Examples like (d) can be made interpretable, but only in fanciful scenarios, the construction of which requires real-world knowledge unrelated to grammar.
Unavailable Meanings Examples like (e) are often used to illustrate that a sentence cannot express a particular meaning. This example can only express that someone other than John loves John. We exclude these examples because there is no simple way to force an acceptability classifier to consider only the interpretation in question.

Prescriptive Rules
Examples like (f) violate rules which are generally explicitly taught rather than being learned naturally, and are therefore not considered a part of native speaker grammatical knowledge in linguistic theory.
Nonce Words Examples like (g) illustrate impossible affixation or lexical gaps. Since these words will not appear in the vocabularies of typical word-level NLP models, they will be impossible for these models to judge.
The acceptability judgment task as we define it still requires identifying challenging grammatical contrasts. A successful model needs to recognize (a) morphological anomalies such as mismatches in verbal inflection, (b) syntactic anomalies such as wh-movement out of extraction islands, and (c) semantic anomalies such as violations of animacy requirements of verbal arguments.

Concerns about Acceptability Judgments
Binary vs. Gradient Judgments Though discrete binary acceptability judgments are standard in generative linguistics (Schütze, 1996), Lau et al. (2016) find that when speakers are presented with the option to use a gradient scale to report sentence acceptability, they predictably and systematically use the full scale, rather than clustering their judgments near the extremes as would be expected for a fundamentally binary phenomenon. This is evidence, they argue, that acceptability judgments are gradient in nature. Nevertheless, we consider binary judgments in published examples sufficient for our purposes. These examples are generally chosen to be unambiguously acceptable or unacceptable, and provide the evidence that relevant experts consider maximally germane to the questions at hand. Gibson and Fedorenko (2010) express concern about standard practices around acceptability judgments and call for theoretical linguists to quantitatively measure the reliability of the judgments they report, sparking an ongoing dialog about the validity and reproducibility of these judgments (Sprouse and Almeida, 2012; Sprouse et al., 2013). We take no position on this general question, but perform a small human evaluation to gauge the reproducibility of the judgments in CoLA (Section 5).

Motivation
The acceptability judgment task offers a direct way to test the Poverty of the Stimulus Argument, a key argument in the theory of a strong Universal Grammar (Chomsky, 1965). In addition, acceptability classifiers can be used to probe the grammatical knowledge of neural network models by enabling researchers to test their models for knowledge of specific grammatical constructions.

The Poverty of the Stimulus
The Poverty of the Stimulus Argument holds that purely data-driven learning is not powerful enough to explain the richness and uniformity of human grammars, particularly with data of such low quality as children are exposed to (Clark and Lappin, 2011). This argument is generally wielded in support of the theory of a strong Universal Grammar, which claims that all humans share an innately-given set of language universals, and that domain-general learning procedures are not sufficient to acquire language (Chomsky, 1965).
There is a need to test whether data-driven learners are indeed able to do acceptability classification within constraints similar to those of human learners. For these experiments to have any bearing on this argument, the artificial learner must not be exposed to any knowledge of language that could not plausibly be part of the input to a human learner. We call such knowledge grammatical bias. For example, training a learner with a part-of-speech tagging objective or with extremely large datasets would expose the model to rich linguistic knowledge far beyond what human learners see. Our experiments as described in Section 6 are designed to meet these standards (see Section 6.2 for discussion).

Investigating the Black Box
Recurrent neural network models like the Long Short-Term Memory (LSTM) networks we use (Hochreiter and Schmidhuber, 1997) can discover some forms of structure in raw language data (LeCun et al., 2015). These models are widely used to encode features of sentences in fixed-length sentence embeddings (Cho et al., 2014; Sutskever et al., 2014; Kiros et al., 2015). Evaluating general-purpose sentence embeddings is an active research area. Some approaches probe the contents of sentence embeddings using common natural language processing tasks (Conneau et al., 2017; Wang et al., 2018). Others test whether embeddings encode top-level features like sentence length, parse-tree depth, and tense (Adi et al., 2016; Shi et al., 2016; Conneau et al., 2018).
Acceptability classification can be used to probe sentence representations for linguistic features at a finer level of granularity. After training an acceptability classifier with the sentence representation as its input, one can ask how well that representation encodes linguistic phenomena like thematic role, animacy, and anaphoric dependency simply by testing the classifier on appropriate examples. Section 8 discusses several such case studies. Linzen et al. (2016) present a special case of this approach. They train LSTM models to identify violations of a specific grammatical principle: subject-verb agreement. If the subject is complex, as in the keys to the cabinet are/*is here, this task requires implicitly learning some dependency structure, which some models are able to do. Evaluating a sentence representation system through acceptability classification makes it possible to expand the scope of experiments like these.

Related Work
There have been several prior attempts to use machine learning to learn a function that maps a sentence to a scalar acceptability score. These works vary considerably in the kinds of data used, the sources and natures of the acceptability judgments, and the role of labeled data.

Sources of Sentences
It is possible to construct an acceptability corpus automatically if one has a method that can produce (roughly) unacceptable sentences. One approach is to programmatically generate fake sentences that are unlikely to be acceptable. Wagner et al. (2009) distort real sentences by, for example, deleting words, inserting words, or altering verbal inflection. Lau et al. (2016) use round-trip machine translation from English into various languages and back. A second family of approaches takes sentences from essays written by non-native speakers (Heilman et al., 2014).

Acceptability Judgments Wagner et al. (2009) label a sentence unacceptable if it has gone through one of their automatic distortion procedures. We take a similar approach in our auxiliary (real/fake) task (Section 6). Heilman et al. (2014) and Lau et al. (2016) represent acceptability judgments on a continuous scale from 1 to 4, and average judgments across multiple speakers. Our labeling approach for CoLA follows that of Lawrence et al. (2000) in using already-annotated examples from the linguistics literature.
Grammatical Bias Prior work is inconsistent in the degree to which it gives models access to explicit information about English grammar. At one extreme, Lawrence et al. (2000) convert all their data to part-of-speech tags by hand, giving their model explicit access to these categories, which are not available to human learners. At the other extreme, Lau et al. (2016) use fully unsupervised methods, predicting acceptability as a function of probabilities assigned by unsupervised language models (LMs). We take care not to introduce linguistic bias in the form of grammatical annotations, though our models are partially supervised by example judgments.

CoLA
This paper introduces the Corpus of Linguistic Acceptability (CoLA), a set of example sentences from the linguistics literature labeled for acceptability. Upon publication, CoLA will be made available online, alongside source code for our baseline models, an interactive demo showing judgments by those models, and a leaderboard showing model performance on the test sets (using privately held labels).
Sources We compile CoLA with the aim of representing a wide variety of phenomena of interest in theoretical linguistics. We draw examples from linguistics publications spanning a wide time period, a broad set of topics, and a range of target audiences. Table 2 enumerates our sources. By way of illustration, consider the three largest sources in the corpus: Kim and Sells (2008) is a recent undergraduate syntax textbook, Levin (1993) is a comprehensive reference detailing the lexical properties of thousands of verbs, and Ross (1967) is an influential dissertation on extraction and movement in English syntax.
Preparing the Data The corpus includes all usable examples from each source. We manually remove unacceptable examples falling into any of the excluded categories described in Section 2.3. The labels in the corpus are the original authors' acceptability judgments whenever possible. When examples appear with non-binary judgments (less than 3%), we either exclude them (for labels '?' or '#'), or label them unacceptable ('??' and '*?'). We also expand examples given with optional or alternate phrases into multiple data points. For example Betsy buttered (*at) the toast becomes Betsy buttered the toast and *Betsy buttered at the toast.
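The expansion step can be mechanized for simple cases. The sketch below handles one parenthesized optional phrase per example; the function and its regex are our own simplification, not the authors' actual tooling.

```python
import re

def expand_optional(example, label=1):
    """Expand an example with one parenthesized optional phrase into two
    labeled data points. A leading '*' inside the parentheses marks the
    variant that includes the phrase as unacceptable (label 0)."""
    m = re.search(r"\((\*?)([^)]*)\)\s*", example)
    if m is None:
        return [(example, label)]
    # Variant without the optional phrase keeps the original label.
    without = (example[:m.start()] + example[m.end():]).strip()
    # Variant with the phrase is unacceptable if the phrase was starred.
    starred = m.group(1) == "*"
    with_phrase = (example[:m.start()] + m.group(2) + " "
                   + example[m.end():]).strip()
    return [(without, label), (with_phrase, 0 if starred else label)]
```

For instance, the example from the text, "Betsy buttered (*at) the toast", expands into an acceptable and an unacceptable data point.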
In some cases, we change the content of examples slightly. To avoid irrelevant complications from outof-vocabulary words, we restrict CoLA to the 100k most frequent words in the British National Corpus, and edit sentences as needed to remove words outside that set. For example, That new handle unscrews easily is replaced with That new handle detaches easily to avoid the out-of-vocabulary word unscrews. We make these alterations manually to preserve the author's stated intent, in this case selecting another verb that undergoes the middle voice alternation.
Finally we add content to examples that are not complete sentences, replacing, for example *The Bill's book with *The Bill's book has a red cover.

Example sentences from CoLA with their sources and labels (1 = acceptable, 0 = unacceptable):

Adger (2003): The more books I ask to whom he will give, the more he reads.
Culicover and Jackendoff (1999), 1: I said that my father, he was tight as a hoot-owl.
Ross (1967), 1: The jeweller inscribed the ring with the name.
Levin (1993), 0: many evidence was provided.
Kim and Sells (2008), 1: They can sing.
Kim and Sells (2008), 1: The men would have been all working.
Baltin (1982), 0: Who do you think that will question Seamus first?
Carnie (2013), 0: Usually, any lion is majestic.
Dayal (1998), 1: The gardener planted roses in the garden.
Miller (2002), 1: I wrote Blair a letter, but I tore it up before I sent it.
Rappaport Hovav and Levin (2008)

Human Performance We measure human performance on a subset of CoLA to set a reasonable upper bound for machine performance on acceptability classification, and to estimate the reproducibility of the judgments in CoLA. We obtained acceptability judgments from five linguistics PhD students on 200 sentences from CoLA, divided evenly between the in-domain and out-of-domain development sets. Results are shown in Table 4. Average accuracy across annotators is 86.1%, and the average Matthews correlation coefficient (MCC) is 0.697.
Selecting the majority decision from our annotators gives us a rough upper bound on human performance. These judgments agreed with CoLA's ratings on 87% of sentences, with an MCC of 0.713. In other words, 13% of the labels in CoLA contradict the observed majority judgment. We identify several reasons for disagreements between our annotators and CoLA. Some sentences show copying errors which change the acceptability of the sentence or omit the original judgment. Other disagreements can be ascribed to unreliable judgments on the part of authors or a lack of context. We also measured our individual annotators' agreement with the aggregate rating, yielding an average agreement of 93% and an average MCC of 0.852.

Matthews correlation coefficient (Matthews, 1975) is our primary classification performance metric. It measures correlation on unbalanced binary classification tasks on a scale from -1 to 1, with any uninformed random guessing achieving an expected score of 0.
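The MCC used throughout our evaluation can be computed directly from the binary confusion matrix; a minimal implementation (ours, equivalent to what standard libraries provide):

```python
from math import sqrt

def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary labels in {0, 1}.

    Ranges from -1 to 1; uninformed guessing scores 0 in expectation,
    which makes it robust to unbalanced label distributions.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0 when any confusion-matrix margin is empty.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom
```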

Models
We train several neural network models to do acceptability classification on CoLA. At 10k sentences, CoLA is likely too small to train a low-bias learner like a recurrent neural network without additional prior knowledge. In similar low-resource settings, transfer learning with sentence embeddings has proven to be effective (Kiros et al., 2015;Conneau et al., 2017). We use transfer learning in all our models and train large sequence models on auxiliary tasks. In most experiments a large sentence encoder is trained on a real/fake discrimination task, and a lightweight multilayer perceptron classifier is trained on top to do acceptability classification over CoLA. Inspired by ELMo (Peters et al., 2018), we also experiment with using hidden states from an LSTM language model (LM) as word embeddings.
Real/Fake Pretraining Task We train sentence encoders to distinguish real and fake English sentences. The real data is drawn from the 100-million-token British National Corpus (BNC), and the fake data is a similar quantity automatically generated by two different strategies: we generate strings, e.g. (2-a), using an LSTM LM trained on the BNC, and we manipulate sentences of the BNC, e.g. (2-b), by randomly permuting a subset of the words, keeping the other words in situ.

(2) a. either excessive tenure does not threaten a value to death. b. what happened in to the empire early the tra-

This task is suitable because arbitrary numbers of labeled fake sentences can be generated without using any explicit knowledge of grammar in the process, and we expect that many of the same features are likely relevant to the real/fake task and the downstream acceptability task.
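The second corruption strategy, permuting a subset of a sentence's words while keeping the rest in situ, can be sketched as follows. The number of permuted positions, k, is an illustrative choice; the text does not pin down the exact corruption parameters.

```python
import random

def permute_subset(sentence, k=3, rng=random):
    """Corrupt a sentence by shuffling k randomly chosen word positions,
    leaving all other words in place. Returns a string over the same
    multiset of words as the input."""
    words = sentence.split()
    if len(words) < 2:
        return sentence
    positions = rng.sample(range(len(words)), min(k, len(words)))
    chosen = [words[i] for i in positions]
    rng.shuffle(chosen)  # permute only the selected positions
    for i, w in zip(positions, chosen):
        words[i] = w
    return " ".join(words)
```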
Real/Fake Encoder The real/fake model architecture is shown in Figure 1. A deep bidirectional LSTM reads a sequence of word embeddings. Then, following Conneau et al. (2017), the forward and backward hidden states for each time step are concatenated, and max-pooling over the sequence gives a sentence embedding. This is passed through a sigmoid output layer giving a scalar representing the probability that the sentence is real.
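A minimal PyTorch sketch of this architecture follows; the layer sizes are illustrative placeholders, not the tuned hyperparameters reported later.

```python
import torch
import torch.nn as nn

class RealFakeEncoder(nn.Module):
    """Sketch of the real/fake encoder: a deep bidirectional LSTM reads
    word embeddings; forward and backward states (already concatenated
    by nn.LSTM when bidirectional=True) are max-pooled over time into a
    sentence embedding, then passed through a sigmoid output layer."""
    def __init__(self, vocab_size=1000, emb_dim=300, hidden=512, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, num_layers=layers,
                            bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, 1)

    def encode(self, tokens):                      # tokens: (batch, seq)
        states, _ = self.lstm(self.embed(tokens))  # (batch, seq, 2*hidden)
        return states.max(dim=1).values            # max-pool over time

    def forward(self, tokens):
        # Scalar probability that each sentence is real.
        return torch.sigmoid(self.out(self.encode(tokens))).squeeze(-1)
```

The `encode` output is the fixed-length sentence embedding that gets transferred to the acceptability classifier.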
Acceptability Classifier Our acceptability classifier is a small two-layer network. The input is the fixed-length sentence embedding transferred from the real/fake encoder. This is passed through a tanh hidden layer followed by a sigmoid output layer. The encoder's weights are frozen during training on the acceptability task due to the relatively small size of CoLA.

LM Encoder
Further experiments use the LSTM LM described above as an encoder. The hidden states are transferred and an additional LSTM layer is trained on CoLA. As in Figure 1, the max pooling of the hidden states gives a sentence embedding which is passed to a sigmoid output layer.

Word Representations
We experiment with several kinds of word representations: (i) We train word embeddings from scratch along with LSTMs on the language modeling or real/fake objectives. (ii) We use ELMo-style contextualized word embeddings from our trained LM. As in ELMo (Peters et al., 2018), the representation for w_i here is a linear combination of the hidden states h_i^j for each layer j in an LSTM LM, though we depart from the original paper by using only a forward LM. (iii) In a more exploratory experiment, we also use pretrained 300-dimensional (6B) GloVe embeddings (Pennington et al., 2014). Note that these embeddings are trained on orders of magnitude more words than human learners ever see, limiting the possible interpretations of a positive result in this setting.
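The ELMo-style combination in (ii) can be sketched as a learned, softmax-normalized weighting over LM layers. The softmax normalization and the scalar γ follow the original ELMo recipe; whether this model used γ is not stated in the text, so treat it as an assumption.

```python
import torch
import torch.nn as nn

class LayerMix(nn.Module):
    """ELMo-style contextual embedding: a learned scalar weight per LM
    layer mixes the hidden states h_i^j into one vector per token
    (forward-only LM, as described in the text)."""
    def __init__(self, num_layers):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))  # one per layer
        self.gamma = nn.Parameter(torch.ones(1))              # global scale

    def forward(self, layer_states):
        # layer_states: (num_layers, batch, seq, dim)
        w = torch.softmax(self.weights, dim=0)
        return self.gamma * (w.view(-1, 1, 1, 1) * layer_states).sum(0)
```

At initialization the weights are uniform, so the output is simply the mean of the layer states.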
CBOW Baseline For a simple baseline, we train a continuous bag-of-words (CBOW) model directly on CoLA, using the sums of word embeddings from our best LSTM LM.

The Poverty of the Stimulus Revisited
Our experiments meet the design objectives laid out in Section 3.1 for testing the Poverty of the Stimulus Argument. The real/fake pretraining introduces no explicit grammatical bias into the sentence encoders; hence any linguistic features these models learn are acquired without instruction in what kinds of categories or structures are relevant to language understanding. The encoders are trained on 100-200 million tokens, which is within a factor of ten of the number of tokens human learners are exposed to during language acquisition (Hart and Risley, 1992). Unbiased sentence embeddings are the sole input to our acceptability classifiers. While the classifiers themselves are potentially biased from the roughly 9,000 expert-annotated sentences in the CoLA training set, practically all the linguistic knowledge in our models comes from the sequence model, which has several orders of magnitude more parameters and more training data. The acceptability classifier's role is primarily to extract from the sentence embedding the linguistic knowledge needed for the acceptability judgment. We control for any advantage CoLA training gives our models over naïve speakers by comparing their performance to that of trained linguists experienced with acceptability judgments. We also mitigate the impact of this training data by evaluating the model on the out-of-domain test set, in which it must reproduce judgments unrelated to those available to the model at training time.

Lau et al. Baselines
We compare our models with those of Lau et al. (2016). Their models obtain an acceptability prediction from unsupervised LMs by normalizing the LM output using one of several metrics. Following their recommendation, we use both the SLOR and Word LogProb Min-1 metrics. 10 Since these metrics produce unbounded scalar scores rather than probabilities or binary judgments, we fit a threshold to the outputs in order to use these models as acceptability classifiers. This is done with 10-fold cross-validation: we repeatedly find the optimum threshold for 90% of the model outputs and evaluate the remaining 10% with that threshold, until all the data have been evaluated. Following their methods, we train n-gram models on the BNC using their published code. 11 In place of their RNN LM, we use the same LSTM LM that we trained to generate sentences for the real/fake task.
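A minimal sketch of scoring with these metrics, given per-word LM and unigram log probabilities; the function and argument names are ours, and the formulas follow the definitions given in the footnote.

```python
from math import fsum

def slor(word_logps_lm, word_logps_unigram):
    """SLOR: (log p_LM(s) - log p_u(s)) / |s|, computed here from
    per-word log probabilities whose sums give the sentence scores."""
    n = len(word_logps_lm)
    return (fsum(word_logps_lm) - fsum(word_logps_unigram)) / n

def word_lp_min1(word_logps_lm, word_logps_unigram):
    """Word LogProb Min-1: min over words w of -log p_LM(w) / log p_u(w)."""
    return min(-lm / uni for lm, uni in
               zip(word_logps_lm, word_logps_unigram))
```

Both scores are unbounded, which is why a classification threshold must be fit by cross-validation as described above.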

Training details
All models are trained using PyTorch and optimized using Adam (Kingma and Ba, 2014).

We train 20 LSTM LMs with from-scratch embeddings for up to 7 days or until completing 4 epochs without improving in development perplexity, and select the best. Hyperparameters are chosen at random in these ranges: embedding size ∈ [200, 600], hidden size ∈ [600, 1200], number of layers ∈ [1, 4], learning rate ∈ [10^-5, 3 × 10^-3], dropout rate ∈ {0.2, 0.5}.

We train 20 real/fake classifiers with from-scratch embeddings, 20 with GloVe, and 20 with ELMo-style embeddings for up to 7 days or until completing 4 epochs without improving in development MCC. Hyperparameters are chosen at random in these ranges: embedding size ∈ [200, 600], hidden size ∈ [600, 1400], number of layers ∈ [1, 5], learning rate ∈ [10^-5, 3 × 10^-3], dropout rate ∈ {0.2, 0.5}.

We train 10 acceptability classifiers for each encoder until completing 20 epochs without improving in MCC on the CoLA development set. Hyperparameters are chosen at random in these ranges: hidden size ∈ [20, 1200], learning rate ∈ [10^-5, 10^-2], dropout rate ∈ {0.2, 0.5}.

Table 4 shows our results. The best model is the real/fake model with ELMo-style embeddings. It achieves the highest MCC and accuracy both in-domain and out-of-domain by a large margin, outperforming even the models with access to GloVe.

10 Where s is the sentence, |s| its length, p_LM(x) the probability the LM assigns to string x, and p_u(x) the unigram probability of string x: Word LP Min-1 = min_{w ∈ s} [-log p_LM(w) / log p_u(w)], and SLOR = (log p_LM(s) - log p_u(s)) / |s|.

11 https://github.com/jhlau/acceptability_prediction
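The random search over these ranges can be sketched as follows for the LM encoder. Sampling the learning rate log-uniformly is our assumption; the text states only the interval.

```python
import math
import random

def sample_lm_hparams(rng=random):
    """Draw one hyperparameter setting from the LSTM LM search ranges
    quoted in the text (discrete sizes uniform, learning rate
    log-uniform, dropout from the stated two-element set)."""
    return {
        "embedding_size": rng.randint(200, 600),
        "hidden_size": rng.randint(600, 1200),
        "num_layers": rng.randint(1, 4),
        "learning_rate": 10 ** rng.uniform(math.log10(1e-5),
                                           math.log10(3e-3)),
        "dropout": rng.choice([0.2, 0.5]),
    }
```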

Results and Discussion
All our models perform better than the unsupervised models of Lau et al. (2016) on both evaluation metrics on the in-domain test set. Out of domain, Lau et al.'s baselines offer the second-best results. Our models consistently perform worse out-of-domain than in-domain, with MCC dropping by as much as 50% in one case. Since Lau et al.'s baselines don't use the training set, they perform similarly in-domain and out-of-domain.
The sequence models consistently outperform the word order-independent CBOW baseline, indicating that the LSTM models are using word order for acceptability classification in a non-trivial way. In line with Lau et al.'s findings, the n-gram LM baselines are worse than the RNN LM. These results suggest that, unsurprisingly, LSTMs are better at capturing long-distance dependencies than n-gram models with a limited feature window.
Discussion Our LSTM models appear to be the best currently available low-bias learners for acceptability classification. Compared to humans, however, their absolute performance is underwhelming. We do not interpret this result as proof positive for the Poverty of the Stimulus Argument, though, as these experiments reflect an early attempt at acceptability classification, and it is possible that more sophisticated low-bias models will decrease the performance gap substantially. The supervised models see a substantial drop in performance from the in-domain test set to the out-of-domain test set. This suggests that they have learned an acceptability model that is somewhat specialized to the phenomena in the training set, rather than the general model of English one would expect. Addressing this problem will likely involve new forms of regularization to mitigate this overfitting and, more importantly, better pretraining strategies that can help the model learn the fundamental ingredients of grammaticality from unlabeled data.
We also measure the models' performance on individual sources in the in-domain development set. Performance is highly variable, with MCC ranging from 0.487 for the ELMo-style real/fake model on the Ross (1967) data, to 0.023 for the real/fake model with GloVe on Adger (2003).

Fine-Grained Analysis
Here, we run additional evaluations to probe whether our models are able to successfully learn grammatical generalizations. For these tests we generate five auxiliary datasets (described below) using simple rewrite grammars which target specific grammatical contrasts. The results from these experiments are shown in Table 5.
Unlike in CoLA, none of these judgments are meant to be difficult or controversial, and we expect that most humans could reach perfect accuracy. We also take care to make the test sentences as simple as possible to reduce classification errors unrelated to the target contrast. This is accomplished by limiting noun phrases to 1 or 2 words, and by using semantically related vocabulary items within examples.
Subject-Verb-Object This test set contrasts the canonical English subject-verb-object word order with ungrammatical reorderings, as in (3).

(3) John read the book. / *John the book read. / *read John the book. / *read the book John. / *the book read John.

Wh-Extraction This test set contrasts pairs as in (4). This is to test (1) whether a model has learned that a wh-word must correspond to a gap somewhere in the sentence, and (2) whether the model can identify non-local dependencies up to three words away. The data contain 10 first names as subjects and 8 sets of verbs and related objects (5). Every compatible verb-object pair appears with every subject.
(4) a. What did John fry? b. *What did John fry the potato?

Causative-Inchoative Alternation
This test set is based on a syntactic alternation conditioned by the lexical semantics of particular verbs. It contrasts verbs like popped, which undergo the causative-inchoative alternation, with verbs like blew, which do not. If popped is used transitively (6-a), the subject (Kelly) is an agent who causes the object (the bubble) to change states. Used intransitively (6-b), it is the subject (the bubble) that undergoes a change of state, and the cause need not be specified (Levin, 1993). The test set includes 91 verb/object pairs, and each pair occurs in the two forms as in (6). 36 pairs allow the alternation, and the remaining 55 do not.

Subject-Verb Agreement
This test set is generated from 13 subjects in singular and plural form crossed with 13 verbs in singular and plural form. This gives 169 quadruples as in (7).
(7) a. My friend has/*have to go. b. My friends *has/have to go.
Reflexive-Antecedent Agreement This test set probes whether a model has learned that every reflexive pronoun must agree with an antecedent noun phrase in person, number, and gender. The dataset consists of a set of 4 verbs crossed with 6 subject pronouns and 6 reflexive pronouns, giving 144 sentences, of which only 1 in 6 are acceptable.
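This crossing design can be sketched as follows. The pronoun and verb inventories below are illustrative stand-ins, not the paper's exact items; past-tense verbs avoid introducing a subject-verb agreement confound.

```python
from itertools import product

# Hypothetical mini-inventory: each subject pronoun licenses exactly one
# reflexive, so 1 in 6 of the generated sentences is acceptable.
SUBJECTS = {"I": "myself", "you": "yourself", "he": "himself",
            "she": "herself", "we": "ourselves", "they": "themselves"}
REFLEXIVES = list(SUBJECTS.values())
VERBS = ["hurt", "saw", "praised", "described"]

def reflexive_pairs():
    """Cross verbs x subjects x reflexives into labeled examples:
    label 1 iff the reflexive agrees with the subject."""
    data = []
    for verb, (subj, match), refl in product(VERBS, SUBJECTS.items(),
                                             REFLEXIVES):
        data.append((f"{subj} {verb} {refl} .", int(refl == match)))
    return data
```

With 4 verbs, 6 subjects, and 6 reflexives this yields 144 sentences, matching the size of the test set described above.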

Results
The results in Table 5 show that LSTMs do make some systematic acceptability judgments as though they learn correct grammatical generalizations. Gross word order (SVO in Table 5) is especially easy for the models. The Real/Fake model with GloVe embeddings achieves near perfect correlation, suggesting that it systematically distinguishes gross word order. However, the remaining tests fall short of perfect performance. Our models consistently outperform Lau et al.'s baselines on lexical semantics (Causative), judging more accurately whether a verb can undergo the causative-inchoative alternation. This may be due in part to the fact that our models are trained on CoLA which contains examples of similar alternations from Levin (1993).
Lau et al.'s baselines outperform our models on the remaining tests. They consistently identify the long-distance dependency between a wh-word and its gap (Wh-extraction), while our models are near chance. Our models also perform relatively poorly on judgments involving agreement (Singular/Pl, Reflexive). While the poor absolute performance of these models is likely due to their inability to access sub-word morphological information, we have no working hypothesis to explain the relative success of the Lau et al. metrics.


Conclusion

This work offers resources and baselines for the study of semi-supervised machine learning for acceptability judgments. Most centrally, we introduce the first large-scale corpus of acceptability judgments, making it possible to train and evaluate modern neural networks on this task. In baseline experiments, we find that a network trained on our artificial real/fake task, combined with ELMo-style word representations, outperforms other available models, but remains far from human performance.
Much work remains to be done to implement the agenda described in Section 3. To provide stronger evidence on the Poverty of the Stimulus Argument, we hope for future work to test the performance of a broader range of effective candidate low-bias machine learning models, and to investigate how much can be gained by explicitly introducing specific forms of grammatical knowledge into models.