Beat the AI: Investigating Adversarial Human Annotations for Reading Comprehension

Innovations in annotation methodology have been a propellant for Reading Comprehension (RC) datasets and models. One recent trend to challenge current RC models is to involve a model in the annotation process: humans create questions adversarially, such that the model fails to answer them correctly. In this work we investigate this annotation approach and apply it in three different settings, collecting a total of 36,000 samples with progressively stronger models in the annotation loop. This allows us to explore questions such as the reproducibility of the adversarial effect, transfer from data collected with varying model-in-the-loop strengths, and generalisation to data collected without a model. We find that training on adversarially collected samples leads to strong generalisation to non-adversarially collected datasets, yet with progressive deterioration as the model-in-the-loop strength increases. Furthermore, we find that stronger models can still learn from datasets collected with substantially weaker models in the loop: when trained on data collected with a BiDAF model in the loop, RoBERTa achieves 36.0 F1 on questions that it cannot answer when trained on SQuAD, only marginally lower than when trained on data collected using RoBERTa itself.


Introduction
Data collection is a fundamental prerequisite for Machine Learning-based approaches to Natural Language Processing (NLP). Innovations in data acquisition methodology, such as crowdsourcing, have led to major breakthroughs in scalability and preceded the "deep learning revolution", for which they can arguably be seen as co-responsible (Deng et al., 2009; Bowman et al., 2015; Rajpurkar et al., 2016). Annotation approaches include expert annotation, e.g. by relying on trained linguists (Marcus et al., 1993), crowd-sourced annotation by non-experts (Snow et al., 2008), distant supervision (Mintz et al., 2009; Joshi et al., 2017), and leveraging document structure for annotation purposes (Hermann et al., 2015). The concrete data collection paradigm chosen dictates the degree of scalability, annotation cost, precise task structure (which often arises as a compromise of the above), domain coverage, task difficulty, as well as resulting dataset biases and model blind spots (Jia and Liang, 2017; Schwartz et al., 2017; Gururangan et al., 2018).

[Figure 1: An example passage (on the history of steam engines) with adversarially written questions, composed against models of increasing model-in-the-loop strength.]

A recently emerging trend in NLP dataset assembly is the use of a model in the loop when composing the samples: a contemporary model is used either as a filter or directly during annotation, retaining only samples wrongly predicted by the model. Examples of this method are realised in Build It Break It, The Language Edition (Ettinger et al., 2017), SWAG (Zellers et al., 2018), HotpotQA (Yang et al., 2018), DROP (Dua et al., 2019), CODAH (Chen et al., 2019), Quoref (Dasigi et al., 2019) and AdversarialNLI (Nie et al., 2019); Richardson et al. (2013) alluded to this idea in their work, but it has only recently seen wider adoption. The practice probes model robustness and ensures that the resulting datasets pose a challenge to current models, in turn driving research and modelling efforts to tackle the new problem set.
But how robust is the approach itself in the face of continuously progressing models: do such datasets quickly become outdated in their usefulness as models become stronger (Devlin et al., 2019)? Based on models trained on the widely used SQuAD dataset, and following the same basic annotation protocol, we investigate the additional annotation requirement that the annotator has to compose questions for which the model predicts the wrong answer. As a result, only samples which the model fails to predict correctly are retained in the dataset; see Fig. 1 for an example.
We apply this annotation strategy with three distinct models in the loop, resulting in datasets with 12,000 samples each. We then study the reproducibility of the adversarial effect when retraining the models with the same data, as well as the ability of models trained on the resulting datasets to generalise to datasets composed with and without a model adversary. Models can, to a considerable degree, learn to generalise to these challenging questions, based on training sets collected with both stronger and weaker models in the loop. Compared to training on SQuAD, training on adversarially composed questions leads to a similar degree of generalisation to non-adversarially written questions, both for SQuAD and NaturalQuestions (Kwiatkowski et al., 2019). It furthermore leads to general improvements across the model-in-the-loop datasets we collected, as well as improvements of more than 20.0 F1 for both BERT and RoBERTa on an extractive subset of DROP (Dua et al., 2019), another adversarially composed dataset. When conducting a systematic analysis of the concrete questions that different models fail to answer correctly, as well as of non-adversarially composed questions, we see that the nature of the resulting questions changes: questions composed with a model in the loop are overall more diverse, use more paraphrasing, multi-hop inference, background knowledge and comparisons, and are generally less easily answerable by matching an explicit statement that states the required information literally. Given our observations, we believe a model-in-the-loop approach to annotation shows promise and should be considered as an option when creating future RC datasets.
To summarise, our contributions are as follows:
1. An investigation into the model-in-the-loop approach to RC data collection based on three progressively stronger RC models.
2. An empirical performance comparison of models trained on datasets constructed with adversaries of different strength.
3. A comparative investigation into the nature of questions composed to be unsolvable by a sequence of progressively stronger RC models.

Related Work
Constructing Challenging Datasets Recent efforts in dataset construction have driven considerable progress in the RC task, yet dataset structures are diverse and annotation methodologies vary. With its large size and combination of free-form questions with answers as extracted spans, SQuAD1.1 (Rajpurkar et al., 2016) has become an established benchmark which has inspired the construction of a series of similarly structured datasets. However, mounting evidence suggests that models can achieve strong generalisation performance merely by relying on superficial cues, such as lexical overlap, term frequencies, or entity-type matching (Chen et al., 2016; Weissenborn et al., 2017; Sugawara et al., 2018). It has thus become an increasingly important consideration to construct datasets which RC models find challenging and for which natural language understanding is a requisite for generalisation. Attempts to achieve this non-trivial aim have typically revolved around extensions to the SQuAD dataset annotation methodology. They include unanswerable questions (Trischler et al., 2016; Rajpurkar et al., 2018; Reddy et al., 2019; Choi et al., 2018), adding the option of "Yes" or "No" answers, and the use of a model either to filter samples or directly in the annotation loop […]. We are primarily interested in the latter category, as this feedback loop creates an environment where the annotator can probe the model directly to explore its weaknesses and formulate targeted adversarial attacks. While Dua et al. (2019) and Dasigi et al. (2019) make use of adversarial annotations for RC, both annotation setups limit the reach of the model in the loop: in DROP, primarily due to the imposition of specific answer types, and in Quoref by focusing on co-reference, which is already a known RC model weakness.
In contrast, we investigate a scenario where annotators interact with a performant model in its original task setting; annotators must thus explore a range of natural adversarial attacks, as opposed to merely filtering out "easy" samples during the annotation process.

[Figure 2: Overview of the annotation process to collect adversarially written questions from humans using a model in the loop.
1. The human generates a question q and selects an answer a_h for passage p.
2. (p, q) is sent to the model, which predicts an answer a_m.
3. The F1 score between a_h and a_m is calculated; if it exceeds the threshold (40%), the human loses and tries again.
4. Otherwise the human wins, and the human-sourced adversarial example (p, q, a_h) is collected.]

Annotation Protocol
The protocol used for data annotation is based on SQuAD1.1, but with the additional instruction that questions should have only one possible answer in the passage, as well as a model adversary in the loop.
Formally, provided with a passage p, a human annotator generates a question q and selects a (human) answer a_h by highlighting the corresponding span in the passage. The input (p, q) is then given to the model, which returns a predicted (model) answer a_m. To compare the two, a word-overlap F1 score between a_h and a_m is computed; a score above a threshold of 40% is considered a win for the model. This process is repeated until the human "wins"; Figure 2 gives a schematic overview of the process. All successful (p, q, a_h) triples, i.e. those which the model is unable to answer correctly, are then retained for further validation.
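For concreteness, the win condition can be sketched as follows. This is a minimal illustration assuming simple whitespace tokenisation (the official SQuAD F1 additionally lowercases and strips punctuation and articles), not the authors' exact implementation.

```python
from collections import Counter

def f1_score(prediction: str, ground_truth: str) -> float:
    """Word-overlap F1 between two answer strings (simplified: whitespace
    tokenisation only; the official SQuAD script also normalises case,
    punctuation and articles)."""
    pred_tokens = prediction.split()
    gold_tokens = ground_truth.split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def human_wins(a_h: str, a_m: str, threshold: float = 0.4) -> bool:
    """The sample is retained only if the model answer a_m overlaps the
    human answer a_h at or below the 40% F1 threshold."""
    return f1_score(a_m, a_h) <= threshold
```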

Annotation Details
Models in the Annotation Loop We begin by training three different models, which are used as adversaries during data annotation. As a seed dataset for training the models we select the widely used SQuAD1.1 (Rajpurkar et al., 2016) dataset, a large-scale resource for which a variety of mature and well-performing models are readily available. Furthermore, unlike cloze-based datasets, SQuAD is robust to passage/question-only adversarial attacks (Kaushik and Lipton, 2018). We will compare dataset annotation with a series of three progressively stronger models as adversary in the loop, namely BiDAF (Seo et al., 2017), BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019). Each of these will serve as a model adversary in a separate annotation experiment and result in separate datasets; we will refer to these as D_BiDAF, D_BERT and D_RoBERTa, respectively. We rely on the AllenNLP (Gardner et al., 2017) and Transformers (Wolf et al., 2019) model implementations, and our models achieve EM/F1 scores of 65.5%/77.5%, 82.7%/90.3% and 86.9%/93.6% for BiDAF, BERT and RoBERTa, respectively, on the SQuAD1.1 validation set.
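As an illustration of how such an adversary can be queried, the sketch below uses the Transformers question-answering pipeline; the checkpoint name is a publicly available stand-in, not the authors' own SQuAD1.1-trained model, and BiDAF was served via AllenNLP rather than Transformers.

```python
from transformers import pipeline

# Hypothetical stand-in checkpoint; the paper's adversaries were trained
# on SQuAD1.1 by the authors themselves.
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

prediction = qa(
    question="Who patented a steam pump in 1698?",
    context="In 1698 Thomas Savery patented a steam pump [...]",
)
print(prediction["answer"], prediction["score"])
```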
Our choice of models reflects both the transition from LSTM-based to pre-trained transformer-based models, as well as a graduation in strength among the latter; we will investigate how this is reflected in datasets collected with each of these different models in the annotation loop. For each of the models we collect 10,000 training, 1,000 validation and 1,000 test examples. Dataset sizes are motivated by the improved data efficiency of pre-trained transformer-based models (Devlin et al., 2019; Liu et al., 2019), which makes smaller-scale data collection efforts viable for investigative and analysis purposes.
To ensure the experimental integrity provided by reporting all results on a held-out test set, we split the existing SQuAD1.1 validation set in half (stratified by document title), since the test set is not publicly available. We maintain passage consistency across the training, validation and test sets of all analysis datasets to enable like-for-like comparisons. Since SQuAD1.1 validation set questions commonly have multiple answers and the standard SQuAD1.1 evaluation method involves taking the maximum score over all possible answers, we enforce an additional evaluation constraint by taking the majority-vote answer as ground truth for SQuAD1.1. This ensures that all our experimental resources have one valid answer per question, enabling us to fairly draw direct comparisons. For clarity, we will hereafter refer to this modified version of SQuAD1.1 as D_SQuAD.
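A minimal sketch of this majority-vote reduction (tie-breaking is an assumption; the paper does not specify it):

```python
from collections import Counter

def majority_vote_answer(answers: list[str]) -> str:
    """Reduce multiple reference answers to the single most frequent one;
    ties resolve to the answer encountered first."""
    return Counter(answers).most_common(1)[0][0]

# e.g. majority_vote_answer(["1959", "in 1959", "1959"]) -> "1959"
```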
Crowdsourcing We use custom-designed Human Intelligence Tasks (HITs) served through Amazon Mechanical Turk (AMT) for all annotation efforts (see Appendix B). Workers are required to be based in Canada, the UK, or the US, have a HIT Approval Rate greater than 98%, and have previously completed at least 1,000 HITs successfully. We experiment with and without the AMT Master requirement and find no substantial difference in quality, but a throughput reduction of nearly 90%. We pay $2 for every question generation HIT, during which workers are required to compose up to five questions which "beat" the model in the loop. The mean HIT completion times for BiDAF, BERT and RoBERTa are 551.8s, 722.4s and 686.4s, respectively. Furthermore, we find that human workers are able to generate questions which successfully "beat" the model in the loop 59.4% of the time for BiDAF, 47.1% for BERT and 44.0% for RoBERTa. These rates broadly reflect the relative strengths of the models.

Quality Control
Training and Qualification We provide a two-part worker training interface in order to i) familiarise workers with the process, and ii) conduct a first screening based on workers' outputs. The interface familiarises workers with formulating questions and answering them through span selection controls. Workers are asked to highlight two answers for provided questions, generate two questions for provided answers, generate one full question-answer pair, and finally complete a question generation HIT with BiDAF as the model in the loop. Each worker's output is then manually reviewed; those who pass the screening are qualified for the second annotation stage.

Manual Worker Validation
In the second annotation stage, workers produce data for the "Beat the AI" question generation task. A sample of every worker's question generation HITs is manually reviewed; the sample size is 5 · log10(n) + 1, where n is the worker's total number of completed tasks, chosen for convenience. This is done after every annotation batch; if workers fall below an 80% success threshold at any point, their qualification is revoked and their work is discarded in its entirety.
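For illustration, the review sample size can be computed as below; the rounding behaviour is an assumption, as the formula is stated without it.

```python
import math

def review_sample_size(n: int) -> int:
    """Number of a worker's HITs to manually review, given n completed
    tasks; rounding to the nearest integer is an assumption."""
    return round(5 * math.log10(max(n, 1)) + 1)
```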
Question Answerability As the models used in the annotation task become stronger, the resulting questions tend to become more complex. However, this also means that it becomes more challenging to disentangle measures of dataset quality from inherent question difficulty. We therefore define the following human-answerability condition for an annotated question-answer pair: it is answerable if at least one of three additional non-expert human validators can provide an answer matching the original. We conduct answerability checks on both the validation and test sets and achieve answerability scores of 87.95%, 85.41% and 82.63% for D_BiDAF, D_BERT and D_RoBERTa, respectively. We discard all questions deemed unanswerable from the validation and test sets, and further discard all data from any worker with fewer than half of their questions considered answerable. It should be emphasised that the main purpose of this process is to create a level playing field for comparison across datasets constructed with different model adversaries; it can inevitably result in valid questions being discarded. The total cost for training and qualification, dataset construction and validation is approximately $27,000.
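A minimal sketch of this answerability criterion, reusing the f1_score helper from the annotation-protocol sketch above; the exact matching criterion is not stated in this section, so the 40% word-overlap threshold is an assumption.

```python
def is_answerable(original: str, validator_answers: list[str],
                  threshold: float = 0.4) -> bool:
    """Answerable if at least one of the (three) validators' answers
    matches the original; the threshold choice is an assumption."""
    return any(f1_score(ans, original) > threshold
               for ans in validator_answers)
```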
Human Performance We select a randomly chosen validator's answer to each question and compute Exact Match (EM) and word-overlap F1 scores against the original to calculate non-expert human performance; Table 1 shows the results. We observe a clear trend: the stronger the model in the loop used to construct the dataset, the harder the resulting questions become for humans.

Dataset Statistics
In Table 2 we provide general details on the number of passages and question-answer pairs used in the different dataset splits. The average number of words in questions and answers, as well as the average longest n-gram overlap between passage and question, are furthermore given in Table 3. We can again observe two clear trends: from weaker towards stronger models used in the annotation loop, the average length of answers increases, and the longest n-gram overlap drops from 3 to 2 tokens. That is, on average there is a trigram overlap between the passage and question for D_SQuAD, but only a bigram overlap for D_RoBERTa (Figure 3). This is in line with prior observations on lexical overlap as a predictive cue in SQuAD (Weissenborn et al., 2017; Min et al., 2018); questions with less overlap are harder to answer for any of the three models. We furthermore perform analyses of question types based on the question wh-word. We find that, in contrast to D_SQuAD, the datasets collected with a model in the loop have fewer when, how and in questions, and instead more which, where and why questions, as well as more questions in the other category, which indicates increased question diversity. In terms of answer types, we observe more common noun and verb phrase clauses than in D_SQuAD, as well as fewer dates, names, and numeric answers. This reflects the strong answer-type matching capabilities of contemporary RC models. For further dataset statistics, see Appendix A.
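The longest n-gram overlap statistic can be sketched as follows, assuming whitespace tokenisation (the paper's exact tokenisation is not specified):

```python
def longest_ngram_overlap(passage: str, question: str) -> int:
    """Largest n such that some n-gram of the question also occurs in the
    passage."""
    p_tokens, q_tokens = passage.split(), question.split()
    best = 0
    for n in range(1, len(q_tokens) + 1):
        p_ngrams = {tuple(p_tokens[i:i + n])
                    for i in range(len(p_tokens) - n + 1)}
        q_ngrams = {tuple(q_tokens[i:i + n])
                    for i in range(len(q_tokens) - n + 1)}
        if p_ngrams & q_ngrams:
            best = n
        else:
            break  # no shared n-gram implies no shared (n+1)-gram
    return best
```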
While D_BiDAF, D_BERT and D_RoBERTa were created for the investigation and analysis of human-sourced adversarial examples in a model-in-the-loop setting for RC, we recognise their potential value to the community and plan to release all three training and validation sets publicly.

Consistency of the Model in the Loop
We begin with an experiment on the consistency of the adversarial nature of the models in the annotation loop. Our annotation pipeline is designed to reject any samples where the model correctly predicts the answer. How reproducible is this when retraining the same model with the same data? To measure this, we evaluate, for each architecture, the performance of two models with identical setup, which differ only in their random initialisation and in the order in which training data is sampled during SGD. We can thus isolate how strongly the resulting dataset depends on the particular random initialisation and data order used to train the model. The results of this experiment are shown in Table 4. First, we observe, as expected given our annotation constraints, that model performance is 0.0 EM on datasets created with the same respective model in the annotation loop. However, a retrained model does not reliably perform as poorly on those samples. For example, BERT reaches as much as 20.5 EM, whereas the initial model (Seed 1, used during annotation) has no correct answers and 0.0 EM. We observed this effect repeatedly when re-running the experiment with other random re-initialisations. This demonstrates that random components of model training can substantially affect the adversarial annotation process. The evaluation furthermore serves as a baseline for subsequent model evaluations: this much of the performance range can be recovered merely by retraining the same model. A possible takeaway for future applications of the model-in-the-loop annotation strategy is to rely on ensembles of adversaries and reduce the dependency on one particular model instantiation.

Adversarial Generalisation
A potential problem with the focus on challenging questions is that they might all be very distinct from one another, leading to difficulties in learning to generalise from and to them. We next conduct a series of experiments in which we train on D_BiDAF, D_BERT, and D_RoBERTa, and observe how well models can then learn to generalise on the respective test portions of these datasets. Table 5 shows the results, from which we make several observations.
First, across all training data setups we observe a clear negative performance progression when evaluating against datasets constructed with progressively stronger models in the loop. This trend holds for all but the BiDAF model, in each of the training configurations, and for each of the evaluation datasets. For example, RoBERTa trained on D_RoBERTa achieves 71.4, 53.5, 48.6 and 38.9 F1 when evaluated on D_SQuAD, D_BiDAF, D_BERT and D_RoBERTa, respectively.
Second, we observe that the BiDAF model is not able to generalise well to datasets constructed with a model in the loop, independent of its training setup. In particular, it is unable to learn from D_BiDAF, thus failing to overcome some of its own blind spots through adversarial training. Both when training only on D_BiDAF and when adding D_SQuAD to D_BiDAF during training (cf. Table 6), BiDAF performs poorly across all the adversarial datasets.
In contrast, BERT and RoBERTa are able to partially overcome their blind spots through training on data collected with a model in the annotation loop, and to a degree that far exceeds what one would expect from random retraining (cf. Table 4). For example, RoBERTa trained on D_RoBERTa reaches 38.9 F1 on D_RoBERTa […].

Next, we observe that training on D_S, where S is a stronger model, helps generalisation to D_W for a weaker RC model W, e.g. training on D_RoBERTa and testing on D_BERT. On the other hand, training on D_W also leads to generalisation towards D_S: for example, the baseline of RoBERTa trained on 10,000 SQuAD samples reaches 22.1 F1 on D_RoBERTa (D_S), whereas training RoBERTa on D_BiDAF or D_BERT (D_W) raises this to 36.0 F1 and 34.6 F1, respectively. This suggests an encouraging takeaway for the model-in-the-loop annotation paradigm: even though a particular model chosen as adversary in the annotation loop may at some point fall behind more competitive state-of-the-art models, these future models can still use the data collected with the weaker model in the loop, and generalise better even to samples composed with the stronger model in the loop.
In Table 6 we show experimental results for the same models and training datasets, but now including SQuAD as additional training data. In this training setup we generally see improved generalisation to D_BiDAF, D_BERT, and D_RoBERTa. Interestingly, the relative differences between D_BiDAF, D_BERT, and D_RoBERTa as training sets used in conjunction with SQuAD are now much diminished, and especially D_RoBERTa as (part of the) training set now generalises substantially better. RoBERTa achieves the strongest results on any of the D_BiDAF, D_BERT, and D_RoBERTa evaluation sets, in particular when trained on D_SQuAD + D_RoBERTa. This stands in contrast to the previous results in Table 5, where training on D_BiDAF in several cases led to better generalisation than training on D_RoBERTa. A possible explanation for this observation is that training on D_RoBERTa leads to a larger degree of adversarial overfitting than training on D_BiDAF, and the inclusion of a large number of standard SQuAD training samples can mitigate this effect.
Finally, we identify a risk that datasets constructed with weaker models in the loop become outdated. For example, RoBERTa achieves 58.2 EM / 73.2 F1 on D_BiDAF, in contrast to 0.0 EM / 5.5 F1 for BiDAF; the RoBERTa result is not far from non-expert human performance of 62.6 EM / 78.5 F1.

Generalisation to Non-Adversarial Data
Compared to standard annotation, the model-in-the-loop approach generally results in a new question distribution. Consequently, models trained on adversarially composed questions might not be able to generalise to standard ("easy") questions, thus limiting the usefulness of the resulting data resource in practice. To what extent do models trained on model-in-the-loop questions generalise differently to standard questions, compared to models trained on standard questions?
To measure this, we further train each of our three models on either D_BiDAF, D_BERT, or D_RoBERTa and test on D_SQuAD, with results in the D_SQuAD columns of Table 5. For comparison, the models are also trained on 10,000 SQuAD1.1 samples (referred to as D_SQuAD(10K)) chosen from the same passages as the adversarial datasets, thus eliminating size and paragraph choice as potential confounding factors. The models are tuned for Exact Match (EM) on our held-out validation data derived from the split SQuAD1.1 validation set after applying the majority vote (D_SQuAD-dev). Note that, for the reasons described earlier, this means that performance values are lower on the majority-vote D_SQuAD dataset than on the unaltered one, but it importantly enables us to make direct comparisons across datasets.
Remarkably, neither BERT nor RoBERTa shows a substantial drop when trained on D_BiDAF compared to training on SQuAD data (−2.0 F1 and −3.3 F1): training these models on a dataset with a weaker model in the loop still leads to strong generalisation, even to data from the original SQuAD distribution on which all models in the loop are trained. BiDAF, on the other hand, fails to learn such information from the adversarially collected data, and drops by more than 30 F1 for each of the new training sets, compared to training on SQuAD.
We furthermore observe a gradual decrease in generalisation to SQuAD as the training set moves from D_BiDAF to D_RoBERTa. This suggests that the stronger the model used in the annotation loop, the more dissimilar the resulting data distribution becomes from the original SQuAD distribution. We will later find further support for this explanation in a qualitative analysis (Section 5). It may, however, also be due to a limitation of BERT and RoBERTa, similar to BiDAF, in learning from a data distribution designed to beat these models; an even stronger model might learn more, e.g. from D_RoBERTa.

Generalisation to DROP and NaturalQuestions
Finally, we investigate to what extent models can transfer skills learned on datasets created with a model in the loop to other datasets, concretely DROP and NaturalQuestions. In this experiment we select the subsets of DROP and NaturalQuestions which align with the structural constraints of SQuAD to ensure a like-for-like analysis. Specifically, we only consider questions in DROP where the answer is a span in the passage and where there is only one candidate answer. For NaturalQuestions, we consider all non-tabular long answers as passages, remove HTML tags and use the short answer as the extracted span. We apply this filtering to the validation sets of both datasets. We then split each of them, stratifying by passage (as we did for D_SQuAD), which results in 1,409/1,418 validation/test set examples for DROP and 964/982 for NaturalQuestions. We denote these datasets as D_DROP and D_NQ for clarity and distinction from their unfiltered versions. We consider the same models and training datasets as before, but tune on the respective validation set portions of D_DROP and D_NQ. Table 5 shows the results of these experiments in the respective D_DROP and D_NQ columns.

First, we observe clear generalisation improvements towards D_DROP across all models when using any of D_BiDAF, D_BERT, or D_RoBERTa for training, compared to training on D_SQuAD(10K). That is, including a model in the loop for the training dataset leads to improved transfer towards D_DROP. Note that the DROP dataset also makes use of a BiDAF model in the loop during annotation; these results are in line with our prior observations when testing the same setups on D_BiDAF, D_BERT and D_RoBERTa, compared to training on D_SQuAD(10K).
Second, we observe overall strong transfer towards D_NQ: up to 71.0 F1 for a BERT model trained on D_BiDAF. Note that this result is similar to, and even slightly improves over, training on SQuAD data of the same size. That is, relative to training on SQuAD data, training on the adversarially collected D_BiDAF does not impede generalisation to D_NQ, which was created without a model in the annotation loop. We then, however, see a negative performance progression similar to the one observed when testing on D_SQuAD: the stronger the model in the annotation loop of the training dataset, the lower the performance on test data from a distribution composed without a model in the loop.
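The extractive filtering described at the start of this section can be sketched as follows; the field names are illustrative assumptions, not the datasets' actual schemas, and preprocessing details such as HTML stripping are omitted.

```python
def keep_drop_example(passage: str, candidate_answers: list[str]) -> bool:
    """DROP: keep questions with exactly one candidate answer that occurs
    as a span of the passage."""
    return len(candidate_answers) == 1 and candidate_answers[0] in passage

def keep_nq_example(is_table_answer: bool, short_answer: str,
                    long_answer_text: str) -> bool:
    """NaturalQuestions: use non-tabular long answers as passages and the
    short answer as the extracted span."""
    return (not is_table_answer) and short_answer in long_answer_text
```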

Qualitative Analysis
Having applied the general model-in-the-loop methodology to models of varying strength, we next perform a qualitative comparison of the nature of the resulting questions. As reference points we also include the original SQuAD questions, as well as DROP and NaturalQuestions, in this comparison: these datasets are both constructed to overcome limitations in SQuAD and have subsets which overlap sufficiently with SQuAD to make analysis possible. Specifically, we seek to understand the qualitative differences in terms of the reading comprehension challenges posed by the questions in each of these datasets.

Comprehension Requirements
There exists a variety of prior work which seeks to understand the types of knowledge, comprehension skills, or types of reasoning required to answer questions based on text (Rajpurkar et al., 2016; Clark et al., 2018; Sugawara et al., 2019; Dua et al., 2019; Dasigi et al., 2019); we are, however, unaware of any commonly accepted formalism. We take inspiration from this work and develop our own taxonomy of comprehension requirements suited to the datasets being analysed; see Appendix D for a detailed breakdown and examples from our annotation catalogue. The labels in this catalogue are neither mutually exclusive nor fully comprehensive; the development of such a catalogue is itself very challenging. Instead, we focus on capturing the most salient characteristics of each given question, and assign it up to three labels from the catalogue. In total, we analyse 100 samples from the validation set of each of the datasets; Fig. 4 displays the results of this analysis.

Observations
An initial observation is that the majority (57%) of answers to SQuAD questions are stated explicitly, without comprehension requirements beyond the literal level. This number decreases substantially for any of the model-in-the-loop datasets derived from SQuAD (e.g. 8% for D_BiDAF) and also for D_DROP, yet 42% of questions in D_NQ share this property. In contrast to SQuAD, the model-in-the-loop questions generally tend to involve more paraphrasing. They also require more external knowledge and multi-hop inference (beyond co-reference resolution), with an increasing trend for stronger models used in the annotation loop. Model-in-the-loop questions further fan out into a variety of small but non-negligible proportions of more specific types of inference required for comprehension, e.g. spatial or temporal inference (both going beyond explicitly stated spatial or temporal information), which SQuAD rarely requires at all. Some of these more particular inference types are common features of the other two datasets, in particular comparative questions for DROP (60%) and, to a small extent, NaturalQuestions. Interestingly, this connects to the results in Table 5, where models trained on D_BiDAF outperformed those trained on D_BERT or D_RoBERTa when evaluated on D_DROP. It is likely that BiDAF as a model in the loop is worse than BERT and RoBERTa at comparative questions, as evidenced by the results in Table 5, with BiDAF reaching 9.3 F1 and RoBERTa reaching 30.9 F1 on D_DROP (when trained on D_SQuAD(10K)).
The distribution of NaturalQuestions contains elements of both the distribution of SQuAD and that of D_BiDAF, which offers a potential explanation for the strong performance on D_NQ of models trained on D_SQuAD(10K) and D_BiDAF. Finally, the gradual shift of the distribution away from both SQuAD and NaturalQuestions as the model-in-the-loop strength increases reflects our prior observations on the decreasing performance on SQuAD and NaturalQuestions of models trained on datasets with progressively stronger models in the annotation loop.

Discussion and Conclusions
We have in this work investigated an RC annotation paradigm which includes a model in the loop that has to be "beaten" by the annotator. Applying this approach with a series of progressively stronger RC models in the annotation loop, we arrived at three separate RC datasets, graduated by the difficulty of the model adversary. Based on this dataset series we investigated several questions surrounding the annotation paradigm, in particular whether such datasets grow outdated as stronger models emerge, and how well they support generalisation to standard (non-adversarially collected) questions. We found that stronger RC models can still learn from data collected with a weak adversary in the loop, and that their generalisation improves even on datasets collected with a very strong adversary. Models trained on data collected with a model in the loop furthermore generalise well to non-adversarially collected data, both on SQuAD and on NaturalQuestions, yet we observe a slow deterioration with progressively stronger adversaries.
We see our work as a contribution towards the emerging paradigm of model-in-the-loop annotation, both in RC and potentially in other tasks. While the scope of this paper is focused on RC, with SQuAD as the original dataset used to train model adversaries, we see no reason in principle why similar findings would not hold for other tasks using the same annotation paradigm, when crowdsourcing the creation of challenging samples with a current model in the loop. We would expect the insights and benefits conveyed by model-in-the-loop annotation to be greatest on mature datasets where models exceed human performance: here the resulting data provides a magnifying glass on model performance, focused in particular on the samples with which models struggle. On the other hand, applying the method to datasets where performance has not yet plateaued would likely result in a distribution more similar to the original data, which is challenging to models a priori. We hope that our experiments on replication, on transfer between datasets collected with model adversaries of different strength, and on generalisation to non-adversarially collected data can support and inform future research and annotation efforts following the model-in-the-loop data collection paradigm.

A Additional Dataset Statistics
Question statistics In Figure 6 we analyse question lengths across SQuAD1.1 and compare them to questions constructed with different models in the annotation loop. While the means of the distributions are similar, there is more question-length variability when using a model in the loop. We also perform an analysis of question types by wh-word as described earlier (see Figure 5). This is displayed in further detail using sunburst plots of the first three question tokens for D_SQuAD (cf. Figure 10), D_BiDAF (cf. Figure 12), D_BERT (cf. Figure 11) and D_RoBERTa (cf. Figure 13). We observe a general trend towards more diverse questions with increasing model-in-the-loop strength. Figure 8 allows for further analysis of answer lengths across datasets. We observe that answers for all datasets constructed with a model in the loop tend to be longer than in SQuAD, and there is a trend of increasing answer length and variability with increasing model-in-the-loop strength. We show an analysis of answer types in Figure 7.

B Annotation Interface Details
We have three key steps in the dataset construction process: i) training and qualification, ii) "Beat the AI" annotation and iii) answer validation.
Training and Qualification This is a combined training and qualification task; a screenshot of the interface is shown in Figure 14. The first step involves a set of five assignments requiring the worker to demonstrate the ability to generate questions and indicate answers by highlighting the corresponding spans in the passage. Once this is complete, the worker is shown a sample "Beat the AI" HIT for a pre-determined passage, which helps facilitate manual validation. In earlier experiments these two steps were presented as separate interfaces; however, this created a bottleneck between the two layers of qualification and slowed down annotation considerably. In total, 1,386 workers completed this task, with 752 being assigned the qualification.
"Beat the AI Annotation" The "Beat the AI" question generation HIT presents workers with a randomly selected passage from SQuAD1.1, about which workers are expected to generate questions and provide answers. This data is sent to the corresponding model-in-the-loop API running on AWS infrastructure and primarily consisting of a load balancer and a t2.xlarge EC2 instance with the T2/T3 Unlimited setting enabled to allow high sustained CPU performance during annotation runs. The model API returns a prediction which is scored against the worker's answer to determine whether the worker has successfully managed to "beat" the model. Only questions which the model fails to answer are considered valid; a screenshot for this interface is shown in Figure 15. Workers are tasked to ideally submit at least three valid questions, however fewer are also accepted -in particular for very short passages. A sample of each worker's HITs is manually validated; those who do not satisfy the question quality requirements have their qualification revoked and all their annotated data discarded. This was the case for 99 workers. Worker validation distributions are shown in Figure 9.
Answer Validation The answer validation interface (cf. Figure 16) is used to validate the answerability of the validation and test sets for each different model used in the annotation loop. Every previously collected question generation HIT from these dataset parts, which had not been discarded during manual validation, is submitted to at least 3 distinct annotators. Workers are shown the passage and the previously generated questions, and are asked to highlight the answer in the passage. In a post-processing step, only questions with at least 1 valid matching answer out of 3 are finally retained.

C Examples of Annotated Questions
In Table 7 we provide a few examples of the questions collected with each different model in the annotation loop.

Table 7: Examples of questions collected using different models in the annotation loop. The annotated answer is highlighted in yellow.

Model: BiDAF
Passage: [...]
Question: The Eucharist was one of how many issues debated by those in attendance of the meeting?

Model: BiDAF
Passage: In a purely capitalist mode of production (i.e. where professional and labor organizations cannot limit the number of workers) the workers wages will not be controlled by these organizations, or by the employer, but rather by the market. Wages work in the same way as prices for any other good. Thus, wages can be considered as a function of market price of skill. And therefore, inequality is driven by this price.
Question: What determines worker wages?

Model: BERT
Passage: Jochi died in 1226, during his father's lifetime. Some scholars, notably Ratchnevsky, have commented on the possibility that Jochi was secretly poisoned by an order from Genghis Khan. Rashid al-Din reports that the great Khan sent for his sons in the spring of 1223, and while his brothers heeded the order, Jochi remained in Khorasan. Juzjani suggests that the disagreement arose from a quarrel between Jochi and his brothers in the siege of Urgench.
Question: Who went to Khan after his order in 1223?

Model: BERT
Passage: In the Sandgate area, to the east of the city and beside the river, resided the close-knit community of keelmen and their families. They were so called because they worked on the keels, boats that were used to transfer coal from the river banks to the waiting colliers, for export to London and elsewhere. In the 1630s about 7,000 out of 20,000 inhabitants of Newcastle died of plague [...]
Question: Where did almost half the people die?

Model: BERT
Passage: The Grainger Market replaced an earlier market originally built in 1808 called the Butcher Market. [...]
Question: [...]

Model: BERT
Passage: [...]
Question: Luther's reformed hymn did not feature stanzas of what quantity?

Model: RoBERTa
Passage: Aken, adopted by Mexican movie actress Lupe Mayorga, grew up in the neighboring town of Madera and his song chronicled the hardships faced by the migrant farm workers he saw as a child.
Question: When did Aken encounter the topic of his song?

Model: RoBERTa
Passage: Newton's leading receivers were tight end Greg Olsen, who caught a career-high 77 passes for 1,104 yards and seven touchdowns, and wide receiver Ted Ginn, Jr., who caught 44 passes for 739 yards and 10 touchdowns; Ginn also rushed for 60 yards and returned 27 punts for 277 yards. Other key receivers included veteran Jerricho Cotchery (39 receptions for 485 yards), rookie Devin Funchess (31 receptions for 473 yards and five touchdowns), and second-year receiver Corey Brown (31 receptions for 447 yards).
Question: Who caught the second most passes?


D Catalogue of Comprehension Requirements

We give a description of each item in our catalogue of comprehension requirements in Table 8, accompanied by an example for illustration. These are the labels used for the qualitative analysis performed in Section 5.

Table 8: Catalogue of comprehension requirements, with a description and an example passage and question for each label.

Explicit
Description: Answer stated nearly word-for-word in the passage as it is in the question.
Passage: Sayyid Abul Ala Maududi was an important early twentieth-century figure in the Islamic revival in India [...]
Question: Who was an important early figure in the Islamic revival in India?

Paraphrasing
Description: Question paraphrases parts of the passage, generally relying on context-specific synonyms.
Passage: Seamans' establishment of an ad-hoc committee [...]
Question: Who created the ad-hoc committee?

External Knowledge
Description: The question cannot be answered without access to sources of knowledge beyond the passage.
Passage: [...]
Question: Into what family did the artist who represented the Art Deco style marry?

Comparative
Description: Requires a comparison between two or more attributes (e.g. smaller than, last).
Passage: The previous chairs were Rajendra K. Pachauri, elected in May 2002; Robert Watson in 1997; and Bert Bolin in 1988.
Question: Who was elected earlier, Robert Watson or Bert Bolin?

Numeric
Description: Any numeric reasoning (e.g. some form of calculation) is required to arrive at the correct answer.
Passage: [...] it has been estimated that Africans will make up at least 30% of the delegates at the 2012 General Conference, and it is also possible that 40% of the delegates will be from outside [...]
Question: From which continent is it estimated that members will make up nearly a third of participants in 2012?

Negation
Description: Requires interpreting one or multiple negations.
Passage: Subordinate to the General Conference are the jurisdictional and central conferences which also meet every four years.
Question: What is not in charge?

Filtering
Description: Narrowing down a set of answers to select one by some particular distinguishing feature.
Passage: [...] was engaged with Johannes Bugenhagen, Justus Jonas, Johannes Apel, Philipp Melanchthon and Lucas Cranach the Elder and his wife as witnesses [...]
Question: Whose partner could testify to the couple's agreement to marry?

Temporal
Description: Requires an understanding of time and change and related aspects; goes beyond directly stated answers to When questions or external knowledge.
Passage: In 2010 the Amazon rainforest experienced another severe drought, in some ways more extreme than the 2005 drought.
Question: What occurred in 2005 and then again five years later?

Spatial
Description: Requires an understanding of the concept of space, location, or proximity; goes beyond finding directly stated answers to Where questions.