Abstract
Adversarial evaluation stress-tests a model’s understanding of natural language. Because past approaches expose superficial patterns, the resulting adversarial examples are limited in complexity and diversity. We propose human-in-the-loop adversarial generation, where human authors are guided to break models. We aid the authors with interpretations of model predictions through an interactive user interface. We apply this generation framework to a question answering task called Quizbowl, where trivia enthusiasts craft adversarial questions. The resulting questions are validated via live human–computer matches: Although the questions appear ordinary to humans, they systematically stump neural and information retrieval models. The adversarial questions cover diverse phenomena from multi-hop reasoning to entity type distractors, exposing open challenges in robust question answering.
1 Introduction
Proponents of machine learning claim human parity on tasks like reading comprehension (Yu et al., 2018) and commonsense inference (Devlin et al., 2018). Despite these successes, many evaluations neglect that computers solve natural language processing (NLP) tasks in a fundamentally different way than humans.
Models can succeed without developing “true” language understanding, instead learning superficial patterns from crawled (Chen et al., 2016) or manually annotated data sets (Gururangan et al., 2018; Kaushik and Lipton, 2018). Thus, recent work stress-tests models via adversarial evaluation: elucidating a system’s capabilities by exploiting its weaknesses (Jia and Liang, 2017; Belinkov and Glass, 2019). Unfortunately, whereas adversarial evaluation reveals simplistic model failures (Ribeiro et al., 2018; Mudrakarta et al., 2018), exploring more complex failure patterns requires human involvement (Figure 1): Automatically modifying natural language examples without invalidating them is difficult. Hence, the diversity of adversarial examples is often severely restricted.
Instead, our human–computer hybrid approach uses human creativity to generate adversarial examples. A user interface presents model interpretations and helps users craft model-breaking examples (Section 3). We apply this to a question answering (qa) task called Quizbowl, where trivia enthusiasts—who write questions for academic competitions—create diverse examples that stump existing qa models.
The adversarially authored test set is nonetheless as easy as regular questions for humans (Section 4), but the relative accuracy of strong qa models drops as much as 40% (Section 5). We also host live human vs. computer matches—where models typically defeat top human teams—but observe spectacular model failures on adversarial questions.
Analyzing the adversarial edits uncovers phenomena that humans can solve but computers cannot (Section 6), validating that our framework uncovers creative, targeted adversarial edits (Section 7). Our resulting adversarial data set presents a fun, challenging, and diverse resource for future qa research: A system that masters it will demonstrate more robust language understanding.
2 Adversarial Evaluation for nlp
Adversarial examples (Szegedy et al., 2013) often reveal model failures better than traditional test sets. However, automatic adversarial generation (e.g., by replacing words) is tricky for nlp: It is difficult to perturb an example without changing its meaning or invalidating it.
Recent work sidesteps this by focusing on simple transformations that preserve meaning. For instance, Ribeiro et al. (2018) generate adversarial perturbations such as replacing What has → What’s. Other minor perturbations such as typos (Belinkov and Bisk, 2018), adding distractor sentences (Jia and Liang, 2017; Mudrakarta et al., 2018), or character replacements (Ebrahimi et al., 2018) preserve meaning while degrading model performance.
Generative models can discover more adversarial perturbations but require post hoc human verification of the examples. For example, neural paraphrase or language models can generate syntax modifications (Iyyer et al., 2018), plausible captions (Zellers et al., 2018), or nli premises (Zhao et al., 2018). These methods improve example-level diversity but mainly target a specific phenomenon (e.g., rewriting question syntax).
Furthermore, existing adversarial perturbations are restricted to sentences—not the paragraph inputs of Quizbowl and other tasks—due to challenges in long-text generation. For instance, syntax paraphrase networks (Iyyer et al., 2018) applied to Quizbowl only yield valid paraphrases 3% of the time (Appendix A).
2.1 Putting a Human in the Loop
Instead, we task human authors with adversarial writing of questions: generating examples that break a specific qa system but are still answerable by humans. We expose model predictions and interpretations to question authors, who find question edits that confuse the model.
The user interface makes the adversarial writing process interactive and model-driven, in contrast to adversarial examples written independently of a model (Ettinger et al., 2017). The result is an adversarially authored data set that explicitly exposes a model’s limitations by design.
Human-in-the-loop generation can replace or aid model-based adversarial generation approaches. Creating interfaces and interpretations is often easier than designing and training generative models for specific domains. In domains where adversarial generation is feasible, human creativity can reveal which tactics automatic approaches can later emulate. Model-based and human-in-the-loop generation approaches can also be combined by training models to mimic human adversarial edit history, using the relative merits of both approaches.
3 Our QA Testbed: Quizbowl
The “gold standard” of academic competitions between universities and high schools is Quizbowl. Unlike qa formats such as Jeopardy! (Ferrucci et al., 2010), Quizbowl questions are designed to be interrupted: Questions are read to two competing teams and whoever knows the answer first interrupts the question and “buzzes in.”
This style of play requires questions to be structured “pyramidally” (Jose, 2017): Questions start with difficult clues and get progressively easier. These questions are carefully crafted to allow the most knowledgeable player to answer first. A question on Paris that begins “this capital of France” would test reaction speed, not knowledge; thus, skilled authors arrange the clues so players will recognize them with increasing probability (Figure 2).
The answers to Quizbowl questions are typically well-known entities. In the qa community (Hirschman and Gaizauskas, 2001), this is called “factoid” qa: The entities come from a relatively closed set of possible answers.
3.1 Known Exploits of Quizbowl Questions
Like most qa data sets, Quizbowl questions are written for humans. Unfortunately, the heuristics that question authors use to select clues do not always apply to computers. For example, humans are unlikely to memorize every song in every opera by a particular composer. This, however, is trivial for a computer. In particular, a simple qa system easily solves the example in Figure 2 from seeing the reference to “Un Bel Di”. Other questions contain uniquely identifying “trigger words” (Harris, 2006). For example, “martensite” only appears in questions on steel. For these examples, a qa system needs to understand no additional information other than an if–then rule.
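To make this concrete, such an if–then rule amounts to little more than a lookup table. The sketch below is purely illustrative: the trigger-to-answer mapping is a placeholder, not part of any actual system described in this paper.

```python
from typing import Optional

# Illustrative "trigger word" rule: a memorized phrase-to-answer mapping.
# The entries here are placeholders; a real system would induce such
# associations from training questions (e.g., "martensite" only appears
# in questions on steel).
TRIGGERS = {
    "martensite": "Steel",
}


def trigger_answer(question: str) -> Optional[str]:
    """Return a memorized answer if a trigger phrase appears, else None."""
    text = question.lower()
    for phrase, answer in TRIGGERS.items():
        if phrase in text:
            return answer
    return None  # fall back to a stronger model
```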
One might wonder whether this means that factoid qa is thus an uninteresting, nearly solved research problem. However, some Quizbowl questions are fiendishly difficult for computers. Many questions have intricate coreference patterns (Guha et al., 2015), require reasoning across multiple types of knowledge, or involve complex wordplay. If we can isolate and generate questions with these difficult phenomena, “simplistic” factoid qa quickly becomes non-trivial.
3.2 Models and Data Sets
We conduct two rounds of adversarial writing. In the first, authors attack a traditional information retrieval (ir) system. The ir model is the baseline from a nips 2017 shared task on Quizbowl (Boyd-Graber et al., 2018) based on ElasticSearch (Gormley and Tong, 2015).
In the second round, authors attack either the ir model or a neural qa model. The neural model is a bidirectional recurrent neural network (rnn) using the gated recurrent unit architecture (Cho et al., 2014). The model treats Quizbowl as classification and predicts the answer entity from a sequence of words represented as 300-dimensional GloVe embeddings (Pennington et al., 2014). Both models in this round are trained on an expanded data set of approximately 110,000 Quizbowl questions, which incorporates more diverse answers (25,000 entities vs. 11,000 in round one).
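The sketch below illustrates this kind of rnn answer classifier in PyTorch. The hidden size and the use of the final hidden state are illustrative assumptions, not the exact configuration of our model.

```python
import torch
import torch.nn as nn


class GruQuizbowlClassifier(nn.Module):
    """Sketch of an RNN answer classifier: a bidirectional GRU over 300-d word
    embeddings followed by a linear layer over the answer entities."""

    def __init__(self, vocab_size, num_answers, embed_dim=300, hidden_dim=512):
        super().__init__()
        # In practice the embedding layer would be initialized from GloVe vectors.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.output = nn.Linear(2 * hidden_dim, num_answers)

    def forward(self, token_ids):          # token_ids: (batch, seq_len)
        embedded = self.embed(token_ids)   # (batch, seq_len, embed_dim)
        states, _ = self.gru(embedded)     # (batch, seq_len, 2 * hidden_dim)
        # Treat Quizbowl as classification: predict the answer entity
        # from the representation at the final position.
        return self.output(states[:, -1])  # (batch, num_answers) logits
```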
3.3 Interpreting Quizbowl Models
To help write adversarial questions, we expose what the model is thinking to the authors. We interpret models using saliency heat maps: Each word of the question is highlighted based on its importance to the model’s prediction (Ribeiro et al., 2016).
For the neural model, word importance is the decrease in prediction probability when a word is removed (Li et al., 2016; Wallace et al., 2018). We focus on gradient-based approximations (Simonyan et al., 2014; Montavon et al., 2018) for their computational efficiency: They simulate how the model’s prediction changes when a particular word’s embedding is set to the zero vector, which approximates word removal (Ebrahimi et al., 2018; Wallace et al., 2018).
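As a concrete sketch, this first-order approximation can be computed as the dot product between each word’s embedding and the gradient of the predicted answer’s score with respect to that embedding. The code below assumes a classifier structured like the sketch above and is illustrative rather than our exact implementation.

```python
import torch


def gradient_saliency(model, token_ids, answer_index):
    """Approximate each word's importance as the drop in the predicted answer's
    score when that word's embedding is zeroed, via a first-order estimate."""
    model.eval()
    embedded = model.embed(token_ids)           # (1, seq_len, embed_dim)
    embedded.retain_grad()                      # keep gradients on this non-leaf tensor
    states, _ = model.gru(embedded)
    logits = model.output(states[:, -1])
    logits[0, answer_index].backward()
    # First-order Taylor approximation of removing word i:
    # score(x) - score(x with e_i = 0) ≈ grad_i · e_i
    saliency = (embedded.grad * embedded).sum(dim=-1).squeeze(0)
    return saliency.tolist()                    # one importance score per word
```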
For the ir model, we use the ElasticSearch Highlight api (Gormley and Tong, 2015), which provides word importance scores based on query matches from the inverted index.
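For reference, a highlight query of this kind might look as follows. The index name, document fields, and client setup are assumptions for illustration, not our exact configuration.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()  # assumes a locally running ElasticSearch instance


def ir_highlights(question_text, index="quizbowl-questions", field="text"):
    """Retrieve the top matches for a partial question along with the
    highlighted (i.e., matched) words returned by the Highlight API."""
    response = es.search(index=index, body={
        "query": {"match": {field: question_text}},
        "highlight": {"fields": {field: {}}},
        "size": 5,
    })
    return [
        # "answer" is a hypothetical field storing each training question's answer.
        (hit["_source"].get("answer"), hit["highlight"][field])
        for hit in response["hits"]["hits"]
        if "highlight" in hit
    ]
```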
3.4 Adversarial Writing Interface
The authors interact with either the ir or rnn model through a user interface (Figure 3). An author writes their question in the upper right and the model’s top five predictions (Machine Guesses) appear in the upper left. If the top prediction is the right answer, the interface indicates where in the question the model is first correct. The goal is to cause the model to be incorrect or to delay the correct answer position as much as possible. The words of the current question are highlighted using the applicable interpretation method in the lower right (Evidence). We do not enforce time restrictions or require questions to be adversarial: If the author fails to break the system, they are free to “give up” and submit any question.
3.5 Question Authors
We focus on members of the Quizbowl community: They have deep trivia knowledge and craft questions for Quizbowl tournaments (Jennings, 2006). We award prizes for questions read at live human–computer matches (Section 5.3).
The question authors are familiar with the standard format of Quizbowl questions (Lujan and Teitler, 2003). The questions follow a common paragraph structure, are well edited for grammar, and finish with a simple “give-away” clue. These constraints benefit the adversarial writing process as it is very clear what constitutes a difficult but valid question. Thus, our examples go beyond surface level “breaks” such as character noise (Belinkov and Bisk, 2018) or syntax changes (Iyyer et al., 2018). Rather, questions are difficult because of their semantic content (examples in Section 6).
3.6 How an Author Writes a Question
To see how an author might write a question with the interface, we walk through an example of writing a question’s first sentence. The author first selects the answer to their question from the training set—Johannes Brahms—and begins:
Karl Ferdinand Pohl showed this composer some pieces on which this composer’s Variations on a Theme by Haydn were based.
The qa system buzzes (i.e., it has enough information to interrupt and answer correctly) after “composer”. The author sees that the name “Karl Ferdinand Pohl” appears in Brahms’ Wikipedia page and avoids that specific phrase, describing Pohl’s position instead of naming him directly:
This composer was given a theme called “Chorale St. Antoni” by the archivist of the Vienna Musikverein, which could have been written by Ignaz Pleyel.
This rewrite adds additional information (there is scholarly disagreement over who wrote the theme and what it is called), and the qa system now incorrectly thinks the answer is Frédéric Chopin. The author can continue to build on the theme, writing
While summering in Tutzing, this composer turned that theme into “Variations on a Theme by Haydn”.
Again, the author sees that the system buzzes with the correct answer after “Variations on a Theme”. However, the author can rewrite the title in its original German, “Variationen über ein Thema von Haydn”, to fool the system. The author continues in this manner to create entire questions the model cannot solve.
4 A New Adversarially Authored Data Set
Table 1: Categories of the 1,213 adversarially authored questions.

| Category | Percentage |
|---|---|
| Science | 17% |
| History | 22% |
| Literature | 18% |
| Fine Arts | 15% |
| Religion, Mythology, Philosophy, and Social Science | 13% |
| Current Events, Geography, and General Knowledge | 15% |
| Total Questions | 1,213 |
4.1 Validating Questions with Quizbowlers
We validate that the adversarially authored questions are not of poor quality or too difficult for humans. We first automatically filter out questions based on length, the presence of vulgar statements, or repeated submissions (including re-submissions from the Quizbowl training or evaluation data).
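A sketch of such an automatic filter is below; the length bounds, vulgarity list, and normalization are placeholders rather than the exact criteria we applied.

```python
def keep_question(question, seen_texts, existing_corpus,
                  min_words=20, max_words=200, banned_words=frozenset()):
    """Return True if a submitted question passes the automatic filters:
    reasonable length, no vulgar terms, and not a repeat of a prior submission
    or of a question already in the training/evaluation data."""
    words = question.lower().split()
    if not (min_words <= len(words) <= max_words):
        return False
    if any(w in banned_words for w in words):
        return False
    normalized = " ".join(words)
    if normalized in seen_texts or normalized in existing_corpus:
        return False
    seen_texts.add(normalized)
    return True
```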
We next host a human-only Quizbowl event using intermediate and expert players (former and current collegiate Quizbowl players). We select 60 adversarially authored questions and 60 standard high school national championship questions, both with the same number of questions per category (list of categories in Table 1).
To answer a Quizbowl question, a player interrupts the question—the earlier the better. To capture this dynamic, we record both the average answer position (as a percentage of the question, lower is better) and answer accuracy. We shuffle the regular and adversarially authored questions, read them to players, and record these two metrics.
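These two metrics can be computed directly from per-question buzz records, as in the sketch below; the record format is illustrative.

```python
def score_buzzes(records):
    """Each record: (buzz_position, question_length, is_correct).
    Returns the average answer position (as a fraction of the question,
    lower is better) and the answer accuracy."""
    positions = [pos / length for pos, length, _ in records]
    accuracy = sum(correct for *_, correct in records) / len(records)
    avg_position = sum(positions) / len(positions)
    return avg_position, accuracy


# Example: buzz at word 40 of an 80-word question and answer correctly,
# then buzz at word 70 of a 100-word question and answer incorrectly.
avg_pos, acc = score_buzzes([(40, 80, True), (70, 100, False)])
```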
The adversarially authored questions are on average easier for humans than the regular test questions. For the adversarially authored set, humans buzz in with 41.6% of the question remaining and an accuracy of 89.7%. On the standard questions, humans buzz in with 28.3% of the question remaining and an accuracy of 84.2%. The difference in accuracy between the two types of questions is not significant (p = 0.16 using Fisher’s exact test), but the buzzing position is earlier for adversarially authored questions (p = 0.0047 for a two-sided t-test). We expect the questions that were not played to be of comparable difficulty because they went through the same submission process and post-processing. We further explore the human-perceived difficulty of the adversarially authored questions in Section 5.3.
5 Computer Experiments
This section evaluates qa systems on the adversarially authored questions. We test three models: the ir and rnn models shown in the interface, as well as a Deep Averaging Network (dan; Iyyer et al., 2015) to evaluate the transferability of the adversarial questions. We break our study into two rounds. The first round consists of adversarially authored questions written against the ir system (Section 5.1); the second-round questions target both the ir and rnn models (Section 5.2).
Finally, we also hold live competitions that pit the state-of-the-art Studio Ousia model (Yamada et al., 2018) against human teams (Section 5.3).
5.1 First-Round Attacks: IR Adversarial Questions Transfer To All Models
The first round of adversarially authored questions targets the ir model, and these questions are significantly harder for the ir, rnn, and dan models (Figure 4). For example, the dan’s accuracy drops from 54.1% to 32.4% on the full question (60% of original performance).
For both adversarially authored and original test questions, the early clues are difficult to answer (near zero accuracy for the first 10–25% of the question). However, during the middle third of the questions, where buzzes in Quizbowl most frequently occur, the accuracy on original test questions rises significantly more quickly than on the adversarially authored ones. For both types of questions, the accuracy rises towards the end as the clues become “give-aways”.
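Curves like these can be produced by re-running a model on increasing prefixes of each question. The sketch below assumes a hypothetical model.guess(text) interface that returns the model’s top answer for a partial question.

```python
import numpy as np


def accuracy_by_position(model, questions, positions=np.linspace(0.1, 1.0, 10)):
    """Accuracy of the model's top guess when shown only the first fraction of
    each question, for several fractions. `questions` is a list of
    (question text, gold answer) pairs; `model.guess(text)` is an assumed
    interface returning the top answer for a partial question."""
    curve = []
    for frac in positions:
        correct = 0
        for text, answer in questions:
            words = text.split()
            prefix = " ".join(words[: max(1, int(len(words) * frac))])
            correct += (model.guess(prefix) == answer)
        curve.append(correct / len(questions))
    return list(positions), curve
```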
5.2 Second-Round Attacks: RNN Adversarial Questions are Brittle
In the second round, the authors also attack an rnn model. All models tested in the second round are trained on a larger data set (Section 3.2).
A similar trend holds for ir adversarial questions in the second round (Figure 5): A question that tricks the ir system also fools the two neural models (i.e., adversarial examples transfer). For example, the dan model was never targeted but had substantial accuracy decreases in both rounds.
This does not hold for questions written adversarially against the rnn model, however. On these questions, the neural models struggle but the ir model is largely unaffected (Figure 5, right).
5.3 Humans vs. Computer, Live!
In the offline setting (i.e., no pressure to “buzz in” before an opponent), models demonstrably struggle on the adversarial questions. But, what happens in standard Quizbowl—live, head-to-head games?
We run two live humans vs. computer matches. The first match uses ir adversarial questions in a 40-question, tossup-only Quizbowl format. We pit a human team of national-level Quizbowl players against the Studio Ousia model (Yamada et al., 2018), the current state-of-the-art Quizbowl system. The model combines neural, ir, and knowledge graph components (details in Appendix B), and won the 2017 nips shared task, defeating a team of expert humans 475 to 200 on regular Quizbowl test questions. Although the team at our live event was comparable to the nips 2017 team, the tables were turned: The human team won handily, 300 to 30.
Our second live event is significantly larger: Seven human teams play against models on over 400 questions written adversarially against the rnn model. The human teams range in ability from high school Quizbowl players to national-level teams (Jeopardy! champions, Academic Competition Federation national champions, top scorers in the World Quizzing Championships). The models are based on either ir or neural methods. Despite a few close games between the weaker human teams and the models, humans prevailed in every match.
Figures 6 and 7 summarize the live match results for the humans and Ousia model, respectively. Humans and models have considerably different trends in answer accuracy. Human accuracy on both regular and adversarial questions rises quickly in the last half of the question (curves in Figure 6). In essence, the “give-away” clues at the end of questions are easy for humans to answer.
On the other hand, models on regular test questions do well in the first half, i.e., the “difficult” clues for humans are easier for models (Regular Test in Figure 7). However, models, like humans, struggle on adversarial questions in the first half.
6 What Makes Adversarially Authored Questions Hard?
This section analyzes the adversarially authored questions to identify the source of their difficulty.
6.1 Quantitative Differences in Questions
One possible source of difficulty is data scarcity: The answers to adversarial questions rarely appear in the training set. However, this is not the case: The mean number of training examples per answer (e.g., George Washington) is 14.9 for the adversarial questions versus 16.9 for the regular test data.
Another explanation for question difficulty is limited “overlap” with the training data, namely, that models cannot match n-grams from the training clues. We measure the proportion of test n-grams that also appear in training questions with the same answer (Table 2). The overlap is roughly equal for unigrams but surprisingly higher for adversarial questions’ bigrams. The adversarial questions are also shorter and have fewer named entities (nes). However, the proportion of nes is roughly equivalent.
Table 2: Quantitative comparison of the adversarially authored and regular test questions (ne = named entity). The ir Adversarial and rnn Adversarial rows break down the average ne overlap by the model targeted.

| | Adversarial | Regular |
|---|---|---|
| Unigram overlap | 0.40 | 0.37 |
| Bigram overlap | 0.08 | 0.05 |
| Longest n-gram overlap | 6.73 | 6.87 |
| Average ne overlap | 0.38 | 0.46 |
| ir Adversarial | 0.35 | |
| rnn Adversarial | 0.44 | |
| Total Words | 107.1 | 133.5 |
| Total ne | 9.1 | 12.5 |
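The overlap statistics in Table 2 can be computed along the following lines; the tokenization and normalization here are simplified relative to our actual preprocessing.

```python
from collections import defaultdict


def ngrams(tokens, n):
    """Set of n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(test_questions, train_questions, n=1):
    """Average proportion of a test question's n-grams that also appear in
    training questions sharing the same answer. Questions are
    (token list, answer) pairs."""
    train_by_answer = defaultdict(set)
    for tokens, answer in train_questions:
        train_by_answer[answer] |= ngrams(tokens, n)
    overlaps = []
    for tokens, answer in test_questions:
        test_ngrams = ngrams(tokens, n)
        if test_ngrams:
            shared = test_ngrams & train_by_answer[answer]
            overlaps.append(len(shared) / len(test_ngrams))
    return sum(overlaps) / len(overlaps)
```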
One difference between the questions written against the ir system and the ones written against the rnn model is the drop in nes. The decrease in nes is higher for ir adversarial questions, which may explain their generalization: The rnn is more sensitive to changes in phrasing, whereas the ir system is more sensitive to specific words.
6.2 Categorizing Adversarial Phenomena
We next qualitatively analyze adversarially authored questions. We manually inspect the author edit logs for a random sample of 100 questions, classifying each question into six phenomena across two broad categories (Table 3) and counting a question under multiple phenomena when applicable.
Table 3: Phenomena in the adversarially authored questions, computed over a random sample of 100 questions (a question can exhibit multiple phenomena).

| Phenomenon | Percentage |
|---|---|
| Composing Seen Clues | 15% |
| Logic & Calculations | 5% |
| Multi-Step Reasoning | 25% |
| Paraphrases | 38% |
| Entity Type Distractors | 7% |
| Novel Clues | 26% |
| Total Questions | 1,213 |
6.2.1 Adversarial Category 1: Reasoning
The first question category requires reasoning about known clues (Table 4).
Table 4: Adversarially authored questions that require reasoning about known clues, along with the model’s incorrect prediction and the correct answer.

| Question | Prediction | Answer | Phenomenon |
|---|---|---|---|
| This man, who died at the Battle of the Thames, experienced a setback when his brother Tenskwatawa’s influence over their tribe began to fade. | Battle of Tippecanoe | Tecumseh | Composing Seen Clues |
| This number is one hundred fifty more than the number of Spartans at Thermopylae. | Battle of Thermopylae | 450 | Logic & Calculations |
| A building dedicated to this man was the site of the “I Have A Dream” speech. | Martin Luther King Jr. | Abraham Lincoln | Multi-Step Reasoning |
Composing Seen Clues:
These questions provide entities with a first-order relationship to the correct answer. The system must triangulate the correct answer by “filling in the blank”. For example, the first question of Table 4 names the place of death of Tecumseh. The training data contains a question about his death reading “though stiff fighting came from their Native American allies under Tecumseh, who died at this battle” (The Battle of the Thames). The system must connect these two clues to answer.
Logic & Calculations:
These questions require mathematical or logical operators. For example, the training data contains a clue about the Battle of Thermopylae: “King Leonidas and 300 Spartans died at the hands of the Persians.” The second question in Table 4 requires adding 150 to the number of Spartans.
Multi-Step Reasoning:
This question type requires multiple reasoning steps between entities. For example, the last question of Table 4 requires a reasoning step from the “I Have A Dream” speech to the Lincoln Memorial and then another reasoning step to reach Abraham Lincoln.
6.2.2 Adversarial Category 2: Distracting Clues
The second category consists of circumlocutory clues (Table 5).
Table 5: Adversarially authored questions with distracting clues, shown alongside related training clues and the model’s incorrect prediction.

| Set | Question | Prediction | Phenomenon |
|---|---|---|---|
| Training | Name this sociological phenomenon, the taking of one’s own life. | Suicide | Paraphrase |
| Adversarial | Name this self-inflicted method of death. | Arthur Miller | |
| Training | Clinton played the saxophone on The Arsenio Hall Show. | Bill Clinton | |
| Adversarial | He was edited to appear in the film “Contact”… For ten points, name this American president who played the saxophone on an appearance on the Arsenio Hall Show. | Don Cheadle | Entity Type Distractor |
Paraphrases:
A common adversarial modification is to paraphrase clues to remove exact n-gram matches from the training data. This renders our ir system useless but also hurts the neural models. Many of the adversarial paraphrases go beyond syntax-only changes (e.g., the first row of Table 5).
Entity Type Distractors:
Whether explicit or implicit in a model, one key component for qa is determining the answer type of the question. Authors take advantage of this by providing clues that cause the model to select the wrong answer type. For example, in the second question of Table 5, the “lead-in” clue implies the answer may be an actor. The rnn model answers Don Cheadle in response, despite previously seeing the Bill Clinton “playing a saxophone” clue in the training data.
Novel Clues:
Some adversarially authored questions are hard not because of phrasing or logic but because our models have not seen these clues. These questions are easy to create: Users can add Novel Clues that—because they are not uniquely associated with an answer—confuse the models. While not as linguistically interesting, novel clues are not captured by Wikipedia or Quizbowl data, thus improving the data set’s diversity. For example, adding clues about literary criticism (Hardwick, 1967; Watson, 1996) to a question about Lillian Hellman’s The Little Foxes: “Ritchie Watson commended this play’s historical accuracy for getting the price for a dozen eggs right—ten cents—to defend against Elizabeth Hardwick’s contention that it was a sentimental history.” Novel clues create an incentive for models to use information beyond past questions and Wikipedia.
Novel clues have different effects on ir and neural models: Whereas ir models largely ignore them, novel clues can lead neural models astray. For example, on a question about Tiananmen Square, the rnn model buzzes on the clue “World Economic Herald”. However, adding a novel clue about “the history of shaving” renders the brittle rnn unable to buzz on the “World Economic Herald” clue that it was able to recognize before. This helps to explain why adversarially authored questions written against the rnn do not stump ir models.
7 How Do Interpretations Help?
This section explores how model interpretations help to guide adversarial authors. We analyze the question edit log, which reflects how authors modify questions given a model interpretation.
A direct edit of the highlighted words often creates an adversarial example (e.g., Figure 8). Figure 9 shows a more intricate example. The left plot shows the Question Length, as well as the position where the model is first correct (Buzzing Position, lower is better). We show two adversarial edits. In the first (1), the author removes the first sentence of the question, which makes the question easier for the model (buzzing position decreases). The author counteracts this in the second edit (2), where they use the interpretation to craft a targeted modification that breaks the ir model.
However, models are not always this brittle. In Figure C.1, the interpretation fails to aid an adversarial attack against the rnn model. At each step, the author uses the highlighted words as a guide to edit targeted portions of the question yet fails to trick the model. The author gives up and submits their relatively non-adversarial question.
7.1 Interviews With Adversarial Authors
We also interview the adversarial authors who attended our live events. Multiple authors agree that identifying oft-repeated “stock” clues was the interface’s most useful feature. As one author explained, “There were clues which I did not think were stock clues but were later revealed to be.” In particular, the author’s question about the Congress of Vienna used a clue about “Kraków becoming a free city,” which the model immediately recognized.
Another interviewee was Jordan Brownstein, a national Quizbowl champion and one of the best active players, who felt that computer opponents were better at questions that contained direct references to battles or poetry. He also explained how the different writing styles used by each Quizbowl author increase the difficulty of questions for computers. The interface’s evidence panel allows authors to read existing clues that encourage these unique stylistic choices.
8 Related Work
New data sets often allow for a finer-grained analysis of a linguistic phenomenon, task, or genre. The lambada data set (Paperno et al., 2016) tests a model’s understanding of the broad contexts present in book passages, whereas the Natural Questions corpus (Kwiatkowski et al., 2019) combs Wikipedia for answers to questions that users trust search engines to answer (Oeldorf-Hirsch et al., 2014). Other work focuses on natural language inference, where challenge examples highlight model failures (Glockner et al., 2018; Naik et al., 2018; Wang et al., 2019). Our work is unique in that we use human adversaries to expose model weaknesses, which provides a diverse set of phenomena (from paraphrases to multi-hop reasoning) that models cannot solve.
Other work puts an adversary in the data annotation or postprocessing loop. For instance, Dua et al. (2019) and Zhang et al. (2018) filter out easy questions using a baseline qa model, and Zellers et al. (2018) use stylistic classifiers to filter language inference examples. Rather than filtering out easy questions, we use human adversaries to generate hard ones. Similar to our work, Ettinger et al. (2017) use human adversaries. We extend their setting by providing humans with model interpretations to facilitate adversarial writing. Moreover, we have a ready-made audience of question writers to generate adversarial questions.
The collaborative adversarial writing process reflects the complementary abilities of humans and computers. For instance, “centaur” chess teams of both a human and a computer are often stronger than a human or computer alone (Case, 2018). In Starcraft, humans devise high-level “macro” strategies, whereas computers are superior at executing fast and precise “micro” actions (Vinyals et al., 2017). In nlp, computers aid simultaneous human interpreters (He et al., 2016) at remembering forgotten information or translating unfamiliar words.
Finally, recent approaches to adversarial evaluation of nlp models (Section 2) typically target one phenomenon (e.g., syntactic modifications) and complement our human-in-the-loop approach.
9 Conclusion
One of the challenges of machine learning is knowing why systems fail. This work brings together two threads that attempt to answer this question: visualizations and adversarial examples. Visualizations underscore the capabilities of existing models, whereas adversarial examples— crafted with the ingenuity of human experts— show that these models are still far from matching human prowess.
Our experiments with both neural and ir methodologies show that qa models still struggle with synthesizing clues, handling distracting information, and adapting to unfamiliar data. Our adversarially authored data set is only the first of many iterations (Ruef et al., 2016). As models improve, future adversarially authored data sets can elucidate the limitations of next-generation qa systems.
Whereas we focus on qa, our procedure is applicable to other nlp settings where there is (1) a pool of talented authors who (2) write text with specific goals. Future research can look to craft adversarially authored data sets for other nlp tasks that meet these criteria.
A Failure of Syntactically Controlled Paraphrase Networks
We apply the Syntactically Controlled Paraphrase Network (SCPN; Iyyer et al., 2018) to Quizbowl questions. The model operates on the sentence level and cannot paraphrase paragraphs. We thus feed in each sentence independently, ignoring possible breaks in coreference. The model does not correctly paraphrase most of the complex sentences present in Quizbowl questions: The paraphrases are rife with issues, such as being ungrammatical, repetitive, or missing information.
To simplify the setting, we focus on paraphrasing the shortest sentence from each question (often the final clue). The model still fails in this case. We analyze a random sample of 200 paraphrases: Only six maintained all of the original information.
Table A.1 shows common failure cases. One recurring issue is an inability to maintain the correct nes after paraphrasing. In Quizbowl, maintaining entity information is vital for ensuring question validity. We were surprised by this failure because SCPN incorporates a copy mechanism.
B Studio Ousia Quizbowl Model
The Studio Ousia system works by aggregating scores from both a neural text classification model and an ir system. Additionally, it scores answers based on their match with the correct entity type (religious leader, government agency, etc.) predicted by a neural entity type classifier. The Studio Ousia system also uses data beyond Quizbowl questions and the text of Wikipedia pages, integrating entities from a knowledge graph and customized word vectors (Yamada et al., 2018).
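We do not reproduce Studio Ousia’s exact combination here; the sketch below only illustrates the general idea of aggregating per-answer scores from the neural, ir, and entity-type components. The linear form and weights are assumptions, not the actual configuration.

```python
def aggregate_scores(neural_scores, ir_scores, type_match_scores,
                     weights=(1.0, 1.0, 1.0)):
    """Illustrative combination of per-answer scores from a neural classifier,
    an IR system, and an entity-type match. The weighted linear form here is
    an assumption made for illustration, not the Studio Ousia implementation."""
    w_neural, w_ir, w_type = weights
    candidates = set(neural_scores) | set(ir_scores) | set(type_match_scores)
    combined = {
        answer: w_neural * neural_scores.get(answer, 0.0)
                + w_ir * ir_scores.get(answer, 0.0)
                + w_type * type_match_scores.get(answer, 0.0)
        for answer in candidates
    }
    # Return the highest-scoring candidate answer.
    return max(combined, key=combined.get)
```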
C Failed Adversarial Attempt
Figure C.1 shows a user’s failed attempt to break the neural Quizbowl model.
Acknowledgments
We thank all of the Quiz Bowl players, writers, and judges who helped make this work possible, especially Ophir Lifshitz and Daniel Jensen. We also thank the anonymous reviewers and members of the UMD “Feet Thinking” group for helpful comments. Finally, we would also like to thank Sameer Singh, Matt Gardner, Pranav Goel, Sudha Rao, Pouya Pezeshkpour, Zhengli Zhao, and Saif Mohammad for their useful feedback. This work was supported by nsf grant iis-1822494. Shi Feng is partially supported by subcontract to Raytheon bbn Technologies by darpa award HR0011-15-C-0113, and Pedro Rodriguez is partially supported by nsf grant iis-1409287 (umd). Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the view of the sponsors.
Notes
The authors want normal Quizbowl questions that humans can easily answer by the very end. For popular answers (e.g., Australia or Suez Canal), writing novel final give-away clues is difficult. We thus expect models to often answer correctly by the very end of the question.
Data available at http://trickme.qanta.org.
Videos available at http://trickme.qanta.org.
The “history of shaving” is a tongue-in-cheek name for a poster displaying the hirsute leaders of Communist thought. It goes from the bearded Marx and Engels, to the mustachioed Lenin and Stalin, and finally the clean-shaven Mao.