Investigating Prior Knowledge for Challenging Chinese Machine Reading Comprehension

Machine reading comprehension tasks require a machine reader to answer questions relevant to the given document. In this paper, we present the first free-form multiple-Choice Chinese machine reading Comprehension dataset (C3), containing 13,369 documents (dialogues or more formally written mixed-genre texts) and their associated 19,577 multiple-choice free-form questions collected from Chinese-as-a-second-language examinations. We present a comprehensive analysis of the prior knowledge (i.e., linguistic, domain-specific, and general world knowledge) needed for these real-world problems. We implement rule-based and popular neural methods and find that there is still a significant performance gap between the best-performing model (68.5%) and human readers (96.0%), especially on problems that require prior knowledge. We further study the effects of distractor plausibility and data augmentation based on translated relevant datasets for English on model performance. We expect C3 to present great challenges to existing systems as answering 86.8% of questions requires both knowledge within and beyond the accompanying document, and we hope that C3 can serve as a platform to study how to leverage various kinds of prior knowledge to better understand a given written or orally oriented text. C3 is available at https://dataset.org/c3/.


Introduction
''Language is, at best, a means of directing others to construct similar thoughts from their own prior knowledge.'' Adams and Bruce (1982)

* Part of this work was conducted when K. S. was an intern at the Tencent AI Lab, Bellevue, WA.
Machine reading comprehension (MRC) tasks have attracted substantial attention from both academia and industry. These tasks require a machine reader to answer questions relevant to a given document provided as input (Poon et al., 2010;Richardson et al., 2013). In this paper, we focus on free-form multiple-choice MRC tasks: given a document, select the correct answer option from all options associated with a free-form question, which is not limited to a single question type such as cloze-style questions formed by removing a span or a sentence in a text (Hill et al., 2016;Bajgar et al., 2016;Mostafazadeh et al., 2016;Xie et al., 2018;Zheng et al., 2019) or close-ended questions that can be answered with a minimal answer (e.g., yes or no; Clark et al., 2019).
Researchers have developed a variety of free-form multiple-choice MRC datasets that contain a significant percentage of questions focusing on the implicitly expressed facts, events, opinions, or emotions in the given text (Richardson et al., 2013;Lai et al., 2017;Ostermann et al., 2018;Khashabi et al., 2018;Sun et al., 2019a). Generally, we require the integration of our own prior knowledge and the information presented in the given text to answer these questions, posing new challenges for MRC systems. However, until recently, progress in the development of techniques for addressing this kind of MRC task for Chinese has lagged behind that for English. A primary reason is that most previous work focuses on constructing MRC datasets for Chinese in which most answers are either spans (Cui et al., 2016;Cui et al., 2018a;Shao et al., 2018) or abstractive texts (He et al., 2017) merely based on the information explicitly expressed in the provided text.
With a goal of developing similarly challenging, but free-form multiple-choice datasets, and promoting the development of MRC techniques for Chinese, we introduce the first free-form multiple-Choice Chinese machine reading Comprehension dataset (C 3 ) that not only contains multiple types of questions but also requires both the information in the given document and prior knowledge to answer questions. In particular, for assessing model generalizability across different domains, C 3 includes a dialogue-based task C 3 D in which the given document is a dialogue, and a mixed-genre task C 3 M in which the given document is a mixed-genre text that is relatively formally written. All problems are collected from real-world Chinese-as-a-second-language examinations carefully designed by experts to test the reading comprehension abilities of language learners of Chinese.
We perform an in-depth analysis of what kinds of prior knowledge are needed for answering questions correctly in C 3 and two representative free-form multiple-choice MRC datasets for English (Lai et al., 2017;Sun et al., 2019a), and to what extent. We find that solving these general-domain problems requires linguistic knowledge, domain-specific knowledge, and general world knowledge, the latter of which can be further broken down into eight types such as arithmetic, connotation, cause-effect, and implication. These free-form MRC datasets exhibit similar characteristics in that (i) they contain a high percentage (e.g., 86.8% in C 3 ) of questions requiring knowledge gained from the accompanying document as well as at least one type of prior knowledge and (ii) regardless of language, dialogue-based MRC tasks tend to require more general world knowledge and less linguistic knowledge compared with tasks accompanied by relatively formally written texts. Specifically, compared with existing MRC datasets for Chinese (He et al., 2017;Cui et al., 2018b), C 3 requires more general world knowledge (57.3% of questions) to arrive at the correct answer options.
We implement rule-based and popular neural approaches to the MRC task and find that there is still a significant performance gap between the best-performing model (68.5%) and human readers (96.0%), especially on problems that require prior knowledge. We find that the existence of wrong answer options that are highly similar to the given text at a superficial level plays a critical role in increasing the difficulty of questions and the demand for prior knowledge. Furthermore, additionally introducing 94k training instances based on translated free-form multiple-choice datasets for English leads to only a 4.6% improvement in accuracy, still far from closing the gap to human performance. Our hope is that C 3 can serve as a platform for researchers interested in studying how to leverage different types of prior knowledge for in-depth text comprehension and facilitate future work on cross-lingual and multilingual machine reading comprehension.

Related Work
Traditionally, MRC tasks have been designed to be text-dependent (Richardson et al., 2013;Hermann et al., 2015): They focus on evaluating comprehension of machine readers based on a given text, typically by requiring a model to answer questions relevant to the text. This is distinguished from many question answering (QA) tasks (Fader et al., 2014;Clark et al., 2016), in which no ground truth document supporting answers is provided with each question, making them relatively less suitable for isolating improvements to MRC. We will first discuss standard MRC datasets for English, followed by MRC/QA datasets for Chinese.
English. Much of the early MRC work focuses on designing questions whose answers are spans from the given documents (Hermann et al., 2015;Hill et al., 2016;Bajgar et al., 2016;Rajpurkar et al., 2016;Trischler et al., 2017;Joshi et al., 2017). As a question and its answer are usually in the same sentence, state-of-the-art methods (Devlin et al., 2019) have surpassed human performance on many such tasks. To increase task difficulty, researchers have explored a number of options including adding unanswerable (Trischler et al., 2017;Rajpurkar et al., 2018) or conversational (Choi et al., 2018;Reddy et al., 2019) questions that might require reasoning (Zhang et al., 2018a), and designing abstractive answers (Nguyen et al., 2016;Kočiskỳ et al., 2018;Dalvi et al., 2018) or (question, answer) pairs that involve cross-sentence or cross-document content (Welbl et al., 2018;Yang et al., 2018). In general, most questions concern facts that are explicitly expressed in the text, making it possible for these tasks to measure the fundamental reading skills of machine readers.
Another line of research has studied MRC tasks, usually in a free-form multiple-choice form, containing a significant percentage of questions that focus on the understanding of the implicitly expressed facts, events, opinions, or emotions in the given text (Richardson et al., 2013;Mostafazadeh et al., 2016;Khashabi et al., 2018;Lai et al., 2017;Sun et al., 2019a). Therefore, these benchmarks may allow a relatively comprehensive evaluation of different reading skills and require a machine reader to integrate prior knowledge with information presented in a text. In particular, real-world language exams are ideal sources for constructing this kind of MRC dataset as they are designed with a similar goal of measuring different reading comprehension abilities of human language learners primarily based on a given text. Representative datasets in this category include RACE (Lai et al., 2017) and DREAM (Sun et al., 2019a), both collected from English-as-a-foreign-language exams designed for Chinese learners of English. C 3 M and C 3 D can be regarded as Chinese counterparts of RACE and DREAM, respectively, and we will discuss their similarities in detail in Section 3.3.
Chinese. Extractive MRC datasets for Chinese (Cui et al., 2016;Cui et al., 2018b;Cui et al., 2018a;Shao et al., 2018) have also been constructed, using web documents, news reports, books, and Wikipedia articles as source documents; in these datasets, all answers are spans or sentences from the given documents. Zheng et al. (2019) propose a cloze-style multiple-choice MRC dataset by replacing idioms in a document with blank symbols, where the task is to predict the correct idiom from candidate idioms that are similar in meaning. The abstractive dataset DuReader (He et al., 2017) contains questions collected from query logs, free-form answers, and a small set of relevant texts retrieved from web pages per question. In contrast, C 3 is the first free-form multiple-choice Chinese MRC dataset that contains different types of questions and requires rich prior knowledge, especially general world knowledge, for a better understanding of the given text. Furthermore, 48.4% of problems require dialogue understanding, which has not yet been studied in existing Chinese MRC tasks.
Similarly, questions in many existing multiple-choice QA datasets for Chinese (Cheng et al., 2016;Guo et al., 2017a,b;Zhang and Zhao, 2018;Zhang et al., 2018b;Hao et al., 2019) are also free-form and collected from exams. However, most of the pre-existing QA tasks for Chinese are designed to test the acquisition and exploitation of domain-specific (e.g., history, medical, and geography) knowledge rather than general reading comprehension, and the performance of QA systems partially depends on the performance of information retrieval or the relevance of external resources (e.g., corpora or knowledge bases). We compare C 3 with relevant MRC/QA datasets for Chinese and English in Table 1.

Data
In this section, we describe the construction of C 3 (Section 3.1). We also analyze the data (Section 3.2) and the types of prior knowledge needed for the MRC tasks (Section 3.3).

Collection Methodology and Task Definitions
We collect the general-domain problems from Hanyu Shuiping Kaoshi (HSK) and Minzu Hanyu Kaoshi (MHK), which are designed for evaluating the Chinese listening and reading comprehension abilities of second-language learners such as international students, overseas Chinese, and ethnic minorities. We include problems from both real and practice exams; all are freely accessible online for public usage. Each problem consists of a document and a series of questions. Each question is associated with several answer options, exactly one of which is correct. The goal is to select the correct option. According to the document type, we divide these problems into two subtasks: C 3 -Dialogue (C 3 D ), in which a dialogue serves as the document, and C 3 -Mixed (C 3 M ), in which the given non-dialogue document is of mixed genre, such as a story, a news report, a monologue, or an advertisement. We show a sample problem for each type in Tables 2 and 3, respectively.
We remove duplicate problems and randomly split the data (13,369 documents and 19,577 questions in total) at the problem level, with 60% training, 20% development, and 20% test.

Table 1: Comparison of C 3 and representative Chinese question answering and machine reading comprehension tasks. We list only one English counterpart for each Chinese dataset.
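The problem-level split described above can be sketched as follows; only the 60/20/20 ratios come from the text, while the function name, the seed, and the toy problem records are illustrative assumptions:

```python
import random

def split_problems(problems, seed=0, ratios=(0.6, 0.2, 0.2)):
    """Split a list of problems (a document plus its questions) into
    train/dev/test at the problem level, so that all questions attached
    to one document land in the same split."""
    assert abs(sum(ratios) - 1.0) < 1e-9
    rng = random.Random(seed)  # fixed seed for a reproducible split
    shuffled = problems[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * ratios[0])
    n_dev = int(n * ratios[1])
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_dev],
            shuffled[n_train + n_dev:])

# Toy usage: 100 hypothetical problem records.
train, dev, test = split_problems([{"doc": i} for i in range(100)])
```

Because the shuffle happens before slicing, every document (and its questions) appears in exactly one split.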

Data Statistics
We summarize the overall statistics of C 3 in Table 4. We observe that notable differences exist between C 3 M and C 3 D . For example, C 3 M , in which most documents are formally written texts, has a larger vocabulary size than C 3 D , whose documents are in spoken language. Sun et al. (2019a) make a similar observation that the vocabulary size is relatively small in English dialogue-based machine reading comprehension tasks. In addition, the average document length in C 3 M (180.2) is longer than that in C 3 D (76.3). In general, C 3 may not be suitable for evaluating the comprehension ability of machine readers on lengthy texts, as the average document length in C 3 is relatively short compared to that in datasets such as DuReader (He et al., 2017) (396.0) and RACE (Lai et al., 2017) (321.9).

Categories of Prior Knowledge
Previous studies on Chinese machine reading comprehension focus mainly on the linguistic knowledge required (He et al., 2017;Cui et al., 2018a). We aim instead for a more comprehensive analysis of the types of prior knowledge needed for answering questions. We carefully analyze a subset of questions randomly sampled from the development and test sets of C 3 and arrive at the following three kinds of prior knowledge required for answering questions. A question is labeled as matching if it exactly matches or nearly matches (without considering determiners, aspect particles, or conjunctive adverbs; Xia, 2000) a span in the given document; answering questions in this category seldom requires any prior knowledge.
LINGUISTIC: To answer a given question (e.g., Q 1-2 in Table 2 and Q3 in Table 3), we require lexical/syntactic knowledge including but not limited to: idioms, proverbs, negation, antonymy, synonymy, the possible meanings of a word, and syntactic transformations (Nassaji, 2006).

DOMAIN-SPECIFIC: This kind of world knowledge consists of, but is not limited to, facts about domain-specific concepts, their definitions and properties, and relations among these concepts (Grishman et al., 1983;Hansen, 1994).

GENERAL WORLD: This refers to general knowledge about how the world works, sometimes called commonsense knowledge. We focus on the sort of world knowledge that an encyclopedia would assume readers know without being told (Lenat et al., 1985;Schubert, 2002), rather than factual knowledge such as the properties of famous entities. We further break down general world knowledge into eight subtypes, some of which (marked with †) are similar to the categories summarized by LoBue and Yates (2011) for textual entailment recognition.
• Arithmetic † : This includes numerical computation and analysis (e.g., comparison and unit conversion).
• Connotation: Answering questions requires knowledge about implicit and implied sentiment towards something or somebody, emotions, and tone (Edmonds and Hirst, 2002).

Table 2: A sample problem, translated from Chinese (correct answer options are marked with ⋆).

In 1928, recommended by Hsu Chih-Mo, Hu Shih, who was the president of the previous National University of China, employed Shen Ts'ung-wen as a lecturer of the university in charge of teaching the optional course of modern literature. At that time, Shen had already made himself conspicuous in the literary world and was a little famous in society. For this reason, even before the beginning of class, the classroom was crowded with students. Upon the arrival of class, Shen went into the classroom. Seeing a dense crowd of students sitting beneath the platform, Shen was suddenly startled and his mind went blank. He was even unable to utter the first sentence he had rehearsed repeatedly.

He stood there motionless, extremely embarrassed. He wrung his hands without knowing where to put them. Before class, he had believed that he had a ready plan to meet the situation, so he did not bring his teaching plan and textbook. For up to 10 minutes, the classroom was in perfect silence. All the students were curiously waiting for the new teacher to open his mouth. Breathing deeply, he gradually calmed down. Thereupon, the materials he had previously prepared gathered in his mind for the second time. Then he began his lecture. Nevertheless, since he was still nervous, it took him less than 15 minutes to finish the teaching contents he had planned to complete in an hour.

What should he do next? He was again caught in embarrassment. He had no choice but to pick up a piece of chalk and write several words on the blackboard: This is the first time I have given a lecture. In the presence of a crowd of people, I feel terrified.

Immediately, a peal of friendly laughter filled the classroom. Presently, a round of encouraging applause was given to him. Hearing of this episode, Hu heaped praise upon Shen, thinking that he was very successful. Because of this experience, Shen always reminded himself not to be nervous in class for years afterwards. Gradually, he began to lecture at ease in class.

Q1
A. the light in the classroom was dim.
B. the number of students attending his lecture was large. ⋆
C. the room was noisy.
D. the students were active in voicing their opinions.

Q2 Shen did not bring the textbook because he felt that
A. the teaching contents were not many.
B. his preparation was sufficient. ⋆
C. his mental pressure could be reduced in this way.
D. the textbook was likely to restrict his ability to give a lecture.

Q3 Seeing the sentence written by Shen, the students
A.
B. blamed him in mind.
C. were greatly encouraged.
D. expressed their understanding and encouraged him. ⋆

Q4 The passage above is mainly about
A. the development of the Chinese educational system.
B. how to make self-adjustment if one is nervous.
C. the situation where Shen gave his lecture for the first time. ⋆
D. how Shen turned into a teacher from a writer.

• Cause-effect † : The occurrence of event A causes the occurrence of event B. We usually need this kind of knowledge to solve ''why'' questions when a causal explanation is not explicitly expressed in the given document.
• Implication: This category refers to the main points, suggestions, opinions, facts, or event predictions that are not expressed explicitly in the text and that cannot be reached by paraphrasing sentences using linguistic knowledge. For example, Q4 in Table 2 and Q2 in Table 3 belong to this category.
• Part-whole: We require knowledge that object A is a part of object B. Relations such as member-of, stuff-of, and component-of between two objects also fall into this category (Winston et al., 1987;Miller, 1998). For example, we require implication mentioned above as well as part-whole knowledge (i.e., ''teacher'' is a kind of job) to summarize the main topic of the following.

• Scenario: We require knowledge about observable behaviors or activities of humans and their corresponding temporal/locational information. We also need knowledge about the personal information (e.g., profession, education level, personality, and mental or physical status) of the involved participants and the relations between them, implicitly indicated by the behaviors or activities described in texts. For example, we put Q3 in Table 2 in this category as ''friendly laughter'' may express ''understanding''. Q1 in Table 3 about the relation between the two speakers also belongs to this category.
• Precondition † : If event A had not happened, event B would not have happened (Ikuta et al., 2014;O'Gorman et al., 2016). Event A is usually mentioned in either the question or the correct answer option(s). For example, ''I went to a supermarket'' is a necessary precondition for ''I was shopping at a supermarket when my friend visited me''.
• Other: Knowledge that belongs to none of the above subcategories.
Two annotators (authors of this paper) annotate the type(s) of required knowledge for each question over 600 instances. To explore the differences and similarities in the required knowledge types between C 3 and existing free-form MRC datasets, following the same annotation schema, we also annotate instances from the largest Chinese free-form abstractive MRC dataset DuReader (He et al., 2017) and the free-form multiple-choice English MRC datasets RACE (Lai et al., 2017) and DREAM (Sun et al., 2019a), which can be regarded as the English counterparts of C 3 M and C 3 D , respectively. We also divide questions into one of three types (single, multiple, or independent) based on the minimum number of sentences in the document that explicitly or implicitly support the correct answer option. We regard a question as independent if it is context-independent, which usually requires prior vocabulary or domain-specific knowledge. The Cohen's kappa coefficient is 0.62.

C 3 M vs. C 3 D . As shown in Table 5, compared with the dialogue-based task (C 3 D ), C 3 M with non-dialogue texts as documents requires more linguistic knowledge (49.0% vs. 30.7%) yet less general world knowledge (50.7% vs. 64.0%). As many as 24.3% of questions in C 3 D need scenario knowledge, perhaps due to the fact that speakers in a dialogue (especially face-to-face) may not explicitly mention information that they assume others already know, such as personal information, the relationship between the speakers, and temporal and locational information. Interestingly, we observe a similar phenomenon when we compare the English datasets DREAM (dialogue-based) and RACE. Therefore, it is likely that dialogue-based free-form tasks can serve as ideal platforms for studying how to improve language understanding with general world knowledge, regardless of language.

C 3 vs. its English counterparts. We are also interested in whether answering a specific type of question may require similar types of prior knowledge across languages.
For example, C 3 D and its English counterpart DREAM (Sun et al., 2019a) have similar problem formats, document types, and data collection methodologies (from Chinese-as-a-second-language and English-as-a-foreign-language exams, respectively). We notice that the knowledge type distributions of the two datasets are indeed very similar. Therefore, C 3 may facilitate future cross-lingual MRC studies.

Table 5: Types of prior knowledge needed in C 3 , DuReader (He et al., 2017), and the English free-form multiple-choice datasets RACE (Lai et al., 2017) and DREAM (Sun et al., 2019a). Answering a question may require more than one type of prior knowledge.

C 3 vs. DuReader. The 150 annotated instances of DuReader also exhibit properties similar to those identified in studies of abstractive MRC for English (Nguyen et al., 2016;Kočiskỳ et al., 2018;Reddy et al., 2019). Namely, crowd workers asked to write answers in their own words tend instead to write an extractive summary by copying short textual snippets or whole sentences from the given documents; this may explain why models designed for extractive MRC tasks achieve reasonable performance on abstractive tasks. We notice that questions in DuReader seldom require general world knowledge, possibly because users seldom ask questions about facts obvious to most people. On the other hand, as many as 16.7% of (question, answer) pairs in DuReader cannot be supported by the given text (vs. 1.3% in C 3 ); in most cases, they require prior knowledge about a particular domain (e.g., ''On which website can I watch The Glory of Tang Dynasty?'' and ''How to start a clothing store?''). In comparison, a larger fraction of C 3 requires linguistic knowledge or general world knowledge.

Approaches
We implement a classical rule-based method and recent state-of-the-art neural models.

Distance-Based Sliding Window
We implement Distance-based Sliding Window (Richardson et al., 2013), a rule-based method that chooses the answer option by taking into account (1) lexical similarity between a statement (i.e., a question and an answer option) and the given document with a fixed window size and (2) the minimum number of tokens between occurrences of the question and occurrences of an answer option in the document. This method assumes that a statement is more likely to be correct if there is a shorter distance between tokens within a statement, and more informative tokens in the statement appear in the document.
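A minimal sketch of such a distance-based sliding-window baseline follows. The inverse-count token weighting and the exact normalization are assumptions on our part (the scoring in Richardson et al. (2013) differs in details), and the toy tokens are hypothetical:

```python
import math
from collections import Counter

def sliding_window_score(doc_tokens, statement_tokens, window=None):
    """Slide a window over the document and sum the inverse-count
    weights of statement tokens found inside the window; rarer
    document tokens count as more informative."""
    window = window or len(statement_tokens)
    counts = Counter(doc_tokens)

    def ic(t):
        # Counter returns 0 for unseen tokens, so guard the division.
        return math.log(1.0 + 1.0 / counts[t]) if counts[t] else 0.0

    target = set(statement_tokens)
    best = 0.0
    for start in range(max(1, len(doc_tokens) - window + 1)):
        span = doc_tokens[start:start + window]
        best = max(best, sum(ic(t) for t in span if t in target))
    return best

def min_distance(doc_tokens, q_tokens, o_tokens):
    """Minimum token distance in the document between any occurrence
    of a question token and any occurrence of an option token."""
    q_pos = [i for i, t in enumerate(doc_tokens) if t in set(q_tokens)]
    o_pos = [i for i, t in enumerate(doc_tokens) if t in set(o_tokens)]
    if not q_pos or not o_pos:
        return len(doc_tokens)  # worst case: no lexical overlap
    return min(abs(i - j) for i in q_pos for j in o_pos)

def choose_option(doc, question, options):
    """Pick the option whose statement (question + option) maximizes
    window score minus the length-normalized minimum distance."""
    scores = []
    for opt in options:
        stmt = question + opt
        s = sliding_window_score(doc, stmt)
        d = min_distance(doc, question, opt) / max(1, len(doc) - 1)
        scores.append(s - d)
    return max(range(len(options)), key=lambda i: scores[i])
```

For Chinese, each character would be treated as a token, matching the pre-processing described later in the experimental settings.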

Co-Matching
We utilize Co-Matching (Wang et al., 2018), a Bi-LSTM-based model for multiple-choice MRC tasks for English. It explicitly treats a question and one of its associated answer options as two sequences and jointly models whether or not the given document matches them. We modify the pre-processing step and adapt this model to MRC tasks for Chinese (Section 5.1).

Fine-Tuning Pre-Trained Language Models
We also apply the framework of fine-tuning a pre-trained language model on machine reading comprehension tasks (Radford et al., 2018). We consider the following four pre-trained language models for Chinese: Chinese BERT-Base (denoted as BERT) (Devlin et al., 2019), Chinese ERNIE-Base (denoted as ERNIE) (Sun et al., 2019b), and Chinese BERT-Base with whole word masking during pre-training (denoted as BERT-wwm) (Cui et al., 2019) and its enhanced version pre-trained over larger corpora (denoted as BERT-wwm-ext). These models have the same number of layers, hidden units, and attention heads. Given document d, question q, and answer option o i , we construct the input sequence by concatenating [CLS], tokens in d, [SEP], tokens in q, [SEP], tokens in o i , and [SEP], where [CLS] and [SEP] are the classifier token and sentence separator in a pre-trained language model, respectively. We add an embedding vector t 1 to each token before the first [SEP] (inclusive) and an embedding vector t 2 to every other token, where t 1 and t 2 are learned during language model pre-training for discriminating sequences. We denote the final hidden state for the first token in the input sequence as S i ∈ R 1×H , where H is the hidden size. We introduce a classification layer W ∈ R 1×H and obtain the unnormalized log probability P i ∈ R of o i being correct by P i = S i W T . We obtain the final prediction for q by applying a softmax layer over the unnormalized log probabilities of all options associated with q.
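The input construction and option scoring described above can be sketched as follows. The helper names are ours, and the real implementation operates on WordPiece ids fed to a pre-trained encoder; here placeholder logits stand in for the per-option scores P i produced by the classification layer:

```python
import math

CLS, SEP = "[CLS]", "[SEP]"

def build_input(doc_tokens, q_tokens, opt_tokens):
    """Concatenate [CLS] d [SEP] q [SEP] o_i [SEP] and assign segment
    id 0 up to and including the first [SEP], and 1 afterwards,
    mirroring the two learned segment embeddings t_1 and t_2."""
    tokens = [CLS] + doc_tokens + [SEP] + q_tokens + [SEP] + opt_tokens + [SEP]
    boundary = len(doc_tokens) + 2  # [CLS] + document + first [SEP]
    segment_ids = [0] * boundary + [1] * (len(tokens) - boundary)
    return tokens, segment_ids

def predict(option_logits):
    """Numerically stable softmax over the unnormalized log
    probabilities of all options of one question; returns the
    probabilities and the index of the predicted option."""
    m = max(option_logits)
    exps = [math.exp(x - m) for x in option_logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return probs, probs.index(max(probs))
```

In the full model, each option yields one input sequence, the final hidden state of [CLS] is projected to a scalar, and `predict` is applied across the scalars of all options of the question.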

Experimental Settings
We use C 3 M and C 3 D together to train a neural model and perform testing on them separately, following the default setting on RACE that also contains two subsets (Lai et al., 2017). We run every experiment five times with different random seeds and report the best development set performance and its corresponding test set performance.
Distance-Based Sliding Window. We simply treat each character as a token. We do not use Chinese word segmentation as it results in drops in performance based on our experiment.
Co-Matching. We replace the English tokenizer with a Chinese word segmenter in HanLP. 1 We use the 300-dimensional Chinese word embeddings released by Li et al. (2018).

Fine-Tuning Pre-Trained Language Models.
We set the learning rate, batch size, and maximal sequence length to 2 × 10 −5 , 24, and 512, respectively. We truncate the longest sequence among d, q, and o i (Section 4.3) when an input sequence exceeds the length limit of 512. For all experiments, we fine-tune a model on C 3 for eight epochs. We keep the default values for the other hyperparameters (Devlin et al., 2019).
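One way to realize this truncation rule is to repeatedly trim one token from the end of whichever of the three segments is currently longest until the full sequence fits; the exact trimming procedure is an assumption on our part:

```python
def truncate_to_limit(d, q, o, max_len=512, n_special=4):
    """Trim the longest of document d, question q, and option o until
    [CLS] d [SEP] q [SEP] o [SEP] (n_special = 4 special tokens)
    fits within max_len tokens."""
    d, q, o = list(d), list(q), list(o)
    while len(d) + len(q) + len(o) + n_special > max_len:
        longest = max((d, q, o), key=len)
        longest.pop()  # drop one token from the end of the longest
    return d, q, o
```

Trimming one token at a time keeps the three segments as balanced as possible when several of them are long.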
Results

As shown in Table 6, BERT-wwm-ext performs best compared with the other three pre-trained language models, though there still exists a large gap (27.5%) between this method and human performance (96.0%). We also report the performance of Co-Matching, BERT, BERT-wwm-ext, and human readers on different question categories based on the annotated development sets (Table 7), which consist of 150 questions in C 3 M and 150 questions in C 3 D . These models generally perform worse on questions that require prior knowledge or reasoning over multiple sentences than on questions that can be answered by surface matching or that only need information from a single sentence (Section 3.3).

Discussions on Distractor Plausibility
We look into incorrect predictions of Co-Matching, BERT, and BERT-wwm-ext on the development set. We observe that the existence of plausible distractors may play a critical role in raising the difficulty level of questions for models. We regard a distractor (i.e., wrong answer option) as plausible if it, compared with the correct answer option, is more superficially similar to the given document. Two typical cases include (1) the information in the distractor is accurate based on the document but does not (fully) answer the question, and (2) the distractor distorts, oversimplifies, exaggerates, or misinterprets the information in the document.
Given document d, the correct answer option c, and wrong answer options {w 1 , w 2 , . . . , w i , . . . , w n } associated with a certain question, we measure the distractor plausibility of distractor w i by:

γ i = S(w i , d) − S(c, d)    (1)

where S(x, y) is a normalized similarity score between 0 and 1 that measures the edit distance needed to change x into a substring of y using single-character edits (insertions, deletions, or substitutions). In particular, if x is a substring of y, S(x, y) = 1; if x shares no character with y, S(x, y) = 0. By definition, S(w i , d) in Equation (1) measures the lexical similarity between distractor w i and d; S(c, d) measures the similarity between the correct answer option c and d.
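A sketch of S(x, y) and the resulting plausibility score γ i follows. The substring edit distance is implemented with a standard dynamic program whose first row is all zeros, so the match may start anywhere in y; this particular DP is our reading of the definition above:

```python
def substring_edit_distance(x, y):
    """Minimum number of single-character edits (insert, delete,
    substitute) needed to turn x into some substring of y: the first
    DP row is zero so the alignment can start anywhere in y, and we
    take the minimum over all end positions."""
    prev = [0] * (len(y) + 1)
    for i, cx in enumerate(x, 1):
        curr = [i] + [0] * len(y)
        for j, cy in enumerate(y, 1):
            curr[j] = min(prev[j] + 1,                 # delete from x
                          curr[j - 1] + 1,             # insert into x
                          prev[j - 1] + (cx != cy))    # substitute
        prev = curr
    return min(prev)

def S(x, y):
    """Normalized similarity in [0, 1]; 1 when x is a substring of y,
    0 when x shares no character with y."""
    if not x:
        return 1.0
    return 1.0 - substring_edit_distance(x, y) / len(x)

def plausibility(distractor, correct, doc):
    """gamma_i = S(w_i, d) - S(c, d), ranging over [-1, 1]."""
    return S(distractor, doc) - S(correct, doc)
```

Grouping questions by the largest γ i over their distractors then reproduces the difficulty buckets analyzed below.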
To quantitatively investigate the impact of the existence of plausible distractors on model performance, we group questions from the development set of C 3 by the largest distractor plausibility (i.e., max i γ i ), in the range of [−1, 1], for each question and compare the performance of Co-Matching, BERT, and BERT-wwm-ext in different groups.
As shown in Figure 1(a), the largest distractor plausibility may serve as an indicator of the difficulty level of questions presented to the investigated models. When the largest distractor plausibility is smaller than −0.8, all three models exhibit strong performance (≥ 90%). As the largest distractor plausibility increases, the performance of all models consistently drops. All models perform worse than average on questions having at least one highly plausible distractor (e.g., distractor plausibility > 0). Compared with BERT, the gain of the best-performing model (i.e., BERT-wwm-ext) mainly comes from its superior performance on these ''difficult'' questions.
Further, we find that distractor plausibility is strongly correlated with the need for prior knowledge when answering questions in C 3 based on the annotated instances, as shown in Figure 1(b). For further analysis, we group annotated instances by different max i S(w i , d) and S(c, d) (in Equation (1)) and separately compare their need for linguistic knowledge and general world knowledge. As shown in Figure 2, general world knowledge is crucial for question answering when the correct answer option is not mentioned explicitly in the document (i.e., S(c, d) is relatively small). In contrast, we tend to require linguistic knowledge when both the correct answer option and the most confusing distractor (i.e., the one with the largest distractor plausibility) are very similar to the given document.

Discussions on Data Augmentation
To extrapolate to what extent we can improve the performance of current models with more training data, we plot the development set performance of BERT-wwm-ext trained on different portions of the training data of C 3 . As shown in Figure 3, the accuracy grows roughly linearly with the logarithm of the size of training data, and we observe a substantial gap between human performance and the expected BERT-wwm-ext performance, even assuming that 10 5 training instances are available, leaving much room for improvement.
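The log-linear trend above can be quantified with an ordinary least-squares fit of accuracy against log10 of the training-set size; the fit below is a generic sketch, and any sizes or accuracies plugged into it are illustrative values rather than numbers from the paper:

```python
import math

def fit_log_linear(sizes, accuracies):
    """Least-squares fit of accuracy = a * log10(n) + b, matching the
    observation that accuracy grows roughly linearly in log(n)."""
    xs = [math.log10(n) for n in sizes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(accuracies) / n
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracies))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - a * mean_x
    return a, b

def extrapolate(a, b, n):
    """Predicted accuracy at a hypothetical training size n."""
    return a * math.log10(n) + b
```

Extrapolating such a fit to a large hypothetical training size is exactly the exercise used above to argue that more in-domain data alone would not close the gap to human performance.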
Furthermore, as the knowledge type distributions of C 3 and its English counterparts RACE and DREAM are highly similar (Section 3.3), we translate RACE and DREAM from English to Chinese with Google Translate and plot the performance of BERT-wwm-ext trained on C 3 plus different numbers of translated instances. The learning curve is also roughly linear with the logarithm of the number of training instances from translated RACE and DREAM, but with a lower growth rate. Even augmenting the training data with all 94k translated instances only leads to a 4.6% improvement (from 67.8% to 72.4%) in accuracy on the development set of C 3 . From another perspective, BERT-wwm-ext trained on all translated instances without using any data in C 3 only achieves an accuracy of 67.1% on the development set of C 3 , slightly worse than 67.8% achieved when only the training data in C 3 is used, whose size is roughly 1/8 of that of the translated instances. These observations suggest a need to better leverage large-scale English resources from similar MRC tasks.
Besides augmenting the training data with translated instances, we also attempt to fine-tune a pre-trained multilingual BERT-Base released by Devlin et al. (2019) on the training data of C 3 and all original training instances in English from RACE and DREAM. However, the accuracy on the development set of C 3 is 63.4%, which is even lower than the performance (65.7% in Table 6) of fine-tuning Chinese BERT-Base only on C 3 .

Conclusion
We present the first free-form multiple-choice Chinese machine reading comprehension dataset (C 3 ), collected from real-world language exams, requiring linguistic, domain-specific, or general world knowledge to answer questions based on the given written or orally oriented texts. We study the prior knowledge needed in this challenging machine reading comprehension dataset and carefully investigate the impacts of distractor plausibility and data augmentation (based on similar resources for English) on the performance of state-of-the-art neural models. Experimental results demonstrate that there is still a significant performance gap between the best-performing model (68.5%) and human readers (96.0%), and that there is a need for better ways of exploiting rich resources in other languages.