Chinese Idiom Paraphrasing

Idioms are a kind of idiomatic expression in Chinese, most of which consist of four Chinese characters. Due to their non-compositionality and metaphorical meaning, Chinese idioms are difficult for children and non-native speakers to understand. This study proposes a novel task, denoted as Chinese Idiom Paraphrasing (CIP). CIP aims to rephrase idiom-included sentences into non-idiomatic ones while preserving the original sentence's meaning. Since sentences without idioms are more easily handled by Chinese NLP systems, CIP can be used to pre-process Chinese datasets, thereby facilitating and improving the performance of Chinese NLP tasks, e.g., machine translation, Chinese idiom cloze, and Chinese idiom embeddings. In this study, the CIP task is treated as a special paraphrase generation task. To circumvent the difficulty of acquiring annotations, we first establish a large-scale CIP dataset based on human and machine collaboration, which consists of 115,530 sentence pairs. We further deploy three baselines and two novel CIP approaches to deal with CIP problems. The results show that the proposed methods perform better than the baselines on the established CIP dataset.


INTRODUCTION
An idiom, called "成语" (ChengYu) in Chinese, is widely used in daily communication and various literary genres. Idioms are compact Chinese expressions that consist of few words but imply relatively complex social nuances. Moreover, Chinese idioms are often used to describe similar phenomena, events, etc., which means the idioms cannot be interpreted by their literal meanings in some cases. Thus, it has always been a challenge for non-native speakers, and even native speakers, to recognize Chinese idioms [1]. For instance, the idiom "亡羊补牢" (WangYangBuLao) shown in Table 1 means "It is never too late to try", instead of its literal meaning, "to mend the fence after the sheep are lost".
In real life, if some people do not understand the meaning of an idiom, we have to explain it by converting it into a set of word segments that provide a more intuitive and understandable paraphrase. In this study, we employ computational approaches to automatically rephrase idiom-included sentences into simpler sentences (i.e., non-idiom-included sentences) that preserve context-based paraphrasing, thereby benefiting both Chinese natural language processing and societal applications.
Since idioms are an obstacle for many NLP tasks, CIP can be used as a pre-processing phase that facilitates and improves the performance of machine translation systems [2], [3], Chinese idiom cloze [1], [4], and Chinese idiom embeddings [5]. Furthermore, CIP-based applications can help specific groups, such as children, non-native speakers, and people with cognitive disabilities, to improve their reading comprehension abilities.
We propose a new task in this study, denoted as Chinese Idiom Paraphrasing (CIP), which aims to rephrase idiom-included sentences into fluent, intuitive, and meaning-preserving non-idiom-included sentences. We can treat the CIP task as a special paraphrase generation task. The general paraphrase generation task aims to rephrase a given sentence into another one that possesses identical semantics but different lexicons or syntax [6], [7]. Similarly, CIP emphasizes rephrasing the idioms of input sentences into word segments that provide a more intuitive and understandable paraphrase. In recent decades, many researchers devoted to paraphrase generation [8], [9] have struggled due to the lack of reliable supervised datasets [10]. Motivated by this challenge, we establish a large-scale training dataset for the CIP task in this work.

TABLE 1
Given a Chinese idiom-included sentence, we aim to output a fluent, intuitive, and meaning-preserving non-idiom-included sentence. In the example, the idiom is marked in red.
Input: 虽然你已经犯下了错误,但是亡羊补牢也为时不晚。(Although you have made a mistake, it's not too late to mend it.)
Output: 虽然你已经犯下了错误,但是现在改正也为时不晚。
Contributions. This study produces two main contributions toward the development of CIP systems.
First, a large-scale benchmark is established for the CIP task. The benchmark comprises 115,530 sentence pairs covering 8,421 idioms. A recurrent challenge in crowdsourcing NLP-oriented datasets at scale is that human writers frequently rely on repetitive patterns to fabricate examples, leading to a lack of linguistic diversity [11]. A new large-scale CIP dataset is created in this study by taking advantage of collaboration between humans and machines.
In detail, we initially divide a large-scale Chinese-English machine translation corpus into two parts (an idiom-included sub-corpus and a non-idiom-included sub-corpus) by judging whether a Chinese sentence contains idioms. Next, we train an English-to-Chinese machine translation (MT) system using the non-idiom-included sub-corpus. Because the training corpus for the MT system does not include any idioms, the MT system will not translate input English sentences into idiom-included Chinese sentences. Then, the MT system is deployed to translate the English sentences of the idiom-included sub-corpus into non-idiom-included sentences. A large-scale pseudo-parallel CIP dataset can be constructed by pairing the idiom-included sentences of the idiom-included sub-corpus with the translated non-idiom-included sentences. Finally, we employ native speakers to validate the generated sentences and modify defective sentences when necessary.
Second, we deploy five methods to rephrase input idiom-included sentences. Since the constructed dataset is used as the training dataset, we treat the CIP task as a paraphrase generation task. (i, ii, iii) We adopt three different sequence-to-sequence (Seq2Seq) methods as baselines: an LSTM-based approach, a Transformer-based approach, and an mT5-based approach, where mT5 is a massively multilingual pre-trained text-to-text Transformer [12]. (iv) People have long used dictionaries to deal with CIP problems, but dictionaries cannot be directly incorporated into Seq2Seq methods. We propose a new method that allows networks to "attach" interpretations to idioms, which means the networks can learn the correlations between interpretations and idioms. (v) mT5 is pre-trained with a span masked language modeling objective, where consecutive spans of input tokens are replaced with a mask token and the trained model reconstructs the masked-out tokens. The CIP problem can be handled by span masked language modeling when the idioms of the sentences are masked. Specifically, a CIP sentence pair can be processed into a (corrupted) input sentence by masking the idioms of the source sentence, with the corresponding target extracted from the simplified sentence. The mT5-based CIP method is fine-tuned to reconstruct the corresponding target. Experimental results show that the methods evaluated on the constructed CIP dataset can output high-quality paraphrases that are grammatically correct and semantically appropriate.
The constructed dataset and the implemented methods used to accelerate this research are open-sourced on GitHub1.

RELATED WORKS
Paraphrase Generation. Paraphrase generation aims to produce paraphrases of given sentences. The generated paraphrases preserve the original meaning of the sentence, but are assembled with different words or syntactic structures [8], [9], [13].
Most recent neural paraphrase generation methods primarily take advantage of the sequence-to-sequence framework, which achieves inspiring performance improvements compared with traditional approaches [13]. A long-standing issue in paraphrase generation studies is the lack of reliable supervised datasets. The issue can be avoided by constructing manually annotated paired-paraphrase datasets [6] or designing unsupervised paraphrase generation methods [10]. Differing from existing paraphrase generation research, we turn our attention to Chinese idiom paraphrasing, which rephrases idiom-included sentences into non-idiom-included ones.
Chinese Idiom Understanding. Idioms are an interesting linguistic phenomenon in the Chinese language. Compared with other types of words, most idioms are unique in their non-compositionality and metaphorical meaning. Idiom understanding plays an important role in the research area of Chinese language understanding. Much research related to Chinese idiom understanding has been pushed forward that can benefit a variety of related downstream tasks. For example, Shao et al. [3] focused on evaluating the quality of idiom translation in machine translation systems. Zheng et al. [1] provided a benchmark to assess the abilities of multiple models on Chinese idiom-based cloze tests, and evaluated how well the models can comprehend Chinese idiom-included texts. Liu et al. [14] studied how to improve essay writing skills by recommending Chinese idioms. Tan et al. [5] investigated the learning and quality evaluation of Chinese idiom embeddings. In this paper, we study a novel CIP task that differs from the above tasks. Since the proposed CIP methods can rephrase idiom-included sentences into non-idiom-included ones, CIP is expected to benefit tasks related to idiom representation and idiom translation.

1. https://www.github.com/jpqiang/Chinese-Idiom-Paraphrasing
Other related tasks. Pershina et al. [15] studied a new task of English idiom paraphrasing that aims to determine whether two idioms have alike or similar meanings. They collected idioms' definitions from a dictionary and utilized word embedding models to represent idioms and calculate the similarity of two idioms. Qiang et al. [16] proposed a Chinese lexical simplification method, which focuses on replacing complex words in given sentences with simpler, meaning-equivalent alternatives. It is noteworthy that the substitutes in Chinese lexical simplification are all single words, but an idiom typically cannot be substituted by a single word that expresses the original concept or idea.

HUMAN AND MACHINE COLLABORATIVE DATASET CONSTRUCTION
This section describes the process of constructing a large-scale parallel dataset for Chinese idiom paraphrasing (CIP). A qualified CIP dataset needs to meet the following two requirements: (1) the two sentences in a sentence pair have to convey the same meaning; (2) a sentence pair has to contain an idiom-included sentence and a non-idiom-included one. We outline a three-stage pipeline for dataset construction, which takes advantage of both the generative strength of machine translation (MT) methods and the evaluative strength of human annotators. Human annotators are generally reliable in correcting examples, but crafting diverse and creative examples at scale is challenging for them. Therefore, we deploy a machine translator to automatically create an initial CIP dataset, and then ask annotators to proofread each generated sample.

Pipeline
Figure 1 exhibits the details of the pipeline. Our pipeline starts with an existing English-Chinese machine translation dataset denoted as D. Firstly, we use a collected Chinese idiom list I to split the MT dataset D into two parts: a non-idiom-included sub-dataset D1 and an idiom-included sub-dataset D2 (Stage 1). All the data items in both D1 and D2 are in the form of sentence pairs. Then, we train a neural machine translation system M using D1, which can translate English sentences into non-idiom-included Chinese sentences. Next, we input the English sentences in D2 to M to output non-idiom-included Chinese sentences. Afterwards, the Chinese sentences in D2 and the generated sentences are mated as pairs to construct a large-scale initial parallel CIP dataset (Stage 2). Finally, the constructed dataset is reviewed and revised by annotators for quality assurance (Stage 3).
Stage 1: Corpus Segmentation. The English-Chinese MT dataset D applied in this research is taken from WMT18 [17], which contains 24,752,392 sentence pairs. We collect a Chinese idiom list I that embraces 31,114 idioms. Since the list enables determining whether the Chinese sentence in a pair contains idioms, D can be split into D1 and D2. The sub-dataset D1 is used to train a special MT system M that translates English sentences into non-idiom-included Chinese sentences. In our experiments, only 0.2% of the translated Chinese sentences contain idioms (see Table 6). After cleansing redundant Chinese sentences, the number of sentence pairs in D2 is 105,530.
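The segmentation step above can be sketched in a few lines of Python. This is a minimal illustration under assumed data layouts (a list of (Chinese, English) pairs and an in-memory idiom list); the function name is ours, not the paper's.

```python
def split_by_idioms(corpus, idiom_list):
    """Split an English-Chinese MT corpus into a non-idiom sub-corpus D1 and
    an idiom-included sub-corpus D2, by substring matching against an idiom list."""
    d1, d2 = [], []  # D1: no idioms, D2: at least one idiom
    for zh, en in corpus:
        if any(idiom in zh for idiom in idiom_list):
            d2.append((zh, en))
        else:
            d1.append((zh, en))
    return d1, d2

corpus = [
    ("虽然你已经犯下了错误,但是亡羊补牢也为时不晚。", "It's not too late to mend it."),
    ("约翰并不有钱。", "John is not rich."),
]
d1, d2 = split_by_idioms(corpus, ["亡羊补牢", "深居简出"])
print(len(d1), len(d2))  # → 1 1
```

A production version would stream the 24.7M pairs from disk instead of holding them in memory, but the membership test is the same.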
Stage 2: Pseudo-CIP Dataset. Given a sentence pair (c_i, e_i) in D2, we input the English sentence e_i into the MT system M, which outputs a Chinese translation t_i. We pair the Chinese sentence c_i and the Chinese translation t_i as a pseudo-CIP sentence pair. Thus, a CIP dataset can be built by pairing the original Chinese sentences with the corresponding English-to-Chinese translations in D2. The pseudo-CIP dataset meets the two requirements of CIP dataset construction. On one hand, the pseudo-CIP data comes from an MT dataset, which guarantees that the paired sentences deliver the same meaning. On the other hand, all original sentences include one or more idioms, and none of the translated sentences contain idioms.
Stage 3: Human Review. In the final stage of the pipeline, we recruit five human annotators to review each sentence pair (c_i, t_i) in the pseudo-CIP dataset. The annotators are all undergraduates and native Chinese speakers. Given (c_i, t_i), annotators are asked to revise and improve the quality of t_i; t_i is required to be non-idiom-included and fully meaning-preserving.

Corpus Statistics
The dataset D2 is treated as in-domain data, which contains 105,530 examples including 8,243 different idioms. D2 is partitioned into three parts: a training set (Train), a development set (Dev), and a test set (Test). We sort the examples by the frequencies of their idioms in the corpus from high to low. We then choose two samples for each of the 5,000 most frequent idioms, and put one sample into Dev and the other into Test.
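The frequency-based split described above can be sketched as follows. The example representation ((idiom, sentence_pair) tuples) and the function name are illustrative assumptions; the paper does not publish its exact splitting script.

```python
from collections import Counter

def split_train_dev_test(examples, k=5000):
    """For each of the k most frequent idioms, send one example to Dev and
    one to Test; all remaining examples go to Train."""
    freq = Counter(idiom for idiom, _ in examples)
    top = {idiom for idiom, _ in freq.most_common(k)}
    train, dev, test, seen = [], [], [], Counter()
    for idiom, pair in examples:
        if idiom in top and seen[idiom] < 2:
            (dev if seen[idiom] == 0 else test).append((idiom, pair))
            seen[idiom] += 1
        else:
            train.append((idiom, pair))
    return train, dev, test

examples = [("亡羊补牢", 1), ("亡羊补牢", 2), ("亡羊补牢", 3), ("深居简出", 4)]
train, dev, test = split_train_dev_test(examples, k=1)
print(len(train), len(dev), len(test))  # → 2 1 1
```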
We observe that the Train and Test datasets come from the same distribution. However, when models are deployed in real-world applications, inference might be performed on data from different distributions, i.e., out-of-domain data [18]. Therefore, we additionally collected 9,970 sentences with idioms from modern vernacular classics, including prose and fiction, as out-of-domain data to assess the generalization ability of CIP methods. Unlike the MT corpus, these sentences have no English sentences as references, so we manually rewrite them into non-idiom-included sentences with the help of native Chinese speakers. The statistical details of the CIP dataset are shown in Table 2.
There are three significant differences between the in-domain and out-of-domain data. First, the average length of sentences in the in-domain data is around 35 words, while it is about 45 words for the out-of-domain data. Second, the average number of idioms per sentence in the in-domain data is 1.07, which is lower than that of the out-of-domain data (1.17). Third, the sentence pairs in the out-of-domain data need fewer modifications than those in the in-domain data. In this case, a lack of linguistic diversity might occur because human annotators often rely on repetitive patterns to generate sentences.
We present some examples of the idiom "深居简出" (reclusive) from the CIP dataset in Table 3. The idiom "深居简出" can be rephrased with different descriptions, which displays linguistic diversity.

METHODS
In this section, we introduce three baseline methods and two proposed CIP methods for the CIP task. The performances of the five methods are compared with each other on rephrasing idiom-included sentences.

Problem Formulation
The CIP task can be defined as follows. Given a source sentence c = {c_1, ..., c_j, ..., c_J} with one or more idioms, we intend to produce a target sentence t = {t_1, ..., t_i, ..., t_I}. More specifically, t is expected to be non-idiom-included and meaning-preserving, where c_j and t_i refer to Chinese characters. In this study, we design supervised methods to approach this monolingual machine translation task. We adopt a Sequence-to-Sequence (Seq2Seq) framework that directly models the probability of the character-by-character translation from source sentences to target ones [19], where the probability is calculated using the following equation:

P(t | c) = ∏_{i=1}^{I} P(t_i | t_{<i}, c),

where t_{<i} = t_1, ..., t_{i−1}.
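The autoregressive factorization in the problem formulation can be illustrated with a toy computation: the log-probability of a target sentence is the sum of per-character log-probabilities conditioned on the source and the previously generated characters. Here `char_prob` is a dummy stand-in for a trained Seq2Seq model, not any real API.

```python
import math

def sequence_log_prob(source, target, char_prob):
    """log P(t|c) = sum_i log P(t_i | t_<i, c), with char_prob a callable
    (source, prefix, next_char) -> probability."""
    return sum(math.log(char_prob(source, target[:i], target[i]))
               for i in range(len(target)))

uniform = lambda src, prefix, ch: 0.5  # dummy model: every character gets p = 0.5
lp = sequence_log_prob("源句", "目标", uniform)  # 2 characters → 2 * ln(0.5)
print(round(lp, 4))  # → -1.3863
```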

Seq2Seq method
In this research, we adopt three Seq2Seq methods to handle the CIP task: LSTM-based, Transformer-based, and mT5-based models.
(1) The LSTM-based Seq2Seq method is a basic Seq2Seq method, which uses an LSTM (Long Short-Term Memory [20]) network to convert a sentence into a dense, fixed-length vector representation. In contrast to a vanilla RNN, an LSTM is helpful for dealing with long sequences, but it still fails to maintain the global information of the sequence.
(2) The Transformer-based Seq2Seq method [21] is a state-of-the-art Seq2Seq method that has been widely adopted for various NLP tasks, such as machine translation and abstractive summarization. The Transformer applies a self-attention mechanism that directly models the relationships among all words of an input sequence regardless of their positions. Unlike an LSTM, the Transformer handles the entire input sequence at once, rather than iterating over words one by one.
(3) mT5 is a Seq2Seq method that uses the Transformer framework. Currently, most downstream NLP tasks build their models by fine-tuning pre-trained language models [22]. mT5 is a massively multilingual pre-trained language model that is implemented in a unified "text-to-text" form to process different downstream NLP problems. In this study, we fine-tune an mT5-based approach to handle the CIP task.
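In the "text-to-text" form, a CIP pair becomes a plain (input string, target string) example for fine-tuning. The sketch below shows one plausible serialization; the task prefix is our assumption, since the paper does not specify its exact input format.

```python
def to_text2text(source, target, prefix="paraphrase: "):
    """Cast a CIP sentence pair into a text-to-text fine-tuning example."""
    return {"input_text": prefix + source, "target_text": target}

example = to_text2text(
    "虽然你已经犯下了错误,但是亡羊补牢也为时不晚。",
    "虽然你已经犯下了错误,但是现在改正也为时不晚。",
)
print(example["input_text"].startswith("paraphrase: "))  # → True
```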

Knowledge-based CIP method
We present an extension of Seq2Seq modeling that supplies idioms with their interpretations alongside the source sentences, so that the network can better leverage idioms. We refer to the proposed method as the knowledge-based Seq2Seq method. Figure 2 illustrates the process of encoding a sample sentence. The method first extracts a dictionary of idiom interpretations from the established CIP training dataset, and then concatenates original sentences with all idioms' interpretations. Finally, the concatenations are used for Seq2Seq modeling.
(1) To build the dictionary, we compute the edit sequence Diff between the source and target sentences of each training pair. We traverse Diff using the rule <'-', '+'>, where the rule means a '+' segment follows a '-' segment in Diff. In one example, we get the following two matching sequence pairs: <'约翰', '汤姆'> and <'以牙还牙', '也踢了他一脚'>. A sequence pair is ignored if it does not include an idiom. In this example, we obtain the sequence pair <'以牙还牙', '也踢了他一脚'>, where "以牙还牙" is an idiom and "也踢了他一脚" is the corresponding interpretation. Finally, we sort the different interpretations of each idiom by frequency, and collect the top three interpretations into the dictionary.
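The dictionary-building step can be sketched with the standard library's `difflib`, whose "replace" opcodes correspond to a '-' span followed by a '+' span in the edit sequence. The function name and data shapes are illustrative assumptions, not the paper's implementation.

```python
import difflib

def extract_interpretations(source, target, idioms):
    """Align deleted source spans with inserted target spans via an edit
    sequence, and keep only the pairs whose deleted side is an idiom."""
    sm = difflib.SequenceMatcher(None, source, target)
    pairs = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "replace":  # a '-' span immediately followed by a '+' span
            deleted, inserted = source[i1:i2], target[j1:j2]
            if deleted in idioms:
                pairs.append((deleted, inserted))
    return pairs

src = "汤姆踢了约翰,约翰以牙还牙。"
tgt = "汤姆踢了约翰,约翰也踢了他一脚。"
print(extract_interpretations(src, tgt, {"以牙还牙"}))  # → [('以牙还牙', '也踢了他一脚')]
```

Over the whole training set, the collected (idiom, interpretation) pairs would then be counted per idiom to keep the top three interpretations.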
(2) Given a training example {c, t}, if c contains idioms, all interpretations of the idioms are collected from the dictionary. In the case of multiple interpretations of one idiom, we concatenate the interpretations with spaces. It is noteworthy that if a sentence has multiple idioms, we concatenate the interpretations of different idioms with the symbol "<SEP>". Finally, the original sentence and the collected interpretations are concatenated and processed for Seq2Seq modeling.
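The concatenation step can be sketched as follows. The "</s>" and "<SEP>" delimiters follow Figure 2; the function name and the second sample interpretation are illustrative assumptions.

```python
def build_knowledge_input(sentence, idiom_interpretations):
    """Join each idiom's interpretations with spaces, separate different
    idioms with "<SEP>", and append the result to the sentence after "</s>"."""
    parts = [" ".join(interps) for interps in idiom_interpretations]
    return sentence + "</s>" + "<SEP>".join(parts)

src = "约翰踢了汤姆,汤姆以牙还牙。"
# two hypothetical dictionary interpretations for the single idiom 以牙还牙
augmented = build_knowledge_input(src, [["也踢了他一脚", "进行了回击"]])
print(augmented)  # → 约翰踢了汤姆,汤姆以牙还牙。</s>也踢了他一脚 进行了回击
```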
We prompt the Seq2Seq model to learn how to make use of the concatenated information. It is noteworthy that the knowledge-based method can be incorporated into all three Seq2Seq baselines; in our experiments, we adopt mT5-based Seq2Seq modeling as the base method. Given a sentence pair {c, t}, the above four CIP methods generate the whole target sentence t from the original sentence c. However, CIP merely requires rephrasing the idioms of the sentence, which means we only need to generate context-based interpretations of the idioms, rather than the whole sentence. Therefore, we propose a novel CIP method, denoted as the infill-based CIP method.

Infill-based CIP method
As shown in Figure 3, the proposed method takes advantage of mT5, which is pre-trained with span masked language modeling (MLM), to build a Seq2Seq model. In contrast to the MLM in BERT [23], span MLM replaces consecutive spans of input tokens with a mask token <X> and reconstructs them. With the help of mT5, the proposed method reconstructs the interpretation of the idiom I in sentence c by replacing the idiom with the token <X>.
To adapt mT5-based Seq2Seq modeling to CIP, we additionally make two modifications. (1) If the masked c were directly fed to the span MLM, the information of idiom I would be lost; we therefore concatenate the original c and the masked c as the input sequence. (2) When a sentence has multiple idioms, we recursively simplify each idiom into a corresponding representation.
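The infill input/target construction can be sketched as plain string manipulation. The "</s>" separator between the original and masked copies is an illustrative assumption based on Figure 3; the target paraphrase here reuses the Table 1 example.

```python
def build_infill_example(source, idiom, interpretation, mask="<X>"):
    """Concatenate the original sentence with a copy whose idiom is replaced
    by the mask token; the target is the idiom's context-based interpretation."""
    masked = source.replace(idiom, mask, 1)
    model_input = source + "</s>" + masked
    return model_input, interpretation

inp, tgt = build_infill_example(
    "虽然你已经犯下了错误,但是亡羊补牢也为时不晚。",
    "亡羊补牢", "现在改正")
print("<X>" in inp, tgt)  # → True 现在改正
```

At inference time, the model's reconstruction of the masked span is spliced back into the masked copy to yield the non-idiom-included sentence.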

EXPERIMENTS
In this section, we introduce the experiments conducted to handle the CIP problem. We first describe the system configurations, including implementation details, the metrics used to evaluate performance, and baseline methods.
Next, the performances of different CIP methods are assessed in Section 5.2. Then, we conduct human evaluations to survey the capabilities of the deployed CIP methods. In addition, we calculate the proportions of idiom-included sentences that are rephrased into non-idiom-included ones. Finally, a case study is presented in Section 5.5 that compares the outputs of the five methods on difficult Chinese idioms.
We implement the LSTM and Transformer methods using fairseq [24]. The mT5, Knowledge, and Infill methods are mT5-based and are implemented using Huggingface Transformers [25]. Furthermore, sentence tokenization is accomplished using the Jieba Chinese word segmenter and BPE tokenization. The size of the vocabulary is set to 32K. The LSTM-based Seq2Seq method adopts the Adam optimizer configured with β = (0.9, 0.98), a 3e−4 learning rate, and a 0.2 dropout rate. The Transformer-based Seq2Seq method maintains the hyperparameters of the base Transformer [21], which contains a six-layer encoder and a six-layer decoder. The three parameters (β of the Adam optimizer, learning rate, and dropout rate) in the Transformer-based method are the same as those in the LSTM-based method. It is noteworthy that the learning rate is gradually increased to 3e−4 over 4k warm-up steps and then decays according to the inverse square root schedule. The Knowledge and Infill methods are mT5-based Seq2Seq models. For mT5, Knowledge, and Infill, we fine-tune mT5 (base), which consists of 580M parameters. We train the three methods with the Adam optimizer and an initial learning rate of 3e−4 for up to 20 epochs, using early stopping on the development data: training is stopped when the accuracy on the development set does not improve within 5 epochs. We use beam search with 5 beams for inference.
Metrics. As mentioned above, the CIP task can be treated as a sentence paraphrasing task. Therefore, we apply four metrics commonly used to evaluate sentence paraphrasing: BLEU, BERTScore, ROUGE-1, and ROUGE-2. BLEU is a widely used machine translation metric, which measures the lexical overlap between system outputs and human references [26]. BERTScore is chosen as another metric due to its high correlation with human judgments [27]. In contrast to BLEU, BERTScore is measured using token-wise cosine similarity between representations produced by BERT. We measure the overlap between generated sentences and references using ROUGE scores [28]. ROUGE is often used to evaluate text summarization. The two metrics ROUGE-1 and ROUGE-2 refer to the overlap of unigrams and bigrams, respectively, between system outputs and references.
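The ROUGE-N idea can be illustrated with a toy F1 computation over character tokens; this is a simplified sketch, not the official ROUGE implementation or the exact scorer used in the paper.

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """Toy ROUGE-N F1: clipped n-gram overlap between tokenized candidate
    and reference sequences."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    c, r = ngrams(candidate), ngrams(reference)
    overlap = sum((c & r).values())  # clipped counts
    if overlap == 0:
        return 0.0
    precision, recall = overlap / sum(c.values()), overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

# one shared character ("出") out of four on each side → P = R = F1 = 0.25
print(rouge_n(list("很少出门"), list("深居简出"), n=1))  # → 0.25
```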
Baselines. We additionally provide two unsupervised methods for the CIP problem, namely Re-translation and BERT-CLS.
(1) Re-translation is implemented by utilizing the back-translation technique of machine translation. We first translate an idiom-included sentence into an English sentence using an off-the-shelf Chinese-English translation system, and then translate the generated English sentence using our trained English-Chinese translation system (introduced in Section 3.1) to generate a non-idiom-included Chinese sentence. The Chinese-English translation system is publicly accessible. The trained English-Chinese translation system is a Transformer-based Seq2Seq method.
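The two-step round trip can be sketched as follows. `zh_to_en` and `en_to_zh` are hypothetical stand-ins for the off-the-shelf Chinese-English system and the idiom-free English-Chinese system trained in Stage 1; no real translation API is implied.

```python
def retranslate(zh_sentence, zh_to_en, en_to_zh):
    """Re-translation baseline: Chinese -> English -> idiom-free Chinese."""
    english = zh_to_en(zh_sentence)  # step 1: translate into English
    return en_to_zh(english)         # step 2: back into idiom-free Chinese

# dummy translators standing in for the two MT systems
out = retranslate(
    "但是亡羊补牢也为时不晚。",
    lambda zh: "But it's not too late to mend it.",
    lambda en: "但是现在改正也为时不晚。",
)
print(out)  # → 但是现在改正也为时不晚。
```

Because the English-Chinese system never saw idioms during training, the round trip tends to eliminate them, at the cost of possible meaning drift through the pivot language.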
(2) BERT-CLS is an existing BERT-based Chinese lexical simplification method [16]. In this task, an idiom is treated as a complex word to be replaced with a simpler word.

Performance of CIP Method
Table 4 summarizes the evaluation results on our established CIP dataset. The five supervised CIP methods are significantly better than the two unsupervised methods (Re-translation and BERT-CLS) on all four metrics. The results reveal that the dataset is a high-quality corpus that benefits the CIP task.
Table 4 shows that the performance of the LSTM-based method is worse than that of the other four methods on the in-domain and out-of-domain test datasets. In general, the three mT5-based CIP methods (mT5, Infill, and Knowledge) outperform the other two methods (LSTM and Transformer), which suggests that CIP methods fine-tuned on mT5 can significantly improve CIP performance.

TABLE 5
The results of human evaluation. "Simp." denotes "simplicity"; "Avg" denotes "average".
It can be observed from Table 4 that the knowledge-based method yields the best results on the in-domain test set among the three mT5-based methods, because the incorporated knowledge is extracted from the training dataset, which allows the neural network to learn how to make use of the attached knowledge. The Infill method outperforms mT5 and Knowledge on the out-of-domain test set, despite using less input information than its knowledge-based counterpart. This verifies that Infill is quite effective.

Human Evaluation
To further evaluate the CIP methods, we adopt human evaluation to analyze the deployed CIP methods. We manually select 32 infrequent and easily misused idioms. We choose 30 sentences from the in-domain test set and 32 sentences from the out-of-domain test set. We ask four native speakers to rate each generated sentence on three features: simplicity, meaning, and fluency. A five-point Likert scale is adopted to rate these features, and the average score of each feature is calculated.
(1) Simplicity evaluates whether the rephrased idioms of generated sentences are easily understandable, which means idioms in original sentences should be rewritten with simpler and more common words. (2) Meaning assesses whether generated sentences preserve the meaning of the original sentences. (3) Fluency judges whether a generated sentence is fluent and free of grammatical errors.
The results of the human evaluation are shown in Table 5. We also calculate the scores of the annotated sentences t, denoted as Reference.
We first analyze the results of the CIP methods on the in-domain test set. The infill-based mT5 method outperforms the other methods, which means Infill is an effective method for the CIP task. The human evaluation performance of the knowledge-based CIP method is worse than that of LSTM, Transformer, and mT5. Since the knowledge-based method has the best automatic-metric results among all the methods (see Table 4), this is inconsistent with the human evaluation results. We also notice that three idioms are wrongly rephrased due to the "attached" information: if the "attached" information is unsuitable for an example, it may bring noise into the network.
Then, we analyze the results on the out-of-domain test set. The scores of three CIP methods (Transformer, Infill, and Knowledge) are slightly higher than the scores of Reference. Although this seems abnormal, the sentences generated by the CIP methods are certainly of high quality. The out-of-domain test set was individually collected by native Chinese crowd-workers without human-machine collaboration. Crowd-workers often adopt limited writing strategies to speed up the establishment of a dataset, which harms the diversity of the dataset [11], [29]. The quality of the out-of-domain test set can be further improved.

Proportion of Idiom Paraphrasing
CIP aims to rephrase an input idiom-included sentence into a meaning-preserving, non-idiom-included sentence. In this subsection, we count the number of idiom-included sentences that are rephrased into non-idiom-included sentences. The results are shown in Table 7. Re-translation achieves the best result: it rephrases almost all idioms into non-idiom-included representations. This means that our idea of constructing the CIP dataset using machine translation is feasible. Theoretically, if the trained English-Chinese machine translation system (Stage 2 in the pipeline) could output high-quality results, we would not need to ask annotators to revise the Chinese translations. We observe that the proportions of the CIP methods (LSTM, Transformer, mT5, Knowledge, and Infill) are nearly 90%, which means they have great potential for idiom paraphrasing. Moreover, a tiny portion of idioms cannot be rephrased, because some idioms are simple and are therefore retained in the training set.

TABLE 7
The proportion between the number of paraphrased idioms and the total number of idioms.
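A per-sentence variant of this statistic can be sketched in a few lines: the fraction of generated sentences that contain no idiom from the list. The function name and data layout are illustrative assumptions.

```python
def idiom_free_proportion(outputs, idiom_list):
    """Fraction of generated sentences containing no idiom from the list."""
    free = sum(1 for s in outputs if not any(idiom in s for idiom in idiom_list))
    return free / len(outputs)

outputs = ["但是现在改正也为时不晚。", "但是亡羊补牢也为时不晚。"]
prop = idiom_free_proportion(outputs, ["亡羊补牢"])
print(prop)  # → 0.5
```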

Case Study
We first choose five complicated Chinese idioms: "缘木求鱼" (climb trees to catch fish), "以儆效尤" (warn others against following a bad example), "名落孙山" (fail in official examinations), "讳莫如深" (carefully conceal by not mentioning), and "概莫能外" (admit of no exception whatsoever). Then, we select a sample from our CIP test set for each idiom, and analyze the output sentences generated by the five CIP methods.
Table 6 shows the sentences rephrased by the five CIP methods. In general, the methods trained on the established CIP dataset are capable of generating meaning-preserving and non-idiom-included sentences. In these examples, all the generated sentences are simpler than the original sentences, even though these idioms are challenging even for native speakers.

CONCLUSION
In this paper, we propose a novel Chinese idiom paraphrasing (CIP) task, which aims to rephrase idiom-included sentences into non-idiom-included ones. The CIP task can be treated as a special paraphrase generation task and can be handled by sequence-to-sequence (Seq2Seq) modeling. We construct a large-scale training dataset for CIP through collaboration between humans and machines. Specifically, we first design a framework to construct a pseudo-CIP dataset and then ask workers to revise and evaluate the dataset. In this study, we deploy three Seq2Seq methods and propose two novel CIP methods (Infill and Knowledge) for the CIP task. Experimental results reveal that all five methods trained on our dataset yield good results. In subsequent research, the two proposed CIP methods will serve as strong baselines, and the established dataset will be used to accelerate the study of this topic. In the future, we will keep exploring the abilities of our CIP methods in solving the problems of Chinese idiom understanding and Chinese idiom representation.

Fig. 1 .
Fig. 1. A pipelined illustration of creating a CIP dataset based on a Chinese-English machine translation (MT) corpus. (a) The corpus is split into an idiom-included sub-corpus and a non-idiom-included sub-corpus based on a Chinese idiom list; (b) we train an MT system using the non-idiom-included sub-corpus, and create a pseudo-CIP dataset by pairing the original Chinese idiom-included sentences with the non-idiom-included sentences translated by the trained MT system; (c) we ask human annotators to revise the translated Chinese sentences of the pairs to strengthen the quality of the created CIP dataset.

Fig. 2 .
Fig. 2. The knowledge-based Seq2Seq method incorporates idiom interpretations into Seq2Seq modeling. The words marked in red are idioms, and the words marked in blue are the corresponding interpretations. "</s>" and "<SEP>" are special symbols.

Fig. 3 .
Fig. 3. An example of the infill-based CIP method. The input sequence fed into the mT5-based Seq2Seq model is composed of an original sentence and a copy of the sentence in which the idiom is replaced by the special symbol "<X>". The interpretation of the idiom, rather than the whole target sentence, is treated as the reference.

TABLE 2
The statistics of the CIP dataset.

TABLE 3
Examples containing the idiom "深居简出" in the CIP dataset. c and e form a machine translation sentence pair; t is the CIP reference sentence of c, generated by collaborating machine translation and human revision.
c: 约翰并不有钱,他住在海边,深居简出。
e: John is not rich; he lives in a small way near the coast.
t: 约翰并不有钱,他住在海边,很少出门。
c: 她除了演唱外,其余时间则深居简出。
e: She seldom goes out at other times, except when she sings.

TABLE 6
Output examples from different methods. The idioms of the sentences are marked in red.