Abstract
Idioms are a kind of idiomatic expression in Chinese, most of which consist of four Chinese characters. Because of their non-compositionality and metaphorical meaning, Chinese idioms are hard for children and non-native speakers to understand. This study proposes a novel task, denoted as Chinese Idiom Paraphrasing (CIP). CIP aims to rephrase idiom-containing sentences into non-idiomatic ones while preserving the original sentence’s meaning. Since sentences without idioms are more easily handled by Chinese NLP systems, CIP can be used to pre-process Chinese datasets, thereby facilitating and improving the performance of Chinese NLP tasks, e.g., machine translation systems, Chinese idiom cloze, and Chinese idiom embeddings. In this study, we treat the CIP task as a special paraphrase generation task. To circumvent difficulties in acquiring annotations, we first establish a large-scale CIP dataset based on human and machine collaboration, which consists of 115,529 sentence pairs. In addition to three sequence-to-sequence methods used as baselines, we further propose a novel infill-based approach built on text infilling. The results show that the proposed method performs better than the baselines on the established CIP dataset.
1 Introduction
Idioms, called “成语” (ChengYu) in Chinese, are widely used in daily communication and in various literary genres. Idioms are compact Chinese expressions that consist of few words but imply relatively complex social nuances. Moreover, Chinese idioms are often used to describe similar phenomena, events, etc., which means that idioms cannot be interpreted by their literal meanings in some cases. Thus, it has always been a challenge for non-native speakers, and even native speakers, to recognize Chinese idioms (Zheng et al., 2019). For instance, the idiom “鱼肉百姓” (YuRouBaiXing), shown in Figure 1, means “oppress the people” rather than its literal meaning, “fish meat the people”.
In real life, when people do not understand an idiom, we explain it by converting it into a set of word segments that give a more intuitive and understandable paraphrase. In this study, we use computational approaches to automatically rephrase idiom-containing sentences into simpler, non-idiom-containing sentences while preserving their meaning in context, which can benefit both Chinese natural language processing and societal applications.
Since idioms are an obstacle for many NLP tasks, CIP can be used as a pre-processing phase that facilitates and improves the performance of machine translation systems (Ho et al., 2014; Shao et al., 2018), Chinese idiom cloze (Jiang et al., 2018; Zheng et al., 2019), and Chinese idiom embeddings (Tan and Jiang, 2021). Furthermore, CIP-based applications can help specific groups, such as children, non-native speakers, and people with cognitive disabilities, to improve their reading comprehension.
We propose a new task in this study, denoted as Chinese Idiom Paraphrasing (CIP), which aims to rephrase idiom-containing sentences into fluent, intuitive, and meaning-preserving non-idiom-containing sentences. We treat the CIP task as a special paraphrase generation task. The general paraphrase generation task aims to rephrase a given sentence into another one that has identical semantics but different wording or syntax (Kadotani et al., 2021; Lu et al., 2021). Similarly, CIP emphasizes rephrasing the idioms of input sentences into word segments that give a more intuitive and understandable paraphrase. In recent decades, many researchers working on paraphrase generation (McKeown, 1979; Meteer and Shaked, 1988) have struggled with the lack of reliable supervised datasets (Meng et al., 2021). Motivated by this challenge, we establish a large-scale training dataset for the CIP task in this work.
Contributions. This study produces two main contributions toward the development of CIP systems.
First, a large-scale benchmark is established for the CIP task. The benchmark comprises 115,529 sentence pairs covering 8,421 distinct idioms. A recurrent challenge in crowdsourcing NLP-oriented datasets at scale is that human writers frequently rely on repetitive patterns to craft examples, leading to a lack of linguistic diversity (Liu et al., 2022). A new large-scale CIP dataset is created in this study by taking advantage of the collaboration between humans and machines.
In detail, we initially divide a large-scale Chinese-English machine translation corpus into two parts (idiom-containing sub-corpus, and non-idiom-containing sub-corpus) by judging if a Chinese sentence contains idioms. Next, we train an English-to-Chinese machine translation (MT) system using the non-idiom-containing sub-corpus. Because the training corpus for the MT system does not include any idioms, the MT system will not translate input English sentences to idiom-containing Chinese sentences. Then, the MT system is deployed to translate English sentences of the idiom-containing sub-corpus to the non-idiom-containing sentences. A large-scale pseudo-parallel CIP dataset can be constructed by pairing the idiom-containing sentences of idiom-containing sub-corpus and the translated non-idiom-containing sentences. Finally, we employ native speakers to validate the generated sentences and modify defective sentences if necessary.
Second, we propose a novel infill-based method to rephrase the input idiom-containing sentence. Since the constructed dataset is used as the training dataset, we treat the CIP task as a paraphrase generation task. We adopt three different sequence-to-sequence (Seq2Seq) methods as baselines: an LSTM-based approach, a Transformer-based approach, and an mT5-based approach, where mT5 is a massively multilingual pre-trained text-to-text Transformer (Xue et al., 2021). Our proposed infill-based method is only required to rephrase the idioms of the sentence, which means that we only need to generate context-based interpretations of idioms, rather than the whole sentence. Specifically, each CIP sentence pair is processed into a (corrupted) input sentence, obtained by replacing the idiom of the source sentence with a blank, and a corresponding target extracted from the simplified sentence. The mT5-based CIP method is fine-tuned to reconstruct the corresponding target. Experimental results show that, compared with the baselines evaluated on the constructed CIP dataset, our infill-based method can output high-quality paraphrases that are grammatically correct and semantically appropriate.
As the use of the Chinese language becomes more widespread, the need for effective Chinese paraphrasing methods may increase, leading to further research and development in this area. The constructed dataset and employed baselines that are used to accelerate this research are open-source, available on Github.2
2 Related Work
Paraphrase Generation:
Paraphrase generation aims to extract paraphrases of given sentences. The extracted paraphrases can preserve the original meaning of the sentence, but are assembled with different words or syntactic structures (McKeown, 1979; Meteer and Shaked, 1988; Zhou and Bhat, 2021).
Most recent neural paraphrase generation methods primarily take advantage of the sequence-to-sequence framework, which can achieve considerable performance improvements compared with traditional approaches (Zhou and Bhat, 2021). Some approaches use reinforcement learning or multi-task learning to improve the quality and diversity of generated paraphrases (Xie et al., 2022). A long-standing issue in paraphrase generation studies is the lack of reliable supervised datasets. The issue can be avoided by constructing manually annotated paired-paraphrase datasets (Kadotani et al., 2021) or designing unsupervised paraphrase generation methods (Meng et al., 2021).
Different from existing paraphrase generation research, we focus on Chinese idiom paraphrasing, which rephrases idiom-containing sentences into non-idiom-containing ones.
Text Infilling: Originating from cloze tests (Taylor, 1953), text infilling aims to fill in missing blanks in a sentence or paragraph by making use of the preceding and subsequent text, to make the text complete and meaningful.
Current text infilling methods may be categorized into four groups. GAN-based methods train GANs to ensure that the generator generates highly dependable infilling content that can trick the discriminator (Fedus et al., 2018). Intricate inference-based methods use dynamic programming or gradient search to locate infilling content that is highly probable within its surrounding context (Zaidi et al., 2020). Masked LM-based methods generate infilling content based on its bidirectional contextual word embedding (Shen et al., 2020). LM-based methods fine-tune off-the-shelf LMs in an auto-regressive manner, and some approaches modify the input format by putting an infilling answer after the masked input (Donahue et al., 2020), whereas others do not modify the input format (Zhu et al., 2019). In contrast to the aforementioned methods, our goal in this paper is not only to make the text complete, but also to maintain the sentence’s meaning when creating paraphrases. As a result, we employ a sequence-to-sequence framework to identify infilling content.
Idioms: Idioms are an interesting linguistic phenomenon in the Chinese language. Compared with other types of words, most idioms are unique in terms of non-compositionality and metaphorical meaning. Idiom understanding plays an important role in the research area of Chinese language understanding. Many studies related to Chinese idiom understanding have been proposed that can benefit a variety of downstream tasks. For example, Shao et al. (2018) focused on evaluating the quality of idiom translation of machine translation systems. Zheng et al. (2019) provided a benchmark to assess the abilities of multiple models on Chinese idiom-based cloze tests, and evaluated how well the models can comprehend Chinese idiom-containing texts. Liu et al. (2019) studied how to improve essay writing skills by recommending Chinese idioms. Tan and Jiang (2021) investigated the learning and quality evaluation of Chinese idiom embeddings. In this paper, we study a novel CIP task that is different from the above tasks. Since the proposed CIP method can rephrase idiom-containing sentences into non-idiom-containing ones, it is expected that CIP can benefit tasks related to idiom representation and idiom translation.
Pershina et al. (2015) studied a new task of English idiom paraphrasing, which aims to determine whether two idioms have the same or similar meanings. They collected idioms’ definitions from a dictionary and used word embedding models to represent idioms and calculate the similarity between two idioms. Qiang et al. (2021) proposed a Chinese lexical simplification method, which focuses on replacing complex words in given sentences with simpler and meaning-equivalent alternatives. It is noteworthy that the substitutes in Chinese lexical simplification are all single words, whereas an idiom typically cannot be replaced by a single word that expresses the original concept or idea.
3 Human and Machine Collaborative Dataset Construction
This section describes the process of constructing a large-scale parallel dataset for CIP. A qualified CIP dataset needs to meet the following two requirements: (1) The two sentences in a sentence pair have to convey the same meaning; and (2) A sentence pair has to contain an idiom-containing sentence and a non-idiom-containing one. We outline a three-stage pipeline for dataset construction, which takes advantage of both the generative strength of machine translation (MT) methods and the evaluative strength of human annotators. Human annotators are generally reliable in correcting examples, but crafting diverse and creative examples at scale is challenging for them. Therefore, we deploy a machine translator to automatically create an initial CIP dataset, and then ask annotators to proofread each generated instance.
3.1 Pipeline
Figure 2 shows the details of the pipeline. Our pipeline starts with an existing English-Chinese machine translation (MT) dataset. First, we use a collected Chinese idiom list to split the MT dataset into two parts: a non-idiom-containing sub-dataset and an idiom-containing sub-dataset (Stage 1). All data items in both sub-datasets are sentence pairs. Then, we train a neural machine translation system ℳ on the non-idiom-containing sub-dataset, so that it translates English sentences into non-idiom-containing Chinese sentences. Subsequently, we feed the English sentences of the idiom-containing sub-dataset to ℳ to obtain non-idiom-containing Chinese sentences. Afterward, the Chinese sentences of the idiom-containing sub-dataset and the generated sentences are paired to construct a large-scale initial parallel CIP dataset (Stage 2). Finally, the constructed dataset is reviewed and revised by annotators for quality assurance (Stage 3).
Stage 1: Corpus Segmentation. The English-Chinese MT dataset used in this research is taken from WMT18 (Bojar et al., 2018) and contains 24,752,392 sentence pairs. We extract a Chinese idiom list that contains 31,114 idioms.3 Since the list lets us determine whether the Chinese sentence in a pair contains idioms, the MT dataset can be split into the non-idiom-containing sub-dataset and the idiom-containing sub-dataset. The non-idiom-containing sub-dataset is used to train a special MT system ℳ that translates English sentences into non-idiom-containing Chinese sentences. In our experiments, only 0.2% of the translated Chinese sentences contain idioms (see Table 6). After removing redundant Chinese sentences, the number of sentence pairs in the idiom-containing sub-dataset is 105,559.
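The split in Stage 1 can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes that an idiom is detected by plain substring matching against the idiom list and that "redundant" means an exact repeat of a Chinese sentence. The names `contains_idiom` and `split_corpus` are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Pair = Tuple[str, str]  # (Chinese sentence, English sentence)


def contains_idiom(sentence: str, idioms: Set[str]) -> bool:
    """Assumption: an idiom is present if it occurs as a substring."""
    return any(idiom in sentence for idiom in idioms)


def split_corpus(pairs: Iterable[Pair], idioms: Set[str]) -> Tuple[List[Pair], List[Pair]]:
    """Split an MT corpus into (non-idiom pairs, idiom pairs).

    Repeated Chinese sentences are dropped from the idiom-containing side.
    """
    non_idiom: List[Pair] = []
    with_idiom: List[Pair] = []
    seen: Set[str] = set()
    for zh, en in pairs:
        if contains_idiom(zh, idioms):
            if zh not in seen:
                seen.add(zh)
                with_idiom.append((zh, en))
        else:
            non_idiom.append((zh, en))
    return non_idiom, with_idiom
```

A linear scan over the idiom list is slow for a corpus of this size; a production version would likely use a multi-pattern matcher, but the logic stays the same.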
Stage 2: Pseudo-CIP Dataset. Given a sentence pair (c_i, e_i) in the idiom-containing sub-dataset, we input the English sentence e_i into the MT system ℳ and obtain a Chinese translation t_i. We pair the Chinese sentence c_i and the Chinese translation t_i as a pseudo-CIP sentence pair. Thus, a CIP dataset can be built by pairing the original Chinese sentences with the corresponding English-to-Chinese translations. The pseudo-CIP dataset meets the two requirements of CIP dataset construction. On the one hand, the pseudo-CIP data come from the MT dataset, which can guarantee that the paired sentences deliver the same meaning. On the other hand, all original sentences include one or more idioms, and the translated sentences do not contain idioms.
Stage 3: Human Review. As the final stage of the pipeline, we recruit five human annotators to review each sentence pair (c_i, t_i) in the pseudo-CIP dataset. These annotators are all undergraduate native Chinese speakers. Given (c_i, t_i), annotators are asked to revise and improve the quality of t_i, which is required to be non-idiom-containing and fully meaning-preserving.
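A rough sketch of Stage 2 follows. The `translate` argument stands in for the trained English-to-Chinese system ℳ and is a hypothetical callable. Flagging translations that still contain a listed idiom is an assumption of this sketch, meant to make the rare residual cases easy to route to human review.

```python
from typing import Callable, Iterable, List, Set, Tuple


def build_pseudo_cip(
    idiom_pairs: Iterable[Tuple[str, str]],
    translate: Callable[[str], str],
    idioms: Set[str],
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Pair each idiom-containing Chinese sentence with an MT output.

    Returns (clean pairs, flagged pairs); a pair is flagged when the
    translation still contains an idiom from the list.
    """
    clean, flagged = [], []
    for zh, en in idiom_pairs:
        translation = translate(en)  # hypothetical stand-in for system M
        pair = (zh, translation)
        if any(idiom in translation for idiom in idioms):
            flagged.append(pair)
        else:
            clean.append(pair)
    return clean, flagged
```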
3.2 Corpus Statistics
The statistical details of the CIP dataset are shown in Table 1. The dataset constructed by the pipeline above is treated as in-domain data; it contains 105,559 instances covering 8,261 different idioms. It is partitioned into three parts: a training set Train, a development set Dev, and a test set Test. The numbers of instances in Train, Dev, and Test are 95,560, 5,000, and 4,999, respectively.
Table 1: Statistics of the CIP dataset.

Group | Statistic | In-domain Train | In-domain Dev | In-domain Test | Out-of-domain Dev | Out-of-domain Test | Total
---|---|---|---|---|---|---|---
 | Sentence pairs | 95,560 | 5,000 | 4,999 | 4,994 | 4,976 | 115,529
Source sentence | Tokens | 3,390,179 | 173,001 | 169,793 | 225,850 | 221,673 | 4,180,496
Source sentence | Avg. sentence length | 35 | 35 | 34 | 45 | 45 | 36
Source sentence | All idioms | 102,997 | 5,423 | 5,494 | 5,808 | 5,800 | 251,055
Source sentence | Unique idioms | 7,609 | 5,225 | 5,279 | 5,149 | 5,128 | 8,421
Reference sentence | Tokens | 3,454,127 | 175,083 | 172,028 | 239,578 | 224,907 | 4,265,723
Reference sentence | Avg. sentence length | 36 | 35 | 34 | 48 | 45 | 36
Reference sentence | Avg. edit distance | 7.85 | 7.26 | 7.37 | 6.21 | 5.36 | 7.62
We observe that both the Train and Test datasets come from the same distribution. However, when models are deployed in real-world applications, inference might be performed on data from different distributions, i.e., out-of-domain data (Desai and Durrett, 2020). Therefore, we additionally collected 9,970 sentences with idioms from modern vernacular classics, including prose and fiction, as out-of-domain data, to assess the generalization ability of CIP methods. Unlike the MT corpus, these sentences have no English sentences as references, so we manually rewrite them into non-idiom-containing sentences with the help of native Chinese speakers.
There are three significant differences between in-domain and out-of-domain data. First, the average length of sentences in in-domain data is around 35 words, while it is about 45 words for out-of-domain data. Second, the average number of idioms per sentence in in-domain data is 1.07, which is lower than that of out-of-domain data (i.e., 1.17). Third, the sentence pairs in out-of-domain data need fewer modifications than those in in-domain data. In this case, a lack of linguistic diversity might arise, because human annotators often rely on repetitive patterns to generate sentences.
To verify the scalability and generalization ability of the CIP methods, we adopt the following strategy to construct Dev and Test. We counted the frequency of each idiom in the corpus, where the minimum and the maximum idiom frequency are 1 and 68, respectively. Based on the number of idioms in each frequency interval, we extract the instances into the Dev and Test. The idiom frequency statistics on the Dev and Test are shown in Table 2. We can see that those low-frequency idioms occupy a higher proportion of all the idiom occurrences (62.76% and 62.71% for low-frequency interval [0,20) in in-domain Dev and Test). There are 421 and 415 instances containing idioms in in-domain Dev and Test that are never seen in the Train.
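The frequency-interval bookkeeping can be sketched as follows. The interval edges are taken from the column headers of Table 2. Counting frequencies over the training sentences, matching idioms by substring, and labeling the last interval `50+` are assumptions of this sketch, not details stated in the text.

```python
from collections import Counter
from typing import Iterable, Set

# Interval edges taken from the column headers of Table 2.
EDGES = [1, 10, 20, 30, 40, 50]


def interval_label(freq: int) -> str:
    """Map a training-set frequency to its interval label."""
    if freq == 0:
        return "0"
    for low, high in zip(EDGES, EDGES[1:]):
        if low <= freq < high:
            return f"[{low},{high})"
    return "50+"  # everything from 50 up to the maximum frequency


def idiom_frequencies(sentences: Iterable[str], idioms: Set[str]) -> Counter:
    """Count in how many sentences each idiom occurs (substring match)."""
    counts: Counter = Counter()
    for sentence in sentences:
        for idiom in idioms:
            if idiom in sentence:
                counts[idiom] += 1
    return counts


def bucket_counts(eval_sentences: Iterable[str], train_freq: Counter, idioms: Set[str]) -> Counter:
    """Count idiom occurrences in an evaluation set per frequency interval."""
    buckets: Counter = Counter()
    for sentence in eval_sentences:
        for idiom in idioms:
            if idiom in sentence:
                buckets[interval_label(train_freq[idiom])] += 1
    return buckets
```

The `0` bucket then holds exactly the occurrences of idioms that never appear in the training sentences.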
Table 2: Idiom frequency statistics on the Dev and Test sets.

Domain | Split | 0 | [1,10) | [10,20) | [20,30) | [30,40) | [40,50) | [50,68)
---|---|---|---|---|---|---|---|---
In | Dev | 415 | 1,787 | 941 | 1,159 | 1,026 | 67 | 28
In | Test | 421 | 1,814 | 946 | 1,171 | 1,057 | 60 | 25
Out | Dev | 279 | 1,871 | 808 | 1,120 | 1,643 | 62 | 25
Out | Test | 284 | 1,854 | 810 | 1,108 | 1,657 | 66 | 21
3.3 Some Examples in the CIP Dataset
We present some examples of the idiom “深居简出” (reclusive) in the CIP dataset, shown in Table 3. The idiom “深居简出” can be rephrased with different descriptions, displaying the linguistic diversity.
4 Methods
Based on our constructed CIP dataset, the CIP task can be treated as a sentence paraphrasing task (Section 4.1). Additionally, we propose a novel infill-based method to solve it (Section 4.2).
4.1 Paraphrasing for CIP
4.2 Infill-based CIP Method
Given a sentence pair {c, t}, the Seq2Seq methods generate the whole target sentence t from the original sentence c. However, CIP merely requires us to rephrase the idioms of the sentence, which means we only need to generate context-based interpretations of idioms, rather than the whole sentence. Text infilling (Zhu et al., 2019; Xiao et al., 2022) is a task that fills in missing text segments of a sentence using a model trained on a large amount of data in a fill-in-the-blank format. Inspired by the work on text infilling, we propose a novel CIP method, denoted as the infill-based CIP method. Suppose that we edit c by replacing one idiom with a blank, and that the interpretation y of the idiom is a sequence of words. The infill-based CIP method aims to generate y to fill the blank in the edited sentence.
Extracting Interpretation. Considering that sentence t does not only rephrase the part of the idioms, we cannot directly extract the interpretation for each idiom from t. We adopt the following method to extract the interpretations of the idioms for {c, t}.
The interpretation for a given sentence pair {c, t} is extracted by computing edit operations. Suppose that c is “约翰踢了我一脚, 所以我以牙还牙.” (John kicked me, so I tit for tat.), and t is “汤姆踢我一脚吧, 所以我也踢了他一脚.” (Tom kicked me, so I kicked him too.), where the characters of the sentence are split by spaces. We construct an edit sequence Diff by matching all the words in both c and t using three edit operations (‘=’, ‘−’, ‘+’), where ‘=’, ‘−’, and ‘+’ represent the ‘keep’, ‘delete’, and ‘add’ operations, respectively.4 The output Diff is “(‘−’, ‘约翰’), (‘+’, ‘汤姆’), (‘=’, ‘踢’), (‘−’, ‘了’), (‘=’, ‘我一脚’), (‘+’, ‘吧’), (‘=’, ‘, 所以我’), (‘−’, ‘以牙还牙’), (‘+’, ‘也踢了他一脚’), (‘=’, ‘.’)”. Specifically, the tuple (‘−’, ‘约翰’) indicates that ‘约翰’ appears in c but not in t; (‘+’, ‘汤姆’) denotes that ‘汤姆’ appears in t but not in c; and (‘=’, ‘踢’) represents that ‘踢’ is in both c and t.
We traverse the edit sequence Diff using the rule <‘−’, ‘+’>, which matches a ‘−’ operation that is immediately followed by a ‘+’ operation in Diff. We get the following two matching sequence pairs: <‘约翰’, ‘汤姆’> and <‘以牙还牙’, ‘也踢了他一脚’>. A sequence pair is discarded if its deleted part contains no idiom. In this example, we obtain the sequence pair <‘以牙还牙’, ‘也踢了他一脚’>, where “以牙还牙” and “也踢了他一脚” are an idiom and its corresponding interpretation.
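The extraction step can be sketched with Python's standard `difflib`, used here as a stand-in for whichever diff tool the authors rely on. The sketch works at the character level and treats a `replace` opcode as a ‘−’ followed by a ‘+’, which corresponds to the <‘−’, ‘+’> rule. The function names are hypothetical, and both example sentences are written with a trailing period.

```python
import difflib
from typing import List, Set, Tuple


def edit_sequence(c: str, t: str) -> List[Tuple[str, str]]:
    """Return a list of ('=' | '-' | '+', text) operations turning c into t."""
    ops: List[Tuple[str, str]] = []
    matcher = difflib.SequenceMatcher(None, c, t, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("=", c[i1:i2]))
        if tag in ("delete", "replace"):
            ops.append(("-", c[i1:i2]))
        if tag in ("insert", "replace"):
            ops.append(("+", t[j1:j2]))
    return ops


def extract_interpretations(c: str, t: str, idioms: Set[str]) -> List[Tuple[str, str]]:
    """Keep each '-' segment directly followed by a '+' segment,
    provided the deleted segment contains an idiom from the list."""
    ops = edit_sequence(c, t)
    pairs = []
    for (op1, deleted), (op2, added) in zip(ops, ops[1:]):
        if op1 == "-" and op2 == "+" and any(i in deleted for i in idioms):
            pairs.append((deleted, added))
    return pairs


c = "约翰踢了我一脚, 所以我以牙还牙."
t = "汤姆踢我一脚吧, 所以我也踢了他一脚."
print(extract_interpretations(c, t, {"以牙还牙"}))
```

On this example the pair <‘约翰’, ‘汤姆’> also matches the rule but is dropped because it contains no idiom. Other diff tools may align longer sentences differently, so the extracted spans should be treated as approximate.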
Training. Given one sentence pair {c, t}, we first construct a new pair whose input is the blanked sentence (c with one idiom replaced by a blank) followed by <sep> and the original sentence c, and whose target is the interpretation y, as shown in Figure 3. Then, we employ the Seq2Seq methods to accomplish this task, as shown in Figure 4. Here, we make two modifications. (1) If only the blanked sentence is fed to the encoder, the information about the idiom in c is lost. We therefore concatenate the blanked sentence and the original sentence c as the input sequence. (2) When a sentence has two or more idioms, we construct one such pair for each idiom in c. We use only one blank for one idiom instead of multiple blanks for all idioms, because this preserves enough information when generating the sequence that fills the blank.
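The construction of training examples can be sketched as below. The literal `<blank>` marker is a hypothetical placeholder (the actual blank token used with mT5 is not specified here), and `build_infill_examples` is an illustrative name. Only the first occurrence of each idiom is blanked, which is an assumption for sentences that repeat an idiom.

```python
from typing import List, Tuple

BLANK = "<blank>"  # hypothetical placeholder for the blank token
SEP = "<sep>"


def build_infill_examples(
    c: str, interpretations: List[Tuple[str, str]]
) -> List[Tuple[str, str]]:
    """Build one (input, target) pair per idiom of the source sentence c.

    Each input is the blanked sentence, the separator, and the original
    sentence; the target is the interpretation y of the blanked idiom.
    """
    examples = []
    for idiom, y in interpretations:
        blanked = c.replace(idiom, BLANK, 1)  # blank one idiom only
        examples.append((f"{blanked} {SEP} {c}", y))
    return examples
```

For a sentence with two idioms this yields two examples, each with a single blank, so the other idiom stays visible as context.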
During inference, if a sentence has multiple idioms, we iteratively decode each idiom into its corresponding interpretation.
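The iterative decoding can be sketched as follows. The `generate` argument is a hypothetical callable that wraps the fine-tuned model. Processing idioms from left to right and passing the partially rewritten sentence as the second segment of the input are assumptions of this sketch; `<blank>` is again a placeholder token.

```python
from typing import Callable, Set

BLANK = "<blank>"  # hypothetical placeholder for the blank token
SEP = "<sep>"


def paraphrase(c: str, idioms: Set[str], generate: Callable[[str], str]) -> str:
    """Rewrite every listed idiom in c, one idiom per decoding step."""
    # Process idioms in the order in which they appear in the sentence.
    found = sorted((i for i in idioms if i in c), key=c.find)
    current = c
    for idiom in found:
        blanked = current.replace(idiom, BLANK, 1)
        y = generate(f"{blanked} {SEP} {current}")  # model fills the blank
        current = current.replace(idiom, y, 1)
    return current
```

Because each step sees the output of the previous one, later interpretations can take earlier rewrites into account.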
Relation to Previous Work. Compared with the sentence paraphrasing task (Zhou and Bhat, 2021; Xie et al., 2022), our infill-based method only requires us to rephrase the idioms of the sentence, rather than the whole sentence. Actually, our method is inspired by the text infilling method of Zhu et al. (2019). But our method is different from the existing text infilling method, because our aim is to rephrase the original sentence, and the aim of text infilling is to make the text complete and meaningful.
5 Experiments
5.1 Experiment Setup
Implementation Details.
In our experiments, four CIP methods are deployed: LSTM-based Seq2Seq modeling (LSTM), Transformer-based Seq2Seq modeling (Transformer), mT5-based Seq2Seq modeling (mT5), and the infill-based CIP method (Infill). We implement the LSTM and Transformer methods using fairseq (Ott et al., 2019). The mT5 and Infill methods are mT5-based and are implemented using HuggingFace transformers (Wolf et al., 2020). Sentence tokenization is performed using the Jieba Chinese word segmenter5 and BPE tokenization. The vocabulary size is set to 32K. The LSTM-based Seq2Seq method adopts the Adam optimizer configured with β = (0.9, 0.98), a learning rate of 3e−4, and a dropout rate of 0.2. The Transformer-based Seq2Seq method keeps the hyperparameters of the base Transformer (Vaswani et al., 2017), which contains a six-layer encoder and a six-layer decoder. The three parameters (β of the Adam optimizer, learning rate, and dropout rate) in the Transformer-based method are the same as those in the LSTM-based method. Note that the learning rate is gradually increased to 3e−4 over 4k steps and then decays according to the inverse square root schedule. For mT5 and Infill, we adopt the mT5 version that is re-trained on a Chinese corpus.6 We train the three methods with the Adam optimizer (Kingma and Ba, 2015) and an initial learning rate of 3e−4 for up to 20 epochs, using early stopping on the development data. Training is stopped when the accuracy on the development set does not improve within 5 epochs. We use beam search with 5 beams for inference.
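The warmup and decay behavior described above can be written as a small function. This is a sketch of the standard inverse square root schedule with the stated peak rate (3e−4) and warmup length (4k steps). Starting the linear warmup from zero is an assumption, since the initial warmup rate is not given.

```python
def inverse_sqrt_lr(step: int, peak_lr: float = 3e-4, warmup_steps: int = 4000) -> float:
    """Learning rate at a given update step (1-based).

    Linear warmup from 0 to peak_lr over warmup_steps, followed by
    decay proportional to the inverse square root of the step number.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    if step <= warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (warmup_steps / step) ** 0.5
```

Under these settings the rate peaks at step 4,000 and has halved by step 16,000, since sqrt(4000 / 16000) = 0.5.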
Metrics. As mentioned above, the CIP task can be treated as a sentence paraphrasing task. Therefore, we apply four metrics commonly used for sentence paraphrasing, namely BLEU (Papineni et al., 2002), BERTScore (Zhang et al., 2020), ROUGE-1, and ROUGE-2 (Lin, 2004). BLEU is a widely used machine translation metric that measures the lexical overlap between system outputs and human references (Papineni et al., 2002). BERTScore is chosen as another metric due to its high correlation with human judgments (Zhang et al., 2020). In contrast to BLEU, BERTScore is computed using token-wise cosine similarity between representations produced by BERT. We also measure overlaps between generated sentences and reference ones using ROUGE scores (Lin, 2004), which are often used to evaluate text summarization. ROUGE-1 and ROUGE-2 refer to the unigram and bigram overlaps between the system output and the reference, respectively.
Since generated paraphrases often only need to rewrite the idioms of the original sentence, evaluating the whole sentence cannot accurately reflect the quality of paraphrase generation. To better evaluate the quality of idiom paraphrasing, we also apply the above metrics to only the rewritten part of the generated paraphrases instead of the whole sentence. Specifically, given an original sentence c, a reference sentence t, and a generated paraphrase u, we first find the words common to all of c, t, and u, and remove these common words from t and u. We then evaluate the remaining words of t and u using the above metrics, denoted as BLEU-E, BERTScore-E, ROUGE1-E, and ROUGE2-E.
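To make the n-gram overlap concrete, here is a minimal ROUGE-N sketch over pre-tokenized sentences. Reporting the F1 variant is an assumption, since the text does not say whether recall, precision, or F1 is used, and a maintained ROUGE implementation would normally be preferred in practice.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate: List[str], reference: List[str], n: int) -> float:
    """ROUGE-N F1 between a candidate and a single reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

For example, two five-token sentences that differ in one token share four of five unigrams and two of four bigrams.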
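The filtering step behind the "-E" metrics can be sketched as follows. Treating "common words" as the set of word types shared by all three tokenized sentences is an assumption; a multiset or alignment-based variant would be an equally plausible reading. The function name is hypothetical.

```python
from typing import List, Tuple


def rewritten_parts(c: List[str], t: List[str], u: List[str]) -> Tuple[List[str], List[str]]:
    """Drop from t and u every word that occurs in all of c, t, and u.

    c: original sentence, t: reference, u: generated paraphrase
    (all pre-tokenized). The returned remainders are what the "-E"
    metrics are computed on.
    """
    common = set(c) & set(t) & set(u)
    t_rest = [w for w in t if w not in common]
    u_rest = [w for w in u if w not in common]
    return t_rest, u_rest
```

The two remainders can then be scored with the same BLEU, BERTScore, and ROUGE implementations used for whole sentences.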
Baselines. In this research, we adopt three Seq2Seq methods to handle the CIP task: LSTM-based, Transformer-based, and mT5-based models. We additionally provide two zero-shot methods that can address the CIP problem, namely, Re-translation and BERT-CLS.
(1) The LSTM-based Seq2Seq method is a basic Seq2Seq method, which uses an LSTM (long short-term memory [Hochreiter and Schmidhuber, 1997]) to convert a sentence to a dense, fixed-length vector representation. In contrast to vanilla form of RNNs, LSTM can handle long sequences, but it fails to maintain the global information of the sequences.
(2) The Transformer-based Seq2Seq method (Vaswani et al., 2017) is a state-of-the-art Seq2Seq method that has been widely adopted to process various NLP tasks, such as machine translation, abstractive summarization, etc. Transformer applies a self-attention mechanism that directly models the relationships among all words of an input sequence regardless of words’ positions. Unlike LSTM, Transformer handles the entire input sequence at once, rather than iterating words one by one.
(3) mT5 is a Seq2Seq method that uses the Transformer framework. Currently, most downstream NLP tasks build their models by fine-tuning pre-trained language models (Raffel et al., 2020). mT5 is a massively multilingual pre-trained language model that casts different downstream NLP problems into a unified “text-to-text” form. In this study, we fine-tune mT5 to handle the CIP task.
(4) Re-translation is implemented by utilizing the back-translation techniques of machine translation methods. We first translate an idiom-containing sentence to an English sentence using an efficient Chinese-English translation system, and then translate the generated English sentence using our trained English-Chinese translation system (introduced by Section 3.1) to generate a non-idiom-containing Chinese sentence. The Chinese-English translation system can be easily accessed online.7 The trained English-Chinese translation system is a transformer-based Seq2Seq method.
(5) BERT-CLS is an existing BERT-based Chinese lexical simplification method (Qiang et al., 2021). In this task, an idiom is treated as a complex word that will be replaced with a simpler word.
5.2 Performance of CIP Methods
Table 4 summarizes the evaluation results on our established CIP dataset using two types of metrics. The supervised CIP methods (LSTM, Transformer, mT5, and Infill) are significantly better than the two zero-shot methods (Re-translation and BERT) in terms of the four metrics. The results indicate that the dataset is a high-quality corpus that benefits the CIP task.
Table 4: Results on the in-domain test set.

Method | BLEU/BLEU-E | BERTS/BERTS-E | ROU1/ROU1-E | ROU2/ROU2-E
---|---|---|---|---
Re-translation | 27.37/2.67 | 78.73/64.43 | 57.01/24.46 | 31.93/5.32
BERT | 74.75/1.39 | 91.53/62.99 | 83.05/18.36 | 73.64/5.11
LSTM | 81.99/31.87 | 93.79/78.73 | 87.83/55.52 | 80.20/43.40
Transformer | 82.19/32.58 | 94.00/79.41 | 88.16/56.70 | 80.50/44.58
mT5 | 82.98/33.87 | 94.22/80.36 | 88.13/57.44 | 80.78/45.89
Infill | 83.55/(34.26±0.02) | 94.46/(80.94±0.05) | 88.68/(58.44±0.03) | 81.57/(47.96±0.03)
Table 4 and Table 5 show that the LSTM-based baseline is generally inferior to the other supervised methods on the in-domain and out-of-domain test sets. In general, the two mT5-based CIP methods (mT5 and Infill) outperform the other two methods (LSTM and Transformer), which suggests that fine-tuning mT5 can improve CIP performance. Infill yields the best results on the in-domain test set compared with the other CIP methods, which verifies that Infill is quite effective. On the out-of-domain test set, the BERT-based method achieves the best results on ROU1 and ROU2 but scores very low on ROU1-E and ROU2-E, because it makes only minor modifications to the source sentence. This indicates that the rewrite-focused metrics (BLEU-E, BERTS-E, ROU1-E, and ROU2-E) are more reasonable evaluation metrics for the CIP task. On these metrics, our proposed Infill method obtains the best out-of-domain scores on BLEU-E, BERTS-E, and ROU2-E, which makes it the best overall option for the CIP task on the out-of-domain test set.
Table 5: Results on the out-of-domain test set.

Method | BLEU/BLEU-E | BERTS/BERTS-E | ROU1/ROU1-E | ROU2/ROU2-E
---|---|---|---|---
Re-translation | 13.65/0.87 | 72.63/61.87 | 47.18/22.73 | 19.47/2.65
BERT | 84.95/1.63 | 94.79/63.07 | 89.84/19.95 | 85.10/5.99
LSTM | 81.20/7.81 | 93.50/67.03 | 87.66/31.08 | 81.69/14.14
Transformer | 80.14/8.07 | 93.55/67.36 | 87.63/31.62 | 81.56/14.48
mT5 | 84.76/9.29 | 94.58/68.05 | 89.20/31.82 | 83.76/14.64
Infill | 86.60/(10.68±0.03) | 94.98/(68.54±0.05) | 89.70/(31.10±0.03) | 84.89/(15.85±0.02)
Our proposed Infill method is superior to the baselines in several key ways. First, it is more efficient: because the infill-based method only needs to rephrase the idioms in the input sentence, it achieves better results in less time. Second, it is more robust, since it achieves the best overall results on both the in-domain and out-of-domain test sets. Finally, human evaluation (Section 5.3) supports its reliability and accuracy. Overall, our method represents a significant improvement over the existing baselines and is the best option among the compared methods for the CIP task.
5.3 Human Evaluation
To further evaluate the CIP methods, we conduct a human evaluation. We choose 184 sentences from the in-domain test set and 197 sentences from the out-of-domain test set. To verify the scalability and generalization ability of the CIP methods, we also report their performance on idioms that never appear in the training corpus. To this end, we choose 49 in-domain and 48 out-of-domain test sentences containing idioms that do not appear in the training set. We ask five native speakers to rate each generated sentence on three criteria: simplicity, meaning, and fluency. Each criterion is rated on a five-point Likert scale, and the scores are averaged per criterion. (1) Simplicity evaluates whether the paraphrased idioms in the generated sentences are easy to understand, that is, whether idioms in the original sentences have been rewritten with simpler and more common words. (2) Meaning assesses whether the generated sentences preserve the meaning of the original sentences. (3) Fluency judges whether a generated sentence is fluent and free of grammatical errors.
The results of the human evaluation are shown in Table 6. We also score the human-annotated sentences, denoted as Reference. The infill-based mT5 method (Infill) outperforms the other methods on both the in-domain and out-of-domain test sets, which shows that Infill is effective for the CIP task. These conclusions are consistent with the results of the automatic metrics. Compared with Reference, however, our method still has considerable room for improvement.
Method | In: Simp. | In: Meaning | In: Fluency | In: Avg | Out: Simp. | Out: Meaning | Out: Fluency | Out: Avg |
---|---|---|---|---|---|---|---|---|
Reference | 4.41 | 4.26 | 4.29 | 4.32 | 4.00 | 3.86 | 3.77 | 3.88 |
BERT | 3.28 | 2.24 | 2.58 | 2.70 | 2.98 | 2.06 | 2.26 | 2.44 |
LSTM | 3.58 | 3.33 | 3.27 | 3.39 | 3.26 | 3.02 | 2.89 | 3.06 |
Transformer | 3.48 | 3.29 | 3.22 | 3.33 | 3.16 | 3.01 | 2.89 | 3.02 |
mT5 | 3.85 | 3.57 | 3.62 | 3.68 | 3.23 | 3.02 | 2.97 | 3.07 |
Infill | 4.02 | 3.78 | 3.81 | 3.87 | 3.72 | 3.49 | 3.42 | 3.54 |
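As a quick sanity check on how the Avg column relates to the three criteria, the following minimal sketch recomputes the in-domain averages from the rounded scores in Table 6. It assumes that Avg is the unweighted mean of simplicity, meaning, and fluency; the names `IN_DOMAIN_SCORES` and `average_score` are illustrative and not part of the original study. Because the table reports rounded values, recomputed averages can differ from the reported ones in the last digit for some rows.

```python
from statistics import mean

# (simplicity, meaning, fluency) scores copied from the in-domain columns of Table 6.
IN_DOMAIN_SCORES = {
    "Reference": (4.41, 4.26, 4.29),
    "BERT": (3.28, 2.24, 2.58),
    "LSTM": (3.58, 3.33, 3.27),
    "Transformer": (3.48, 3.29, 3.22),
    "mT5": (3.85, 3.57, 3.62),
    "Infill": (4.02, 3.78, 3.81),
}


def average_score(scores):
    """Unweighted mean of the three criterion scores, rounded to two decimals.

    Assumption: the Avg column is a plain arithmetic mean of the criteria.
    """
    return round(mean(scores), 2)


if __name__ == "__main__":
    for method, scores in IN_DOMAIN_SCORES.items():
        print(f"{method}: {average_score(scores):.2f}")
```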
Additionally, we calculated the inter-annotator agreement among the annotators. Specifically, we computed Fleiss’ Kappa (Fleiss, 1971) scores for each test set. The scores are 0.199 and 0.097 on the in-domain and out-of-domain test sets, respectively. Both values are low, which indicates that evaluating Chinese idiom paraphrasing is relatively subjective and that the raters reached only slight agreement. We acknowledge the limitations of human judgment in evaluating the quality of paraphrasing and believe that the diversity of opinions among raters offers a valuable insight into the complexity of the CIP task.
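For readers unfamiliar with the statistic, Fleiss' Kappa compares the mean observed per-item agreement $\bar{P}$ with the agreement expected by chance $\bar{P}_e$, as $\kappa = (\bar{P} - \bar{P}_e) / (1 - \bar{P}_e)$. The sketch below is a minimal pure-Python implementation of that formula, not the script used in this study. The function name `fleiss_kappa`, the input layout (one list of labels per rated sentence), and the toy ratings are assumptions made for illustration.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' Kappa for a fixed number of raters per item.

    ratings:    list of items; each item is a list with one label per rater.
    categories: all labels that raters may assign (e.g. Likert points 1..5).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    # counts[i][k] = number of raters who assigned category k to item i.
    counts = [Counter(item) for item in ratings]

    # Observed agreement per item, then averaged over items.
    p_items = [
        (sum(c[k] ** 2 for k in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the overall proportion of each category.
    p_cat = [sum(c[k] for c in counts) / (n_items * n_raters) for k in categories]
    p_exp = sum(p ** 2 for p in p_cat)

    if p_exp == 1.0:
        # Only one category was ever used; kappa is undefined in this case.
        return float("nan")
    return (p_bar - p_exp) / (1.0 - p_exp)


if __name__ == "__main__":
    # Made-up ratings: 4 sentences, 5 raters each, five-point Likert scale.
    toy = [
        [4, 4, 5, 3, 4],
        [2, 3, 3, 2, 4],
        [5, 5, 4, 5, 5],
        [3, 2, 4, 3, 3],
    ]
    print(round(fleiss_kappa(toy, categories=range(1, 6)), 3))
```

The function treats the Likert points as unordered categories, which is how Fleiss' Kappa is defined. Near-misses (for example, 4 versus 5) therefore count as full disagreement, which is one reason agreement on fine-grained rating scales tends to look low.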
5.4 Proportion of Idiom Paraphrasing
CIP aims to rephrase an input idiom-containing sentence into a meaning-preserving sentence that contains no idioms. In this subsection, we measure the proportion of idiom-containing sentences that are rephrased into non-idiom-containing sentences. The results are shown in Table 7. Re-translation achieves the highest proportion and rephrases almost all idioms into non-idiomatic expressions. This indicates that our idea of constructing the CIP dataset with machine translation is feasible. In theory, if the trained English-Chinese machine translation method (Stage 2 in the pipeline) could output high-quality results, we would not need to ask annotators to revise the Chinese translations. The proportions of the CIP methods (LSTM, Transformer, mT5, and Infill) are close to 90% on the in-domain test set (85.64% to 87.98%) and lower on the out-of-domain test set (72.07% to 85.13%), which means these methods have great potential for idiom paraphrasing. Moreover, a small number of idioms are not rephrased, because some idioms are simple and are therefore retained in the training set.
Method | In | Out |
---|---|---|
Re-translation | 99.82% | 99.86% |
BERT | 78.18% | 69.21% |
LSTM | 85.64% | 84.65% |
Transformer | 87.26% | 85.13% |
mT5 | 87.54% | 72.07% |
Infill | 87.98% | 81.29% |
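The proportion reported in Table 7 can be sketched as a simple count. The snippet below is a minimal illustration under one stated assumption: a sentence counts as rephrased when the source idiom no longer appears verbatim in the generated output. The exact matching rule used in this study is not described here, and the function name `paraphrase_proportion` and the toy sentences are invented for illustration.

```python
def paraphrase_proportion(examples):
    """Fraction of outputs in which the source idiom no longer appears.

    examples: iterable of (idiom, generated_sentence) pairs.
    Assumption: verbatim substring matching decides whether an idiom was kept.
    """
    examples = list(examples)
    if not examples:
        raise ValueError("at least one example is required")
    rewritten = sum(1 for idiom, output in examples if idiom not in output)
    return rewritten / len(examples)


if __name__ == "__main__":
    # Toy outputs (not taken from the dataset) for the idiom from Figure 1.
    toy_examples = [
        ("鱼肉百姓", "他们压迫人民。"),    # idiom replaced by a plain paraphrase
        ("鱼肉百姓", "他们鱼肉百姓。"),    # idiom retained unchanged
    ]
    print(f"{paraphrase_proportion(toy_examples):.2%}")
```

A count of this kind says nothing about whether the replacement preserves the meaning, which is why a high proportion, such as that of Re-translation, does not by itself imply high paraphrase quality.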
5.5 Case Study
The in-domain test set contains idioms that never appear in the training set. We show the paraphrasing results of different CIP methods on such idioms in Table 8.
Our method consistently outperforms all other approaches in the case study. The LSTM-based and Transformer-based methods tend to retain the idioms or output parts of them, because they cannot learn any knowledge of these idioms from the training corpus. In contrast, the mT5-based and Infill-based methods, which both build on the pretrained language model mT5, can generate correct interpretations for some of the idioms, because mT5 has already acquired knowledge of them during pretraining. The mT5-based method generates an entirely new sentence from the original sentence, which can introduce incorrect interpretations. The Infill-based method rephrases only the idioms within the sentence based on their context, which produces higher-quality interpretations than the mT5-based method.
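The difference between regenerating the whole sentence and rephrasing only the idiom can be illustrated with a small text-infilling sketch. The code below assumes the T5-style sentinel token `<extra_id_0>` as the placeholder for the idiom span; the exact input template of the Infill method is not reproduced here, the model call is omitted, and the helper names `build_infill_input` and `apply_infill` as well as the example sentence are hypothetical.

```python
SENTINEL = "<extra_id_0>"  # T5/mT5-style sentinel; assumed placeholder format


def build_infill_input(sentence, idiom):
    """Replace the first occurrence of the idiom with a sentinel token."""
    if idiom not in sentence:
        raise ValueError("idiom does not occur in the sentence")
    return sentence.replace(idiom, SENTINEL, 1)


def apply_infill(masked_sentence, filler):
    """Insert the generated span back into the untouched context."""
    if SENTINEL not in masked_sentence:
        raise ValueError("masked sentence contains no sentinel token")
    return masked_sentence.replace(SENTINEL, filler, 1)


if __name__ == "__main__":
    source = "他们鱼肉百姓。"  # toy sentence built around the idiom from Figure 1
    masked = build_infill_input(source, "鱼肉百姓")
    # In a real system a fine-tuned model would generate the filler span;
    # here it is hard-coded to keep the sketch self-contained.
    print(apply_infill(masked, "压迫人民"))
```

Because every token outside the sentinel is copied from the source, errors can only arise inside the generated span, which matches the observation above that whole-sentence generation has more opportunities to introduce incorrect interpretations.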
5.6 The Translations of Chinese Idioms
Idioms are difficult not only for people to understand but also, to an even greater degree, for Chinese NLP applications. Here, we use Chinese-English machine translation as an example NLP application to evaluate the usefulness of CIP methods. Given an input sentence containing an idiom, we first use our CIP method as a preprocessing step to rephrase the sentence, and then translate the paraphrased version into English.
Table 9 gives some examples that compare the resulting translations. Because many idioms cannot be translated by their literal meaning, our method helps by identifying and paraphrasing these idioms, which makes the sentences easier for the machine translation system to process.
6 Conclusion
In this paper, we propose a novel Chinese idiom paraphrasing (CIP) task, which aims to rephrase sentences containing idioms into non-idiomatic versions. The CIP task can be treated as a special case of paraphrase generation and can be addressed with Seq2Seq modeling. We construct a large-scale training dataset for CIP through human-machine collaboration. Specifically, we first design a framework to construct a pseudo-CIP dataset and then ask workers to revise and evaluate the dataset. We deploy three Seq2Seq methods and propose one novel method (Infill) for the CIP task. Experimental results show that our proposed method, trained on our dataset, yields good results. This could have a positive impact on the performance of machine translation systems, as well as other natural language processing applications that involve Chinese idioms. In subsequent research, our proposed method can serve as a strong baseline, and the established dataset can be used to accelerate the study of this topic.
Acknowledgments
This research is partially supported by the National Natural Science Foundation of China under grants 62076217, 62120106008, and 61906060, and the Blue Project of Yangzhou University.
Notes
translate.google.com. Accessed in: 2022-12-01.
https://translate.google.com/. Accessed in: 2022-12-01.
References
Author notes
Action Editor: Minlie Huang