ABSTRACT
Low-resource text plagiarism detection faces a significant challenge due to the limited availability of labeled data for training. This task requires the development of sophisticated algorithms capable of identifying similarities and differences in texts, particularly in the realm of semantic rewriting and translation-based plagiarism detection. In this paper, we present an enhanced attentive Siamese Long Short-Term Memory (LSTM) network designed for Tibetan-Chinese plagiarism detection. Our approach begins with the introduction of translation-based data augmentation, aimed at expanding the bilingual training dataset. Subsequently, we propose a pre-detection method leveraging abstract document vectors to enhance detection efficiency. Finally, we introduce an improved attentive Siamese LSTM network tailored for Tibetan-Chinese plagiarism detection. We conduct comprehensive experiments to showcase the effectiveness of our proposed plagiarism detection framework.
1. INTRODUCTION
In recent years, the media has highlighted numerous cases of plagiarism, casting a negative light on the research community. According to [1], 16% of retractions were attributed to plagiarism. Gupta and Rosso's analysis of papers from the Association for Computational Linguistics (ACL) proceedings revealed an increase in verbatim plagiarism from 5.11% to 9.67% between 2008 and 2012 [2]. Additionally, [3] observed that plagiarism accounted for 22.7% of retractions in obstetrics and gynecology.
To counter academic fraud and plagiarism, various institutions have implemented monitoring systems and technical methods. Notably, ACM and IEEE have established policies to foster a positive academic environment. Turnitin, for instance, has created a dedicated anti-plagiarism website that compiles cases, detection methods, policies, and regulations to prevent academic paper plagiarism.
While progress has been made, academic misconduct remains challenging to eradicate due to the low barriers to accessing diverse data. This challenge is particularly evident in two aspects. Firstly, as the volume of data continues to grow rapidly, traditional information retrieval-based plagiarism detection methods are becoming increasingly time-consuming. Secondly, the ease of accessing cross-lingual information has led to more cases of thesis plagiarism being translated from other languages. Therefore, effective detection methods are essential to address this issue.
Cross-lingual text plagiarism detection, which involves identifying plagiarism in texts written in different languages, poses a significant challenge, especially for low-resource languages. In this paper, we focus on studying cross-lingual Tibetan-Chinese plagiarism detection, specifically targeting semantic rewriting and translation-based plagiarism. We employ a corpus expansion method based on data augmentation for low-resource languages. Additionally, we propose a pre-detection method to achieve a coarse-grained detection. Finally, we introduce an improved attentive Siamese Long Short-Term Memory (LSTM) model for text plagiarism detection.
2. RELATED WORK
2.1 Plagiarism Detection Methods
Research on plagiarism detection encompasses various aspects, necessitating diverse modules. In the pre-detection phase, document retrieval plays a pivotal role. Detecting plagiarized content involves measures such as sentence semantic similarity, paraphrase detection, and document similarity. Plagiarism detection methods typically fall into three categories: content-based, structure-based, and citations and references-based.
Content-based methods are the most commonly employed in text plagiarism detection. These include bag-of-words, N-gram, fingerprints, LCS (Longest Common Subsequence), and fusion methods. The bag-of-words model treats a document as a collection of words, disregarding word order, grammar, and other linguistic factors. Often combined with the vector space model in the document retrieval step, it is widely utilized in evaluations such as CLEF-PAN, as seen in studies like [4, 5, 6, 7, 8, 9]. N-gram, another prevalent method, analyzes sequences of n consecutive words: the more N-grams a suspicious document shares with a source document, the more similar the two are judged to be. Building upon N-gram, researchers such as Torrejón and Stamatatos have introduced variations such as skip N-gram [10] and stop-word N-gram [11]. Fingerprinting divides a document into chunks and hashes them, for example with the MD5 algorithm, as demonstrated in the CLEF-PAN 2010 winning system [12]. The LCS algorithm calculates document similarity by determining the longest common subsequence between two texts. A minimal sketch of the N-gram and LCS measures follows below.
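To make these measures concrete, here is a minimal sketch (our own illustration, not a system from the cited work) of word N-gram containment and LCS-based similarity:

```python
def ngrams(tokens, n=3):
    """Set of word n-grams occurring in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_containment(suspicious, source, n=3):
    """Share of the suspicious document's n-grams also found in the source."""
    a, b = ngrams(suspicious.split(), n), ngrams(source.split(), n)
    return len(a & b) / len(a) if a else 0.0

def lcs_length(a, b):
    """Longest common subsequence length via dynamic programming."""
    dp = [0] * (len(b) + 1)
    for x in a:
        prev = 0
        for j, y in enumerate(b, 1):
            cur = dp[j]
            dp[j] = prev + 1 if x == y else max(dp[j], dp[j - 1])
            prev = cur
    return dp[len(b)]

def lcs_similarity(doc_a, doc_b):
    """LCS length normalized by the shorter document's token count."""
    a, b = doc_a.split(), doc_b.split()
    return lcs_length(a, b) / min(len(a), len(b)) if a and b else 0.0
```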
Structure-based plagiarism detection relies on the inherent framework and organization of academic papers, encompassing elements such as title, authors, institutions, abstract, chapter titles, chapters, and references. Chow and Rahman (2009) [13] introduce a three-level tree structure representation for documents: document, page, and chapter. This representation allows for document-level classification and the extraction of more specific features from pages and chapters. Zhang and Chow (2011) [14] propose a multi-level matching method, assigning weight parameters to document and paragraph levels. They then utilize principal component analysis to map the high-dimensional histogram to the underlying semantic space. Alzahrani (2012) [15] suggests that specific sections, such as methods, conclusions, and discussions, are more crucial for plagiarism detection than sections like introduction, acknowledgments, and copyrights. To implement this insight, he employs IGF, Spread, and Depth methods to assign weights to different sections of the document structure. Alzahrani further incorporates an analysis of document citations, which proves beneficial for document retrieval and classification when combined with the document structure.
Citations and references-based plagiarism detection, proposed by Gipp (2010) [16], addresses a limitation of character-matching methods, which struggle to identify semantic rewriting and translation plagiarism; instead, it exploits the similarity in the relative positions of citations within plagiarized passages. The language-independent nature of citation-based features, like that of fingerprints, makes them particularly robust. Gipp further introduces methods such as citation order analysis, greedy citation tiling, citation chunking, and the longest common citation sequence in [17].
While content-based methods excel in detecting continuous text plagiarism, they often fall short in identifying semantic rewriting and translation plagiarism. Combining structure-based plagiarism detection methods with content-based methods has proven effective in enhancing detection efficiency. Citations and references-based plagiarism detection methods can be employed when a paper has correct and complete citations. Experimental results consistently demonstrate the effectiveness of integrating content-based, structure-based, and citations and references-based methods for comprehensive plagiarism detection.
In recent years, deep neural networks (DNNs) have achieved remarkable success and found applications in various natural language processing tasks, including machine translation and question-answering systems. DNNs are increasingly being applied in text plagiarism detection, text similarity calculation, and sentence rewriting detection. Notable contributions include He (2015) [18], Tai (2015) [19], Kiros (2015) [20], and Mueller (2016) [21], who proposed multi-perspective convolutional neural networks, Tree-LSTM, Skip-Thought, and Siamese LSTM networks, respectively, for calculating sentence similarity.
2.2 Low-Resource Plagiarism Detection
In the realm of low-resource language plagiarism detection, noteworthy exploratory work has been undertaken. Karzan Wakil (2017) [22] conducted research on Kurdish plagiarism detection, employing an N-gram model to identify plagiarized elements at the levels of words, phrases, and paragraphs. Similarly, Sara Sameen (2017) [23] utilized a diverse set of machine learning techniques, including Naive Bayes, Support Vector Machine, J48, and Random Forest, to detect both continuous text plagiarism and rewriting in Urdu. Fingerprinting methods were applied in [24] and [25] for the detection of plagiarism in Russian and Indonesian, respectively.
In contrast, research on cross-language text plagiarism detection, particularly in the domain of translation plagiarism, is relatively scarce. Existing studies predominantly concentrate on cross-language text semantic similarity calculation, spanning language pairs such as English-German, English-French, English-Spanish, English-Arabic, and English-Haitian. Some cross-language text plagiarism detection research incorporates methods from cross-language information retrieval, multilingual text classification, and cross-lingual text similarity calculation. Generally, four methods have been employed in cross-language text plagiarism detection: grammar-based [26], dictionary-based [2], parallel/comparable corpus-based [27], and semantic-based [28, 29]. Machine translation-based methods also find application in this context.
Presently, limited research exists on low-resource language plagiarism detection, presenting challenges related to the acquisition of training data and the effectiveness of few-shot detection. This paper aims to address these challenges by proposing an improved attentive Siamese LSTM model tailored for detecting semantic rewriting and translation-based plagiarism. Additionally, we introduce a corpus expansion method based on data augmentation, specifically designed to enhance performance in low-resource language scenarios.
3. THE PROPOSED APPROACH
This paper focuses on cross-lingual Tibetan-Chinese text plagiarism detection, as illustrated in Figure 1. The system architecture comprises three main components: data augmentation, pre-detection module, and LSTM-based plagiarism detection.
Firstly, we employ a translation-based data augmentation method to expand the cross-lingual Tibetan-Chinese corpus. This step enhances the availability of training data for our plagiarism detection system.
Following that, a pre-detection module, utilizing abstract document vectors, is applied to effectively handle copy-and-paste plagiarism. This module facilitates coarse-grained plagiarism detection.
The final component leverages an improved attentive Siamese LSTM network for the detection of semantic rewriting and translation-based plagiarism. This enhances the system's capability to identify these intricate forms of plagiarism.
3.1 Translation-based Data Augmentation
To address the scarcity of Tibetan-Chinese plagiarism data, we employ a translation-based data augmentation method to expand the Tibetan-Chinese corpus. The effectiveness of synthetic data in enhancing performance across various applications has been demonstrated in previous studies [30, 31].
For this study, we utilize two datasets: CWMT and SICK from SemEval2014. The CWMT corpus comprises 146,000 Tibetan-Chinese sentence pairs. As an external dataset, SICK consists of 10,000 English sentence pairs, each manually labeled with a sentence similarity value ranging from 0 to 5. In this labeling scheme, a similarity value of 0 indicates completely different semantics between the two sentences, while a value of 5 denotes identical content in the two sentences.
The translation-based data augmentation process involves the following steps:
Similarity Model Training: Initially, a similarity model is trained using an attentive Siamese LSTM model on the SICK corpus. The details of the model structure will be discussed in Section 3.3.
Sentence Regrouping: Subsequently, we regroup the English sentence pairs by randomly selecting two sentences from the original set of 10,000 pairs. The similarity value for each pair is then calculated using the trained model.
Translation: The selected English sentence pairs are translated into Chinese and Tibetan using machine translation models [32, 33]. This results in a new cross-lingual Tibetan-Chinese corpus, denoted as SICKcn-tib. Since tib1 and cn1 (and likewise tib2 and cn2) are translations of the same English sentence, sim(tib1, cn1) = sim(tib2, cn2) = 5. For each English sentence pair (en1, en2) labeled with a similarity score of s, the cross pairs inherit the same score: sim(tib1, cn2) = sim(tib2, cn1) = s, where sim(x, y) denotes the similarity score of the sentence pair x and y.
In total, this data augmentation procedure yielded 217,975 Tibetan-Chinese sentence pairs. Table 1 presents samples of the generated pairs; a minimal sketch of the procedure follows below.
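The sketch below illustrates the regroup-score-translate steps under stated assumptions: translate_cn and translate_tib are hypothetical stand-ins for the MT models of [32, 33], and similarity_model for the attentive Siamese LSTM of Section 3.3.

```python
import random

def augment(sick_pairs, translate_cn, translate_tib, similarity_model, seed=0):
    """Sketch of translation-based augmentation.

    sick_pairs: list of (en1, en2, score) triples from SICK.
    translate_cn / translate_tib: English->Chinese / English->Tibetan MT
        functions (stand-ins for the models of [32, 33]).
    similarity_model: the attentive Siamese LSTM trained on SICK,
        used to score regrouped pairs (hypothetical .predict interface).
    """
    rng = random.Random(seed)
    sentences = [s for pair in sick_pairs for s in pair[:2]]
    augmented = []
    for _ in range(len(sick_pairs)):
        en1, en2 = rng.sample(sentences, 2)        # sentence regrouping
        s = similarity_model.predict(en1, en2)     # model-assigned score
        cn1, cn2 = translate_cn(en1), translate_cn(en2)
        tib1, tib2 = translate_tib(en1), translate_tib(en2)
        # Direct translations of the same sentence share full similarity (5);
        # cross pairs inherit the English score s.
        augmented += [(tib1, cn1, 5.0), (tib2, cn2, 5.0),
                      (tib1, cn2, s), (tib2, cn1, s)]
    return augmented
```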
Table 1. Samples of the generated Tibetan-Chinese sentence pairs.

| Sentence pair | Similarity |
|---|---|
| Tib1: ཁོས་བློ་སེམ་གཅིག་ཏུ་བསྡུས་ནས་གསེར་མདོག་གི་ཡུལ་ལྗོངས་འདིར་བལྟས་འདུག (En: He stared with all his eyes at the golden scene.) Cn1: 他全神注视着这片金黄色的景色。 (En: He stared with all his eyes at the golden scene.) | 5 |
| Tib2: ཁོས་བློ་སྤོབས་ཀྱིས་ཚང་མའི་བརྩི་བཀུརཐོབ། (En: He won everyone's respect with courage.) Cn2: 他以勇气赢得大家的尊敬。 (En: He won everyone's respect with courage.) | 5 |
| Cn1: 他全神注视着这片金黄色的景色。 (En: He stared with all his eyes at the golden scene.) Cn2: 他以勇气赢得大家的尊敬。 (En: He won everyone's respect with courage.) | 1.6 |
| Tib1: ཁོས་བློ་སེམ་གཅིག་ཏུ་བསྡུས་ནས་གསེར་མདོག་གི་ཡུལ་ལྗོངས་འདིར་བལྟས་འདུག (En: He stared with all his eyes at the golden scene.) Tib2: ཁོས་བློ་སྤོབས་ཀྱིས་ཚང་མའི་བརྩི་བཀུརཐོབ། (En: He won everyone's respect with courage.) | 1.6 |
| Tib2: ཁོས་བློ་སྤོབས་ཀྱིས་ཚང་མའི་བརྩི་བཀུརཐོབ། (En: He won everyone's respect with courage.) Cn1: 他全神注视着这片金黄色的景色。 (En: He stared with all his eyes at the golden scene.) | 1.6 |
| Tib1: ཁོས་བློ་སེམ་གཅིག་ཏུ་བསྡུས་ནས་གསེར་མདོག་གི་ཡུལ་ལྗོངས་འདིར་བལྟས་འདུག (En: He stared with all his eyes at the golden scene.) Cn2: 他以勇气赢得大家的尊敬。 (En: He won everyone's respect with courage.) | 1.6 |
3.2 Pre-Detection based on Abstract Document Vector
The abstract of an academic paper serves as a succinct summary, encapsulating the core of the research. Drawing inspiration from the observation that significantly different abstracts indicate low likelihood of plagiarism, this module aims to assess the correlation between a source document (dsrc) and a potential plagiarism candidate document (dplg) based on their abstracts. This assessment serves as a preliminary step to determine the need for further plagiarism detection measures.
To achieve this, we train document vectors for abstracts using the Doc2vec algorithm. Doc2vec extends word2vec, which learns a neural language model over word contexts; Doc2vec additionally feeds a paragraph vector into the input layer, so that training yields vector representations of paragraphs as well as words. Abstracts from 90 Chinese papers and 60 Tibetan papers are used for this purpose. The document vectors are trained separately for Chinese and Tibetan papers, with both the document-vector and word-vector dimensions set to 300 and the window length set to 9. After extracting the abstract document vectors, we calculate the cosine similarity between the source document and the candidate document. To evaluate this similarity, native speakers manually annotate the similarity between the abstracts of the 90 Chinese papers and the 60 Tibetan papers.
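A minimal sketch of this step using gensim's Doc2Vec: the 300-dimension and window-9 settings are from above, while other hyper-parameters (such as the epoch count) are illustrative assumptions.

```python
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def train_abstract_vectors(abstracts):
    """Train 300-dim document vectors (window 9) on tokenized abstracts.
    `abstracts` is a list of token lists, one per paper abstract."""
    corpus = [TaggedDocument(words=tokens, tags=[i])
              for i, tokens in enumerate(abstracts)]
    return Doc2Vec(corpus, vector_size=300, window=9, min_count=1, epochs=40)

def abstract_similarity(model, tokens_src, tokens_plg):
    """Cosine similarity between two inferred abstract vectors."""
    v1, v2 = model.infer_vector(tokens_src), model.infer_vector(tokens_plg)
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
```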
We calculate the Pearson correlation coefficients (PCC) between the manual annotations and the experimental results. The PCC is 0.6367 for Tibetan papers and 0.8706 for Chinese papers, indicating a strong and an extremely strong correlation, respectively, between the experimental results and the manual annotations. The PCC for Tibetan-Chinese paper pairs is 0.539, a moderate correlation. The pre-detection method based on abstract document vectors therefore effectively identifies similarities between Tibetan and Chinese papers.
Based on these results, we set the threshold of the pre-detection module to 0.5. When the abstract similarity exceeds 0.5, the two papers are suspected of plagiarism and full-text detection is performed; otherwise, the pair is considered free of suspicion and full-text detection is skipped.
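The pre-detection gate then reduces to a simple threshold test; a sketch reusing the abstract_similarity helper above:

```python
def pre_detect(model, src_abstract, candidate_abstracts, threshold=0.5):
    """Keep only candidates whose abstract similarity to the source exceeds
    the threshold; all other pairs skip full-text plagiarism detection."""
    return [cand for cand in candidate_abstracts
            if abstract_similarity(model, src_abstract, cand) > threshold]
```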
3.3 Improved Attentive Siamese LSTM Network for Plagiarism Detection
Many plagiarism detection tools available online are proficient at identifying continuous text plagiarism, but they often struggle with more nuanced forms such as semantic rewriting and translation-based plagiarism. To address this limitation, we introduce an enhanced attentive Siamese LSTM model designed to detect Tibetan semantic rewriting and Tibetan-Chinese translation plagiarism.
The Siamese LSTM network, initially proposed by Mueller [21], serves as the backbone for computing semantic similarity between texts. It features two parallel LSTM sub-networks that share weights. While Mueller's original approach uses only the last hidden state of each sentence for the similarity calculation, our model adds an attention layer, allowing it to exploit all hidden states and better discern subtle nuances in the text. The architecture of our improved attentive Siamese LSTM model is shown in Figure 2.
The model comprises five parts: an input layer for sentence pairs (facilitating both monolingual and cross-lingual scenarios), an embedded layer dedicated to representing the input as vectors, hidden layers adept at extracting semantic information, an attention layer responsible for generating weight vectors, and an output layer tasked with producing the similarity value between the two sentences. Setting it apart from conventional plagiarism detection methods, our proposed attentive Siamese LSTM model operates with sentence pairs and word vectors as inputs, thus negating the need for prior knowledge or manual feature engineering. This not only streamlines the process but also enhances the model's adaptability to varied linguistic contexts and intricate instances of plagiarism.
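A condensed PyTorch sketch of this architecture, reflecting our reading of the five parts above: layer sizes are illustrative, the embedding layer would be initialized with monolingual or MUSE cross-lingual word vectors, and the output rescales Mueller's exp(-||h1-h2||1) Manhattan similarity to the 0-5 label range (the rescaling is our assumption).

```python
import torch
import torch.nn as nn

class AttentiveSiameseLSTM(nn.Module):
    """Sketch: shared-weight LSTM encoders with an attention layer that
    pools all hidden states, instead of using only the last state."""

    def __init__(self, vocab_size, embed_dim=300, hidden_dim=50):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.attn = nn.Linear(hidden_dim, 1)  # produces the weight vector

    def encode(self, tokens):
        h, _ = self.lstm(self.embed(tokens))           # (B, T, H)
        weights = torch.softmax(self.attn(h), dim=1)   # (B, T, 1)
        return (weights * h).sum(dim=1)                # attention-pooled (B, H)

    def forward(self, sent1, sent2):
        v1, v2 = self.encode(sent1), self.encode(sent2)
        # Manhattan similarity as in Mueller's MaLSTM, rescaled to [0, 5].
        return 5.0 * torch.exp(-torch.sum(torch.abs(v1 - v2), dim=1))
```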
4. EXPERIMENTS AND ANALYSIS
In this section, we conduct a series of experiments and use the following metrics to gauge the correlation between our system's output and the manual annotation:

Pearson Correlation Coefficient (PCC): the ratio of the covariance between two variables to the product of their standard deviations,

$$\rho = \frac{\operatorname{cov}(sim_{out}, sim_{label})}{\sigma_{sim_{out}}\,\sigma_{sim_{label}}}$$

where $sim_{out}$ and $sim_{label}$ represent the similarity output by the plagiarism detection model and the manual annotation, respectively.

Mean Square Error (MSE):

$$MSE = \frac{1}{n}\sum_{i=1}^{n}\left(sim_{out,i} - sim_{label,i}\right)^{2}$$

where $n$ denotes the number of data instances.

Spearman Correlation Coefficient (the Pearson coefficient applied to rank variables):

$$\rho_{s} = \frac{\sum_{i=1}^{n}\left(sim_{out,i}-\overline{sim}_{out}\right)\left(sim_{label,i}-\overline{sim}_{label}\right)}{\sqrt{\sum_{i=1}^{n}\left(sim_{out,i}-\overline{sim}_{out}\right)^{2}}\,\sqrt{\sum_{i=1}^{n}\left(sim_{label,i}-\overline{sim}_{label}\right)^{2}}}$$

where $\overline{sim}_{out}$ and $\overline{sim}_{label}$ denote the mean of $sim_{out}$ and $sim_{label}$, respectively.

We interpret correlation coefficient values on the following scale:
Extremely strong relevance: [0.8, 1.0]
Strong relevance: [0.6, 0.8]
Moderate relevance: [0.4, 0.6]
Weak relevance: [0.2, 0.4]
Very weak relevance or not relevant: [0, 0.2]
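All three metrics are standard and can be computed directly with NumPy and SciPy, for instance:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def evaluate(sim_out, sim_label):
    """Compute PCC, MSE, and Spearman's coefficient between model output
    and manual annotation."""
    sim_out, sim_label = np.asarray(sim_out), np.asarray(sim_label)
    rho, _ = pearsonr(sim_out, sim_label)
    mse = float(np.mean((sim_out - sim_label) ** 2))
    rho_s, _ = spearmanr(sim_out, sim_label)
    return rho, mse, rho_s
```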
4.1 Tibetan Semantic Rewriting Plagiarism Detection
In this section, we conducted comparative experiments to evaluate performance with Tibetan word vectors and syllable vectors.

The Tibetan word vectors available online were trained by Facebook on a small-scale corpus. In contrast, we trained large-scale Tibetan word vectors with FastText on the corpus collected for this paper. To compare word vectors of different scales, we conducted the experiments shown in Table 2. The results indicate the following. (1) Word-vector scale affects performance: Tibetan (this paper) outperforms Tibetan (Wiki), with a ρ value 0.214 higher, and the Chinese model, trained on the largest corpus, attains the lowest MSE. (2) The domain of the word-vector training corpus also matters: Chinese (Wiki) performs on par with Tibetan (this paper) despite its much larger training corpus. We believe this is because the training data for Tibetan (this paper) include news, academic papers, and other texts closer to the domain of the test set.
Table 2. Results with word vectors trained on corpora of different scales.

| Model | Training size | ρ | MSE | ρs |
|---|---|---|---|---|
| Chinese (Wiki) | 332,647 | 0.5605 | 0.9235 | 0.5151 |
| Tibetan (Wiki) | 12,651 | 0.3562 | 1.3714 | 0.3339 |
| Tibetan (this paper) | 102,054 | 0.5702 | 1.0086 | 0.5532 |
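For reference, the large-scale Tibetan word vectors labeled "this paper" in Table 2 could be trained with gensim's FastText implementation along these lines; the hyper-parameters here, including the 300 dimensions, are illustrative assumptions rather than the paper's exact settings.

```python
from gensim.models import FastText

def train_tibetan_word_vectors(sentences, dim=300):
    """Train FastText vectors on tokenized Tibetan sentences (a list of
    token lists from the collected news/academic corpus)."""
    return FastText(sentences, vector_size=dim, window=5,
                    min_count=2, epochs=10, sg=1)  # skip-gram variant
```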
Tibetan is an agglutinative language with unique word-formation rules and rich morphology: it contains many bound (adhesive) words and abundant case particles and function words. These properties cause ambiguity and make unknown words difficult to identify during Tibetan word segmentation. Syllable segmentation is therefore commonly used for Tibetan unknown-word segmentation and POS tagging. In our experiments, we also trained Tibetan syllable vectors; the results are presented in Table 3. Tibetan syllable vectors achieve better performance, and the AttSiaLSTM model outperforms the SiaLSTM model. With syllable vectors, AttSiaLSTM reaches a ρ of 0.678, significantly higher than the other models, and its MSE and ρs also outperform the other configurations.
Table 3. Results of SiaLSTM and AttSiaLSTM with Tibetan word and syllable vectors.

| Model | Method | ρ | MSE | ρs |
|---|---|---|---|---|
| SiaLSTM | Tibetan word vector | 0.4985 | 1.2512 | 0.4877 |
| SiaLSTM | Tibetan syllable vector | 0.5691 | 1.0152 | 0.5489 |
| AttSiaLSTM | Tibetan word vector | 0.5702 | 1.0086 | 0.5532 |
| AttSiaLSTM | Tibetan syllable vector | 0.6780 | 0.8329 | 0.6623 |
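Because Tibetan syllables are delimited by the tsheg mark (་, U+0F0B), the syllable segmentation behind the syllable vectors above can be sketched in a few lines:

```python
def tibetan_syllables(sentence):
    """Split a Tibetan sentence into syllables at the tsheg (་),
    dropping the sentence-final shad (།) and empty fragments."""
    return [syl for syl in sentence.replace('།', '').split('་') if syl]

# e.g. tibetan_syllables('སྐྱེས་པ་ཞིག་གིས་རྣོ་སྦྲེང་དཀྲོལ་བཞིན་ཡོད་།')
# -> ['སྐྱེས', 'པ', 'ཞིག', 'གིས', 'རྣོ', 'སྦྲེང', 'དཀྲོལ', 'བཞིན', 'ཡོད']
```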
To compare Tibetan syllable vectors against Tibetan word vectors, we extracted several scored sentence pairs from the Tibetan plagiarism detection model, shown in Table 4. The outputs based on syllable vectors align more closely with the manual annotations, suggesting that the AttSiaLSTM model with Tibetan syllable vectors outperforms its word-vector counterpart.
Table 4. Sample outputs of the AttSiaLSTM model with Tibetan syllable vectors (out_syl) and word vectors (out_word).

| Sentence pair | Manual similarity | out_syl | out_word |
|---|---|---|---|
| Tib1: སྐྱེས་པ་ཞིག་གིས་སྒེའུ་ཁུང་གི་འོག་ཏུ་པི་ཝང་རྡུང་བཞིན་ཡོད་། (En: A man plays guitar under the window) Tib2: སྐྱེས་པ་ཞིག་གིས་རྣོ་སྦྲེང་དཀྲོལ་བཞིན་ཡོད་། (En: A man plays guitar) | 2.4 | 2.4287 | 2.6786 |
| Tib1: མི་ཞིག་གིས་རྣོ་སྦྲེང་འབུད་བཞིན་འདུག (En: A man is playing flute alone) Tib2: བུ་གཅིག་གིས་པི་ཝང་རྡུང་ཤེས། (En: A boy can play the guitar) | 2.7 | 2.9699 | 3.1576 |
| Tib1: སྐྱེས་པ་ཞིག་གིས་རྐང་རྩེད་སྤོ་ལོ་རྒྱག་བཞིན་ཡོད་། (En: A man is playing football) Tib2: སྐྱེས་པ་ཞིག་གིས་སྤོ་ལོ་འགྲན་བསྡུར་ཞིག་ལ་བཀོལ་བཞིན་ཡོད་། (En: A man is playing basketball in a basketball game) | 4 | 3.4221 | 3.3670 |
4.2 Tibetan-Chinese Translation Plagiarism Detection
Leveraging the Tibetan-Chinese data from translation-based augmentation, we trained word vectors with MUSE (Multilingual Unsupervised and Supervised Embeddings), a cross-lingual word-embedding approach proposed by Facebook. We used corpora of varying scales and fed the Tibetan-Chinese cross-lingual word vectors into the Tibetan-Chinese translation plagiarism detection model to investigate the impact of corpus scale.

Experimental results are shown in Table 5. With 10,000 training sentence pairs, the ρ value is only 0.1505, so the model output is weakly correlated with manual annotation. As the training data grow to 50,994 pairs, ρ improves to 0.4062, about 0.26 above the 10,000-pair baseline, and at 207,975 pairs it reaches 0.5476, a moderate correlation with manual annotation. Beyond that point, however, ρ begins to decline, which we attribute to accumulated errors in the generated data: the Chinese plagiarism detection model used for data augmentation itself achieves a PCC of only 0.5605, so the augmented training data inherit its errors, and as the training corpus grows these errors accumulate and hurt the model.
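Loading such MUSE-aligned vectors (word2vec text format) into a single shared space for the two Siamese branches might look like the following sketch; the file names are hypothetical.

```python
import io
import numpy as np

def load_aligned_vectors(path, max_words=200000):
    """Load MUSE-aligned embeddings in word2vec text format into a dict.
    Tibetan and Chinese vectors aligned into one space this way can feed
    both branches of the Siamese network."""
    vectors = {}
    with io.open(path, encoding='utf-8') as f:
        next(f)  # skip the "word_count dimension" header line
        for i, line in enumerate(f):
            if i >= max_words:
                break
            word, rest = line.rstrip().split(' ', 1)
            vectors[word] = np.array(rest.split(), dtype=np.float32)
    return vectors

# Hypothetical file names for the aligned Chinese and Tibetan vectors:
# cn_vecs = load_aligned_vectors('aligned.zh.vec')
# tib_vecs = load_aligned_vectors('aligned.bo.vec')
```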
Table 5. Results of Tibetan-Chinese translation plagiarism detection with different training-data sizes.

| Number of sentence pairs | ρ | MSE | ρs |
|---|---|---|---|
| 10,000 | 0.1505 | 2.6627 | 0.1425 |
| 14,994 | 0.2291 | 2.5062 | 0.1841 |
| 50,994 | 0.4062 | 1.8417 | 0.5139 |
| 101,988 | 0.3744 | 1.8402 | 0.4054 |
| 151,988 | 0.4746 | 1.0589 | 0.5239 |
| 167,976 | 0.4845 | 1.0424 | 0.5265 |
| 177,975 | 0.4957 | 1.0425 | 0.5044 |
| 197,975 | 0.5264 | 1.0311 | 0.5297 |
| 207,975 | 0.5476 | 1.0205 | 0.5508 |
| 217,975 | 0.5127 | 1.0753 | 0.5196 |
We also extracted several sample sentence pairs when the number of training sentence pairs was 14,994, 151,988, and 207,975, and compared the similarity values of the manual annotation against the model outputs, as shown in Table 6. For the first three sentence pairs, the output approaches the manual annotation as the training data grow, indicating that the data augmentation method is effective for low-resource languages. The last two sentence pairs, however, show the negative impact of accumulated errors in the augmented data.
Table 6. Manual annotations vs. model outputs under different training-data sizes.

| Tibetan-Chinese sentence pair | Manual | Output (14,994) | Output (151,988) | Output (207,975) |
|---|---|---|---|---|
| 一个男人在唱歌 སྐྱེས་པ་ ཞིག་ གིས་གླུ་ ལེན་ (A man is singing) | 3.4 | 3.9366 | 3.7038 | 3.0950 |
| 那人正在树林里坐着 སྐྱེས་པ་ ཞིག་ གིས་ནགས་ཚལ་ ནང་ བསྡད་ (The man is sitting in the woods) | 3.3 | 4.2417 | 3.9381 | 3.3296 |
| 一个人拿着话筒唱着歌 སྐད་སྦུག་ བཟུང་ ནས་ སྐྱེས་པ་ ཞིག་ ལེན་ བཞིན་ (A man singing with a microphone) | 1.1 | 4.0065 | 3.7388 | 3.4043 |
| 那人坐在火车上,把手放在脸上 སྐྱེས་པ་ ཞིག་ གིས་ མེ་འཁོར་ ནང་ བསྡད་ ། ལག་པ་ གདོང་ ལ་ བཞག་ (The man sat on the train and put his hands on his face) | 4.9 | 3.1056 | 3.5461 | 3.7200 |
| 一个人骑着马 སྐྱེས་པ་ ཞིག་ གིས་རྟ་ བཞོན་ ནས་ (A man is riding a horse alone) | 1.5 | 3.6221 | 3.6377 | 3.4920 |
5. CONCLUSION AND FUTURE WORK
In this paper, we focus on cross-lingual Tibetan-Chinese text plagiarism and propose an improved attentive Siamese LSTM model for semantic rewriting plagiarism and translation plagiarism. A translation-based data augmentation strategy is explored to alleviate data scarcity, and a pre-detection method based on abstract document vectors is proposed to improve detection efficiency. Experimental results show that the proposed model achieves a strong PCC with manual annotation for monolingual plagiarism detection and a moderate PCC for cross-lingual Tibetan-Chinese translation plagiarism.
Our future work will focus on improving the performance of Tibetan-Chinese text plagiarism detection by applying larger Tibetan-Chinese corpora and state-of-the-art deep learning models. We will also explore the data augmentation method on other low-resource languages.
AUTHOR CONTRIBUTIONS
W. Bao ([email protected]), J. Dong ([email protected]), Y. Xu ([email protected]), Y. Yang ([email protected]), X. Qi ([email protected]) were all responsible for the system design, the experimental implementation of the approach, and the result analysis. All authors contributed to the manuscript writing and approved the submitted version.
ACKNOWLEDGEMENTS
This work is supported by the National Natural Science Foundation of China (No. 62271456) and the Open Projects Program of the State Key Laboratory of Multimodal Artificial Intelligence Systems.