Low-resource text plagiarism detection faces a significant challenge due to the limited availability of labeled data for training. This task requires the development of sophisticated algorithms capable of identifying similarities and differences in texts, particularly in the realm of semantic rewriting and translation-based plagiarism detection. In this paper, we present an enhanced attentive Siamese Long Short-Term Memory (LSTM) network designed for Tibetan-Chinese plagiarism detection. Our approach begins with the introduction of translation-based data augmentation, aimed at expanding the bilingual training dataset. Subsequently, we propose a pre-detection method leveraging abstract document vectors to enhance detection efficiency. Finally, we introduce an improved attentive Siamese LSTM network tailored for Tibetan-Chinese plagiarism detection. We conduct comprehensive experiments to showcase the effectiveness of our proposed plagiarism detection framework.

In recent years, the media has highlighted numerous cases of plagiarism, casting a negative shadow on society. According to [1], 16% of retractions were attributed to plagiarism. Gupta and Rosso's analysis of papers from the Association for Computational Linguistics (ACL) proceedings revealed an increase in verbatim plagiarism from 5.11% to 9.67% between 2008 and 2012 [2]. Additionally, [3] observed that plagiarism accounted for 22.7% of retractions in obstetrics and gynecology.

To counter academic fraud and plagiarism, various institutions have implemented monitoring systems and technical methods. Notably, ACM and IEEE have established policies to foster a positive academic environment. Turnitin, for instance, has created a dedicated anti-plagiarism website that compiles cases, detection methods, policies, and regulations to prevent academic paper plagiarism.

While progress has been made, academic misconduct remains challenging to eradicate due to the low barriers to accessing diverse data. This challenge is particularly evident in two aspects. Firstly, as the volume of data continues to grow rapidly, traditional information retrieval-based plagiarism detection methods are becoming increasingly time-consuming. Secondly, the ease of accessing cross-lingual information has led to more cases in which theses are plagiarized by translating text from other languages. Therefore, effective detection methods are essential to address this issue.

Cross-lingual text plagiarism detection, which involves identifying plagiarism in texts written in different languages, poses a significant challenge, especially for low-resource languages. In this paper, we focus on studying cross-lingual Tibetan-Chinese plagiarism detection, specifically targeting semantic rewriting and translation-based plagiarism. We employ a corpus expansion method based on data augmentation for low-resource languages. Additionally, we propose a pre-detection method to achieve a coarse-grained detection. Finally, we introduce an improved attentive Siamese Long Short-Term Memory (LSTM) model for text plagiarism detection.

2.1 Plagiarism Detection Methods

Research on plagiarism detection encompasses various aspects, necessitating diverse modules. In the pre-detection phase, document retrieval plays a pivotal role. Detecting the content of text involves measures such as sentence semantic similarity, paraphrasing detection, and document similarity. Plagiarism detection methods typically fall into three categories: content-based, structure-based, and citations and references-based.

Content-based methods are the most commonly employed in text plagiarism detection. These include bag-of-words, N-gram, fingerprints, LCS (Longest Common Subsequence), and fusion methods. The bag-of-words model treats a document as a collection of words, disregarding word order, grammar, and other linguistic factors. Often combined with the vector space model in the document retrieval step, it is widely utilized in evaluations such as CLEF-PAN, as seen in studies like [4, 5, 6, 7, 8, 9]. N-gram, another prevalent method, involves the analysis of sequences of n consecutive words. The higher the count of shared N-grams between a suspicious and source document, the greater the perceived similarity. Building upon N-gram, researchers like Torrejón and Stamatatos have introduced variations such as Skip N-gram [10] and stop word N-gram [11]. Fingerprinting divides a document into chunks and hashes them, employing methods such as the MD5 algorithm, as demonstrated in the CLEF-PAN 2010 winning system [12]. The LCS algorithm calculates document similarity by determining the longest common subsequences between two texts.
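To make the three content-based measures concrete, the following minimal sketch scores a toy suspicious/source pair with word N-gram overlap, MD5 chunk fingerprints, and LCS length. The function names, the Jaccard normalization, and the chunk size of 4 are illustrative assumptions, not the settings of any of the cited systems.

```python
import hashlib


def word_ngrams(tokens, n):
    """Return the set of tuples of n consecutive tokens."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(suspicious, source, n=3):
    """Jaccard overlap of word n-grams (one possible normalization, assumed here)."""
    a, b = word_ngrams(suspicious, n), word_ngrams(source, n)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def chunk_fingerprints(tokens, chunk_size=4):
    """Hash each fixed-size chunk of tokens with MD5 and return the digests."""
    chunks = (tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size))
    return {hashlib.md5(" ".join(c).encode("utf-8")).hexdigest() for c in chunks}


def lcs_length(x, y):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        curr = [0]
        for j, yj in enumerate(y, start=1):
            curr.append(prev[j - 1] + 1 if xi == yj else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


# Toy example (hypothetical sentences, not taken from any corpus).
src = "the quick brown fox jumps over the lazy dog".split()
sus = "a quick brown fox jumps over a sleepy dog".split()

print("3-gram overlap:", ngram_overlap(sus, src, n=3))
print("shared fingerprints:", len(chunk_fingerprints(sus) & chunk_fingerprints(src)))
print("normalized LCS:", lcs_length(sus, src) / max(len(sus), len(src)))
```

In this toy pair the word substitutions break most fixed-size chunks, so fingerprints match far less than N-grams or LCS do, which illustrates why such measures are best suited to continuous copied text.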

Structure-based plagiarism detection relies on the inherent framework and organization of academic papers, encompassing elements such as title, authors, institutions, abstract, chapter titles, chapters, and references. Chow and Rahman (2009) [13] introduce a three-level tree structure representation for documents: document, page, and chapter. This representation allows for document-level classification and the extraction of more specific features from pages and chapters. Zhang and Chow (2011) [14] propose a multi-level matching method, assigning weight parameters to document and paragraph levels. They then utilize principal component analysis to map the high-dimensional histogram to the underlying semantic space. Alzahrani (2012) [15] suggests that specific sections, such as methods, conclusions, and discussions, are more crucial for plagiarism detection than sections like introduction, acknowledgments, and copyrights. To implement this insight, he employs IGF, Spread, and Depth methods to assign weights to different sections of the document structure. Alzahrani further incorporates an analysis of document citations, which proves beneficial for document retrieval and classification when combined with the document structure.

Citations and references-based plagiarism detection, as proposed by Gipp (2010) [16], addresses a limitation of character matching methods, which struggle to identify semantic rewriting and translation plagiarism. It relies on the observation that the relative positions of citations within plagiarized snippets tend to remain similar even when the wording changes. The language-independent nature of both citation-based features and fingerprints makes them particularly robust. Gipp further introduces methods such as citation order analysis, greedy citation tiling, citation chunking, and the longest common citation sequence in [17].

While content-based methods excel in detecting continuous text plagiarism, they often fall short in identifying semantic rewriting and translation plagiarism. Combining structure-based plagiarism detection methods with content-based methods has proven effective in enhancing detection efficiency. Citations and references-based plagiarism detection methods can be employed when a paper has correct and complete citations. Experimental results consistently demonstrate the effectiveness of integrating content-based, structure-based, and citations and references-based methods for comprehensive plagiarism detection.

In recent years, deep neural networks (DNNs) have achieved remarkable success and found applications in various natural language processing tasks, including machine translation and question answering systems. DNNs are increasingly being applied in text plagiarism detection, text similarity calculation, and sentence rewriting detection. Notable contributions include He (2015) [18], Tai (2015) [19], Kiros (2015) [20], and Mueller (2016) [21], who proposed multi-perspective convolutional neural networks, TreeLSTM, Skip-Thought, and Siamese LSTM networks, respectively, for calculating sentence similarity.

2.2 Low-Resource Plagiarism Detection

In the realm of low-resource language plagiarism detection, noteworthy exploratory work has been undertaken. Karzan Wakil (2017) [22] conducted research on Kurdish plagiarism detection, employing an N-gram model to identify plagiarized elements at the levels of words, phrases, and paragraphs. Similarly, Sara Sameen (2017) [23] utilized a diverse set of machine learning techniques, including Naive Bayes, Support Vector Machine, J48, and Random Forest, to detect both continuous text plagiarism and rewriting in Urdu. Fingerprinting methods were applied in [24] and [25] for the detection of plagiarism in Russian and Indonesian, respectively.

Conversely, research on cross-language text plagiarism detection, particularly in the domain of translation plagiarism, is relatively scarce. Existing studies predominantly concentrate on cross-language text semantic similarity calculation, spanning language pairs such as English-German, English-French, English-Spanish, English-Arabic, and English-Haitian. Some cross-language text plagiarism detection research incorporates methods from cross-language information retrieval, multilingual text classification, and cross-lingual text similarity calculation. Generally, four methods have been employed in cross-language text plagiarism detection: grammar-based [26], dictionary-based [2], parallel/comparable corpus-based [27], and semantic-based [28, 29]. Machine translation-based methods also find application in this context.

Presently, limited research exists on low-resource language plagiarism detection, presenting challenges related to the acquisition of training data and the effectiveness of few-shot detection. This paper aims to address these challenges by proposing an improved attentive Siamese LSTM model tailored for detecting semantic rewriting and translation-based plagiarism. Additionally, we introduce a corpus expansion method based on data augmentation, specifically designed to enhance performance in low-resource language scenarios.

This paper focuses on cross-lingual Tibetan-Chinese text plagiarism detection, as illustrated in Figure 1. The system architecture comprises three main components: data augmentation, pre-detection module, and LSTM-based plagiarism detection.

Figure 1.

A system hierarchy of cross-lingual text plagiarism detection, consisting of three main components: data augmentation, pre-detection module, and LSTM-based plagiarism detection.

Firstly, we employ a translation-based data augmentation method to expand the cross-lingual Tibetan-Chinese corpus. This step enhances the availability of training data for our plagiarism detection system.

Following that, a pre-detection module, utilizing abstract document vectors, is applied to effectively handle copy-and-paste plagiarism. This module facilitates coarse-grained plagiarism detection.

The final component leverages an improved attentive Siamese LSTM network for the detection of semantic rewriting and translation-based plagiarism. This enhances the system's capability to identify these intricate forms of plagiarism.

3.1 Translation-based Data Augmentation

To address the scarcity of Tibetan-Chinese plagiarism data, we employ a translation-based data augmentation method to expand the Tibetan-Chinese corpus. The effectiveness of synthetic data in enhancing performance across various applications has been demonstrated in previous studies [30, 31].

For this study, we utilize two datasets: CWMT and SICK from SemEval2014. The CWMT corpus comprises 146,000 Tibetan-Chinese sentence pairs. As an external dataset, SICK consists of 10,000 English sentence pairs, each manually labeled with a sentence similarity value ranging from 0 to 5. In this labeling scheme, a similarity value of 0 indicates completely different semantics between the two sentences, while a value of 5 denotes identical content in the two sentences.

The translation-based data augmentation process involves the following steps:

  • Similarity Model Training: Initially, a similarity model is trained using an attentive Siamese LSTM model on the SICK corpus. The details of the model structure will be discussed in Section 3.3.

  • Sentence Regrouping: Subsequently, we regroup the English sentence pairs by randomly selecting two sentences from the original set of 10,000 pairs. The similarity value for each pair is then calculated using the trained model.

  • Translation: The selected English sentence pairs are translated into Chinese and Tibetan using machine translation models [32, 33]. This results in a new cross-lingual Tibetan-Chinese corpus, denoted as SICKcn-tib. For each sentence pair (en1, en2) labeled with a similarity score of s, the corresponding Tibetan-Chinese translation pair will have the same score. In other words, sim(tib1, cn2) = sim(tib2, cn1) = s, where sim(x, y) denotes the similarity score of the sentence pair x and y.
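The three steps above can be sketched as follows. This is only an illustration under stated assumptions: `toy_score`, `to_tibetan`, and `to_chinese` are hypothetical stand-ins for the trained similarity model and the machine translation models, which are not reproduced here.

```python
import random


def regroup_pairs(sentences, num_pairs, seed=0):
    """Randomly draw index pairs of two distinct sentences from the pool."""
    rng = random.Random(seed)
    return [tuple(rng.sample(range(len(sentences)), 2)) for _ in range(num_pairs)]


def augment(sentences, pairs, score_fn, to_tibetan, to_chinese):
    """Score each English pair, then copy the score onto the translated pairs."""
    corpus = []
    for i, j in pairs:
        s = score_fn(sentences[i], sentences[j])
        tib_i, tib_j = to_tibetan(sentences[i]), to_tibetan(sentences[j])
        cn_i, cn_j = to_chinese(sentences[i]), to_chinese(sentences[j])
        # sim(tib1, cn2) = sim(tib2, cn1) = s, as stated in the text.
        corpus.append((tib_i, cn_j, s))
        corpus.append((tib_j, cn_i, s))
    return corpus


def toy_score(a, b):
    """Hypothetical placeholder for the trained similarity model (0 to 5 scale)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return 5.0 * len(wa & wb) / len(wa | wb)


def to_tibetan(sentence):
    """Hypothetical placeholder for an English-to-Tibetan translation model."""
    return "<tib> " + sentence


def to_chinese(sentence):
    """Hypothetical placeholder for an English-to-Chinese translation model."""
    return "<cn> " + sentence


english = ["A man is singing", "A man plays guitar", "A boy rides a horse"]
pairs = regroup_pairs(english, num_pairs=2)
for tib, cn, score in augment(english, pairs, toy_score, to_tibetan, to_chinese):
    print(tib, "|", cn, "|", round(score, 2))
```

Because the labels come from a model rather than from annotators, any scoring error is copied into every translated pair derived from the same English pair.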

In total, we generated 217,975 Tibetan-Chinese sentence pairs with this data augmentation method. Table 1 presents samples of the generated Tibetan-Chinese sentence pairs.

Table 1.

Tibetan-Chinese sentence pairs after data augmentation from the English sentences ‘He stared with all his eyes at the golden scene.’ and ‘He won everyone's respect with courage.’. In the table, ‘Tib’, ‘Cn’, and ‘En’ are abbreviations for Tibetan, Chinese, and English, respectively. The English sentence in parentheses gives the meaning of the Tibetan or Chinese sentence that precedes it.

Sentence pairs | Similarity
Tib1: ཁོས་བློ་སེམ་གཅིག་ཏུ་བསྡུས་ནས་གསེར་མདོག་གི་ཡུལ་ལྗོངས་འདིར་བལྟས་འདུག (En: He stared with all his eyes at the golden scene.); Cn1: 他全神注视着这片金黄色的景色。 (En: He stared with all his eyes at the golden scene.) |
Tib2: ཁོས་བློ་སྤོབས་ཀྱིས་ཚང་མའི་བརྩི་བཀུརཐོབ། (En: He won everyone's respect with courage.); Cn2: 他以勇气赢得大家的尊敬。 (En: He won everyone's respect with courage.) |
Cn1: 他全神注视着这片金黄色的景色。 (En: He stared with all his eyes at the golden scene.); Cn2: 他以勇气赢得大家的尊敬。 (En: He won everyone's respect with courage.) | 1.6
Tib1: ཁོས་བློ་སེམ་གཅིག་ཏུ་བསྡུས་ནས་གསེར་མདོག་གི་ཡུལ་ལྗོངས་འདིར་བལྟས་འདུག (En: He stared with all his eyes at the golden scene.); Tib2: ཁོས་བློ་སྤོབས་ཀྱིས་ཚང་མའི་བརྩི་བཀུརཐོབ། (En: He won everyone's respect with courage.) | 1.6
Tib2: ཁོས་བློ་སྤོབས་ཀྱིས་ཚང་མའི་བརྩི་བཀུརཐོབ། (En: He won everyone's respect with courage.); Cn1: 他全神注视着这片金黄色的景色。 (En: He stared with all his eyes at the golden scene.) | 1.6
Tib1: ཁོས་བློ་སེམ་གཅིག་ཏུ་བསྡུས་ནས་གསེར་མདོག་གི་ཡུལ་ལྗོངས་འདིར་བལྟས་འདུག (En: He stared with all his eyes at the golden scene.); Cn2: 他以勇气赢得大家的尊敬。 (En: He won everyone's respect with courage.) | 1.6

3.2 Pre-Detection based on Abstract Document Vector

The abstract of an academic paper serves as a succinct summary, encapsulating the core of the research. Drawing inspiration from the observation that significantly different abstracts indicate low likelihood of plagiarism, this module aims to assess the correlation between a source document (dsrc) and a potential plagiarism candidate document (dplg) based on their abstracts. This assessment serves as a preliminary step to determine the need for further plagiarism detection measures.

To achieve this, we train document vectors for abstracts using the Doc2vec algorithm. Doc2vec extends word2vec: both train a shallow neural network to predict words from their context, and Doc2vec adds a paragraph vector to the input layer. As a result, a word2vec model yields a vector representation of each word during training, whereas a Doc2vec model yields vector representations of both words and paragraphs. Abstracts from 90 Chinese papers and 60 Tibetan papers are utilized for this purpose. The document vectors are trained separately for Chinese and Tibetan papers, with both the document vector and word vector dimensions set to 300, and the window length set to 9. Following the extraction of abstract document vectors, we calculate the cosine similarity between the source document and the candidate document. To evaluate this similarity, native speakers manually annotate the similarity between the abstracts of the 90 Chinese papers and the 60 Tibetan papers.

We calculate the Pearson correlation coefficients (PCC) between the manual annotations and the experimental results. The PCC for Tibetan papers is 0.6367, while for Chinese papers it is 0.8706. These values indicate a strong and an extremely strong correlation, respectively, between the experimental results and manual annotations. The PCC for Tibetan-Chinese papers is 0.539, which shows a moderate relevance. Therefore, the pre-detection method based on abstract document vectors can identify similarities between Tibetan and Chinese papers, although less reliably across languages than within a single language.

After analyzing the experimental results, we set a threshold of 0.5 in the pre-detection module based on abstract document vectors. When the similarity exceeds 0.5, the two papers are suspected of plagiarism and full-paper detection is required. Otherwise, the pair is considered free of suspicion and full-text detection is not carried out.
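The thresholding step can be illustrated with a short sketch. It assumes the abstract vectors have already been produced by a trained Doc2vec model; the three-dimensional vectors and the function names below are hypothetical placeholders for the 300-dimensional vectors described above.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def needs_full_detection(src_vec, cand_vec, threshold=0.5):
    """Return True when the abstract similarity exceeds the pre-detection threshold."""
    return cosine_similarity(src_vec, cand_vec) > threshold


# Hypothetical abstract vectors (toy 3-dimensional stand-ins).
d_src = [0.8, 0.1, 0.3]
d_plg = [0.7, 0.2, 0.4]
d_other = [-0.5, 0.9, -0.1]

for name, vec in [("d_plg", d_plg), ("d_other", d_other)]:
    sim = cosine_similarity(d_src, vec)
    print(name, round(sim, 4), "full detection" if needs_full_detection(d_src, vec) else "skip")
```

Only pairs that pass this coarse filter are forwarded to the sentence-level model, which is where the efficiency gain comes from.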

3.3 Improved Attentive Siamese LSTM Network for Plagiarism Detection

Many of the plagiarism detection tools available online demonstrate proficiency in identifying instances of continuous text plagiarism. However, they often encounter challenges when it comes to effectively detecting more nuanced forms of plagiarism, such as semantic rewriting and translation-based plagiarism. Addressing this limitation, our paper introduces a refined and enhanced attentive Siamese LSTM model, specifically crafted to tackle Tibetan semantic rewriting and Tibetan-Chinese translation plagiarism.

The Siamese LSTM Network, initially proposed by Mueller [21], serves as the backbone for computing semantic similarity between texts. This network features two parallel LSTM sub-networks that share weights. While Mueller's original approach focuses solely on the last state of each sentence in the hidden layers for similarity calculation, our experiments take a step further by incorporating an attention mechanism layer. This addition allows our model to fully exploit the available information and enhance its ability to discern subtle nuances in the text. The architectural blueprint of our improved attentive Siamese LSTM model is elucidated in Figure 2.

Figure 2.

The architecture of proposed attentive Siamese LSTM model.

The model comprises five parts: an input layer for sentence pairs (facilitating both monolingual and cross-lingual scenarios), an embedding layer that represents the input as vectors, hidden layers that extract semantic information, an attention layer that generates weight vectors, and an output layer that produces the similarity value between the two sentences. Unlike conventional plagiarism detection methods, our proposed attentive Siamese LSTM model operates on sentence pairs and word vectors as inputs, thus removing the need for prior knowledge or manual feature engineering. This not only streamlines the process but also enhances the model's adaptability to varied linguistic contexts and intricate instances of plagiarism.
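The role of the attention layer can be illustrated with a small sketch that pools all hidden states of each branch instead of keeping only the last one. The exact attention and output formulas of the model are not given in this section, so two assumptions are made here: the attention scores take the common form w · tanh(h_t) followed by a softmax, and the output uses the exp(-L1) similarity of Mueller's Siamese LSTM [21]. The hidden states are toy values standing in for LSTM outputs.

```python
import math


def attention_pool(hidden_states, w):
    """Softmax-weighted sum of hidden states, with scores w . tanh(h_t) (assumed form)."""
    scores = [sum(wk * math.tanh(hk) for wk, hk in zip(w, h)) for h in hidden_states]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # shift by max for numerical stability
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(hidden_states[0])
    pooled = [sum(a * h[k] for a, h in zip(alphas, hidden_states)) for k in range(dim)]
    return pooled, alphas


def manhattan_similarity(r1, r2):
    """exp(-L1 distance), a value in (0, 1]; 1 means identical representations."""
    return math.exp(-sum(abs(a - b) for a, b in zip(r1, r2)))


def siamese_similarity(states_a, states_b, w):
    """Both branches share the same attention weights w, as in a Siamese network."""
    r_a, _ = attention_pool(states_a, w)
    r_b, _ = attention_pool(states_b, w)
    return manhattan_similarity(r_a, r_b)


# Toy hidden states for two sentences (hypothetical values, 2 dimensions per step).
sent_a = [[0.1, 0.2], [0.3, 0.4], [0.2, 0.1]]
sent_b = [[0.1, 0.1], [0.4, 0.4]]
w = [1.0, 1.0]

print(attention_pool(sent_a, w)[1])
print(siamese_similarity(sent_a, sent_b, w))
```

The sentences may differ in length, since each branch is reduced to a single fixed-size vector before the comparison; the similarity in (0, 1] would then be rescaled to the annotation range.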

In this section, we have conducted a series of experiments, employing the following formulas to gauge the correlation between the outcomes of our system and manual annotation:

Pearson Correlation Coefficient: PCC is defined as the ratio of the covariance between two variables to the product of their standard deviations, formulated as

$$\rho = \frac{\operatorname{cov}(sim_{out}, sim_{label})}{\sigma_{sim_{out}}\,\sigma_{sim_{label}}} \tag{1}$$

where $sim_{out}$ and $sim_{label}$ represent the similarity given by the plagiarism detection model's output and by manual annotation, respectively.

Mean Square Error (MSE):

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(sim_{out,i} - sim_{label,i}\right)^{2} \tag{2}$$

where n denotes the number of data instances.

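Spearman Correlation Coefficient:

$$\rho_s = \frac{\sum_{i=1}^{n}\left(sim_{out,i} - \overline{sim}_{out}\right)\left(sim_{label,i} - \overline{sim}_{label}\right)}{\sqrt{\sum_{i=1}^{n}\left(sim_{out,i} - \overline{sim}_{out}\right)^{2}\,\sum_{i=1}^{n}\left(sim_{label,i} - \overline{sim}_{label}\right)^{2}}} \tag{3}$$

where $\overline{sim}_{out}$ and $\overline{sim}_{label}$ denote the means of $sim_{out}$ and $sim_{label}$, respectively. As is standard for the Spearman coefficient, the values entering Equation (3) are the ranks of the similarity scores rather than the scores themselves.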

The strength of correlation is interpreted from the absolute value of the coefficient as follows:

  • Extremely strong relevance: [0.8, 1.0]

  • Strong relevance: [0.6, 0.8]

  • Moderate relevance: [0.4, 0.6]

  • Weak relevance: [0.2, 0.4]

  • Very weak relevance or not relevant: [0, 0.2]
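The three metrics and the relevance scale above can be computed as in the following sketch. It is a plain-Python illustration rather than the evaluation script used in the experiments. The Spearman coefficient is computed as the Pearson coefficient of average ranks, and the handling of the shared interval boundaries (a boundary value is assigned to the stronger category) is an assumption. The sample values are the manual annotations and the 207,975-pair outputs of the five sentence pairs in Table 6, so the printed numbers describe only those five samples.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def pearson(x, y):
    """Covariance divided by the product of standard deviations."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def mse(x, y):
    """Mean of squared differences between paired values."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


def ranks(xs):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    result = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Pearson correlation of the rank-transformed values."""
    return pearson(ranks(x), ranks(y))


def relevance(rho):
    """Map |rho| to the scale above (boundary handling is an assumption)."""
    a = abs(rho)
    if a >= 0.8:
        return "extremely strong"
    if a >= 0.6:
        return "strong"
    if a >= 0.4:
        return "moderate"
    if a >= 0.2:
        return "weak"
    return "very weak or not relevant"


sim_label = [3.4, 3.3, 1.1, 4.9, 1.5]
sim_out = [3.0950, 3.3296, 3.4043, 3.7200, 3.4920]

rho = pearson(sim_out, sim_label)
print("PCC:", rho, relevance(rho))
print("MSE:", mse(sim_out, sim_label))
print("Spearman:", spearman(sim_out, sim_label))
```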

4.1 Tibetan Semantic Rewriting Plagiarism Detection

In this section, we conducted comparative experiments to evaluate the performance using Tibetan word vectors and syllable vectors.

The Tibetan word vector available online was trained by Facebook on a small-scale corpus. In contrast, we trained a larger-scale Tibetan word vector using FastText on the corpus collected in this paper. To compare the effects of word vectors of different scales, we conducted a series of experiments, as shown in Table 2. The results indicate the following: (1) The scale of the word vector training corpus affects performance. The Chinese plagiarism detection model, which uses the largest word vector corpus, achieves the lowest MSE (0.9235). Tibetan (this paper) outperforms Tibetan (Wiki), with a ρ value that is 0.214 higher (0.5702 versus 0.3562). (2) The domain of the training corpus for word vectors also affects performance. Tibetan (this paper) performs on par with Chinese (Wiki) (ρ of 0.5702 versus 0.5605) even though its training corpus is less than one third the size. We believe this is because the training data for Tibetan (this paper) includes news, academic papers, and other texts that are closer to the domain of the test set.

Table 2.

Experimental results on different scales of word vectors, where ‘training size’ means the number of words in training corpus.

Model | Training size | ρ | MSE | ρs
Chinese (Wiki) | 332,647 | 0.5605 | 0.9235 | 0.5151
Tibetan (Wiki) | 12,651 | 0.3562 | 1.3714 | 0.3339
Tibetan (this paper) | 102,054 | 0.5702 | 1.0086 | 0.5532

Tibetan is an agglutinative language with unique rules of word formation and rich morphological changes. For example, Tibetan has a large number of agglutinative forms as well as many case particles and function words. These properties lead to ambiguity and make unknown words difficult to identify in Tibetan word segmentation. Tibetan syllable segmentation is therefore commonly used in unknown word segmentation and POS tagging. In our experiments, we also trained Tibetan syllable vectors. The experimental results are presented in Table 3. Tibetan syllable vectors achieve better performance than word vectors for both models, and the AttSiaLSTM model outperforms the SiaLSTM model. With syllable vectors, the AttSiaLSTM model reaches a ρ value of 0.678, the highest among all configurations, and it also achieves the best MSE (0.8329) and ρs (0.6623) values.
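Syllable segmentation is simpler than word segmentation because Tibetan script marks syllable boundaries explicitly with the tsheg (U+0F0B), while the shad (U+0F0D) closes a clause. A minimal sketch that exploits this is shown below. The function name is hypothetical and this is not the segmentation tool used in the experiments; the example sentence is Tib2 of the second pair in Table 4.

```python
import re

TSHEG = "\u0f0b"  # intersyllabic mark
SHAD = "\u0f0d"   # clause-final mark

# Split on runs of tsheg, shad, or whitespace.
_BOUNDARY = re.compile("[" + TSHEG + SHAD + r"\s]+")


def segment_syllables(text):
    """Return the list of syllables, dropping the boundary marks themselves."""
    return [s for s in _BOUNDARY.split(text) if s]


sentence = "བུ་གཅིག་གིས་པི་ཝང་རྡུང་ཤེས།"  # (En: A boy can play the guitar)
syllables = segment_syllables(sentence)
print(len(syllables), syllables)
```

Because the split depends only on two punctuation marks, it introduces no segmentation ambiguity and produces a small, closed vocabulary, which is consistent with the better results of the syllable vectors reported in Table 3.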

Table 3.

Experimental results on Tibetan word vector and Tibetan syllable vector, where SiaLSTM model is Siamese LSTM network and AttSiaLSTM is attentive Siamese LSTM network.

Model | Method | ρ | MSE | ρs
SiaLSTM | Tibetan word vector | 0.4985 | 1.2512 | 0.4877
SiaLSTM | Tibetan syllable vector | 0.5691 | 1.0152 | 0.5489
AttSiaLSTM | Tibetan word vector | 0.5702 | 1.0086 | 0.5532
AttSiaLSTM | Tibetan syllable vector | 0.6780 | 0.8329 | 0.6623

To compare the performance between Tibetan syllable vectors and Tibetan word vectors, we extracted several sentence pairs with similarity from the Tibetan plagiarism detection model, as shown in Table 4. When comparing the output between Tibetan syllable vectors and word vectors, it is evident that the former aligns more closely with manual annotation values. This suggests that the AttSiaLSTM model based on Tibetan syllable vectors performs better than the AttSiaLSTM model based on Tibetan word vectors.

Table 4.

Experimental samples of Tibetan plagiarism detection. Similarity denotes the manual annotation in SICK; out_syl denotes the output of the AttSiaLSTM model based on Tibetan syllable vectors; out_word denotes the output of the AttSiaLSTM model based on Tibetan word vectors. In the table, ‘Tib’, ‘Cn’, and ‘En’ are abbreviations for Tibetan, Chinese, and English, respectively. The English sentence in parentheses gives the meaning of the Tibetan or Chinese sentence that precedes it.

Sentence pair | Similarity | out_syl | out_word
Tib1: སྐྱེས་པ་ཞིག་གིས་སྒེའུ་ཁུང་གི་འོག་ཏུ་པི་ཝང་རྡུང་བཞིན་ཡོད་། (En: A man plays guitar under the window); Tib2: སྐྱེས་པ་ཞིག་གིས་རྣོ་སྦྲེང་དཀྲོལ་བཞིན་ཡོད་། (En: A man plays guitar) | 2.4 | 2.4287 | 2.6786
Tib1: མི་ཞིག་གིས་རྣོ་སྦྲེང་འབུད་བཞིན་འདུག (En: A man is playing flute alone); Tib2: བུ་གཅིག་གིས་པི་ཝང་རྡུང་ཤེས། (En: A boy can play the guitar) | 2.7 | 2.9699 | 3.1576
Tib1: སྐྱེས་པ་ཞིག་གིས་རྐང་རྩེད་སྤོ་ལོ་རྒྱག་བཞིན་ཡོད་། (En: A man is playing football); Tib2: སྐྱེས་པ་ཞིག་གིས་སྤོ་ལོ་འགྲན་བསྡུར་ཞིག་ལ་བཀོལ་བཞིན་ཡོད་། (En: A man is playing basketball in basketball game) | | 3.4221 | 3.3670

4.2 Tibetan-Chinese Translation Plagiarism Detection

Leveraging the Tibetan-Chinese data obtained from translation-based data augmentation, we trained cross-lingual word vectors with MUSE (Multilingual Unsupervised and Supervised Embeddings), a cross-lingual word embedding approach proposed by Facebook. We employed corpora of varying scales and used the Tibetan-Chinese cross-lingual word vectors as input to train a Tibetan-Chinese translation plagiarism detection model, with the aim of investigating the impact of corpus scale. Experimental results are shown in Table 5. With 10,000 training sentence pairs, the ρ value is 0.1505, which indicates that the model output is only very weakly correlated with manual annotation. As the training data increases to 50,994 sentence pairs, ρ rises to 0.4062, an improvement of about 0.26 over the 10,000-pair baseline. With 207,975 sentence pairs, ρ reaches 0.5476, indicating a moderate correlation with manual annotation. When the corpus grows further to 217,975 sentence pairs, however, ρ declines to 0.5127. We attribute this decline to the accumulation of errors in the generated data. The Chinese plagiarism detection model used for data augmentation itself achieves a Pearson correlation coefficient of 0.5605, so the similarity labels it assigns to the generated data contain errors, and these errors accumulate and harm the model as the training corpus grows.

Table 5.

Influence of corpora scale on Tibetan-Chinese translation plagiarism detection experiments.

Number of sentence pairs | ρ | MSE | ρs
10,000 | 0.1505 | 2.6627 | 0.1425
14,994 | 0.2291 | 2.5062 | 0.1841
50,994 | 0.4062 | 1.8417 | 0.5139
101,988 | 0.3744 | 1.8402 | 0.4054
151,988 | 0.4746 | 1.0589 | 0.5239
167,976 | 0.4845 | 1.0424 | 0.5265
177,975 | 0.4957 | 1.0425 | 0.5044
197,975 | 0.5264 | 1.0311 | 0.5297
207,975 | 0.5476 | 1.0205 | 0.5508
217,975 | 0.5127 | 1.0753 | 0.5196

We also extract several sample sentence pairs from the models trained with 14,994, 151,988, and 207,975 sentence pairs, and compare the manually annotated similarity values with the model outputs, as shown in Table 6. For the first three sentence pairs, the model output generally moves closer to the manual annotation as the amount of training data increases, which indicates that the data augmentation method is effective for low-resource languages. The negative impact of the data augmentation method can also be seen in the last two sentence pairs, whose outputs remain more than one point away from the manual annotation at every data scale.

Table 6.

Experimental results of Tibetan-Chinese translation plagiarism detection in terms of similarity for manual annotation and model outputs with different data scales.

Tibetan-Chinese sentence pair | Manual | Model output (14,994) | Model output (151,988) | Model output (207,975)
一个男人在唱歌 / སྐྱེས་པ་ ཞིག་ གིས་གླུ་ ལེན་ (A man is singing) | 3.4 | 3.9366 | 3.7038 | 3.0950
那人正在树林里坐着 / སྐྱེས་པ་ ཞིག་ གིས་ནགས་ཚལ་ ནང་ བསྡད་ (The man is sitting in the woods) | 3.3 | 4.2417 | 3.9381 | 3.3296
一个人拿着话筒唱着歌 / སྐད་སྦུག་ བཟུང་ ནས་ སྐྱེས་པ་ ཞིག་ ལེན་ བཞིན་ (A man singing with a microphone) | 1.1 | 4.0065 | 3.7388 | 3.4043
那人坐在火车上,把手放在脸上 / སྐྱེས་པ་ ཞིག་ གིས་ མེ་འཁོར་ ནང་ བསྡད་ ། ལག་པ་ གདོང་ ལ་ བཞག་ (The man sat on the train and put his hands on his face) | 4.9 | 3.1056 | 3.5461 | 3.7200
一个人骑着马 / སྐྱེས་པ་ ཞིག་ གིས་རྟ་ བཞོན་ ནས་ (A man is riding a horse alone) | 1.5 | 3.6221 | 3.6377 | 3.4920

In this paper, we focus on cross-lingual Tibetan-Chinese text plagiarism and propose an improved attentive Siamese LSTM model for semantic rewriting plagiarism and translation plagiarism. A translation-based data augmentation strategy is explored to alleviate the problem of data scarcity, and a pre-detection method based on abstract document vectors is proposed to improve the efficiency of detection. Experimental results show that the output of the proposed plagiarism detection model achieves a strong correlation with manual annotation (ρ = 0.678) for monolingual Tibetan semantic rewriting detection and a moderate correlation (ρ = 0.5476) for cross-lingual Tibetan-Chinese translation plagiarism detection.

In future work, we will improve the performance of Tibetan-Chinese text plagiarism detection by applying larger Tibetan-Chinese corpora and state-of-the-art deep learning models in the experiments. Additionally, we will explore the data augmentation method for other low-resource languages.

W. Bao, J. Dong, Y. Xu, Y. Yang (wateryoo919@aliyun.com), and X. Qi were all responsible for the system design, the experimental implementation of the approach, and the result analysis. All authors contributed to the manuscript writing and approved the submitted version.

This work is supported by the National Natural Science Foundation of China (No. 62271456) and the Open Projects Program of the State Key Laboratory of Multimodal Artificial Intelligence Systems.

[1] Moylan, E. C., Kowalczuk, M. K.: Why articles are retracted: a retrospective cross-sectional study of retraction notices at biomed central. BMJ Open e012047 (2016).
[2] Gupta, P., Barrón, C. A., Rosso, P.: Cross-language high similarity search using a conceptual thesaurus. In: Proceedings of the Third International Conference of the CLEF Initiative: Information Access Evaluation. Multilinguality, Multimodality, and Visual Analytics, pp. 67-75 (2012).
[3] Chambers, L. M., Michener, C. M., Falcone, T.: Plagiarism and data falsification are the most common reasons for retracted publications in obstetrics and gynaecology. BJOG 126(9), 1134-1140 (2019).
[4] Kong, L. L., Qi, H. L., Wang, S., et al.: Approaches for candidate document retrieval and detailed comparison of plagiarism detection. In: Proceedings of the Conference and Labs of the Evaluation Forum (2012).
[5] Kong, L. L., Qi, H. L., Du, C. X., et al.: Approaches for source retrieval and text alignment of plagiarism detection Notebook for PAN at CLEF 2013. In: Conference and Labs of the Evaluation Forum (2013).
[6] Sanchez-Perez, M. A., Sidorov, G., Gelbukh, A. F.: A winning approach to text alignment for text reuse detection at pan 2014. In: Proceedings of the Conference and Labs of the Evaluation Forum, pp. 1004-1011 (2014).
[7] Fan, C. H., Zhang, H. M., Li, A. D., et al.: Compnet: Complementary network for single-channel speech enhancement. Journal of Neural Networks 168, 508-517 (2023).
[8] Fan, K. F., Li, F., Yu, H. Y., et al.: A blockchain-based flexible data auditing scheme for the cloud service. Journal of Chinese Journal of Electronics 6(30), 1159-1166 (2021).
[9] Torrejón, D. A. R., Ramos, J. M. M.: Text Alignment Module in CoReMo 2.1 Plagiarism Detector Notebook for PAN at CLEF 2013. In: Proceedings of the Conference and Labs of the Evaluation Forum (2013).
[10] Rodríguez-Torrejón, D., Ramos, J. M.: CoReMo system (contextual reference monotony) a fast, low cost and high performance plagiarism analyzer system: Lab report for pan at CLEF 2010. In: Proceedings of the Conference and Labs of the Evaluation Forum (2010).
[11] Stamatatos, E.: Plagiarism detection using stopword n-grams. Journal of the American Society for Information Science and Technology 62(12), 2512-2527 (2011).
[12] Stein, B., Eissen, S. M. Z.: Near similarity search and plagiarism analysis. In: Proceedings of the 29th Annual Conference of the Gesellschaft für Klassifikation, pp. 430-437 (2005).
[13] Chow, T. W., Rahman, M.: Multilayer som with tree-structured data for efficient document retrieval and plagiarism detection. IEEE Transactions on Neural Networks 20(9), 1385-1402 (2009).
[14] Zhang, H. J., Chow, T. W. S.: A coarse-to-fine framework to efficiently thwart Plagiarism. Journal of Pattern Recognition, pp. 471-487 (2011).
[15] Alzahrani, S., Palade, V., Salim, N., Abraham, A.: Using structural information and citation evidence to detect significant plagiarism cases in scientific publications. Journal of the American Society for Information Science and Technology, pp. 286-312 (2012).
[16] Gipp, B., Beel, J.: Citation based plagiarism detection: a new approach to identify plagiarized work language independently. In: Proceedings of the 21st ACM Conference on Hypertext and Hypermedia, pp. 273-274 (2010).
[17] Gipp, B., Meuschke, N., Breitinger, C.: Citation-based plagiarism detection: Practicability on a large-scale scientific corpus. Journal of the Association for Information Science and Technology, pp. 1527-1540 (2014).
[18] He, H., Gimpel, K., Lin, J.: Multi-perspective sentence similarity modeling with convolutional neural networks. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1576-1586 (2015).
[19] Tai, K. S., Socher, R., Manning, C. D.: Improved semantic representations from tree-structured long short-term memory networks. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pp. 1556-1566 (2015).
[20] Kiros, R., Zhu, Y., Salakhutdinov, R. R.: Skip-thought vectors. In: Proceedings of the 28th International Conference on Neural Information Processing Systems, pp. 3294-3302 (2015).
[21] Mueller, J., Thyagarajan, A.: Siamese recurrent architectures for learning sentence similarity. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 2786-2792 (2016).
[22] Karzan, W., Muhammad, G., Shvan, T., et al.: Plagiarism detection system for the kurdish language. International Journal of Information Technology and Computer Science, pp. 64-71 (2017).
[23] Sameen, S., Sharjeel, M., Nawab, R. M. A., et al.: Measuring short text reuse for the urdu language. IEEE Access, pp. 7412-7421 (2017).
[24] Rezaeian, N., Novikova, G.: Detecting near-duplicates in Russian documents through using fingerprint algorithm simhash. Procedia Computer Science, pp. 421-425 (2017).
[25] Arifin, Y., Isa, S., Wulandhari, L., et al.: Plagiarism detection for Indonesian language using winnowing with parallel processing. Journal of Physics: Conference Series, pp. 012082 (2018).
[26] McNamee, P., Mayfield, J.: Character n-gram tokenization for european language text retrieval. Journal of Information retrieval 7, 73-97 (2004).
[27] Barrón-Cedeno, A., Rosso, P., Pinto, D., et al.: On cross-lingual plagiarism analysis using a statistical model. In: Proceedings of the 2008 International Conference on Uncovering Plagiarism, Authorship and Social Software, pp. 1-10 (2008).
[28] Aljuaid, H.: Cross-language plagiarism detection using word embedding and inverse document frequency (IDF). International Journal of Advanced Computer Science and Applications 11(2), 232-237 (2020).
[29] Mostafa, H. E., Benabbou, F.: A deep learning based technique for plagiarism detection: a comparative study. International Journal of Artificial Intelligence 9(1), 81-90 (2020).
[30] Ding, L., Tao, D. C.: The university of Sydney's machine translation system for WMT19. In: Proceedings of the Fourth Conference on Machine Translation, pp. 175-182 (2019).
[31] Pino, J., Puzon, L., Gu, J.: Harnessing indirect training data for end-to-end automatic speech translation: Tricks of the trade. In: Proceedings of the 16th International Conference on Spoken Language Translation (2019).
[32] Zan, C. T., Peng, K. Q., Ding, L., et al.: Vega-MT: The JD explore academy translation system for wmt22. In: Proceedings of the Seventh Conference on Machine Translation, pp. 411-422 (2022).
[33] Lai, W., Zhao, X. B., Bao, W.: Tibetan-Chinese neural machine translation based on syllable segmentation. In: Proceedings of the AMTA 2018 Workshop on Technologies for MT of Low Resource Languages (LoResMT 2018), pp. 21-29 (2018).
This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. For a full description of the license, please visit https://creativecommons.org/licenses/by/4.0/legalcode.