A two-phase plagiarism detection system based on multi-layer long short-term memory networks

Received Aug 25, 2020; Revised May 15, 2021; Accepted May 26, 2021

Finding plagiarism strings between two given documents is the main task of the plagiarism detection problem. Traditional approaches based on string matching are not very useful in cases of semantically similar plagiarism. Deep learning approaches solve this problem by measuring the semantic similarity between pairs of sentences. However, these approaches still face the following challenges. First, they cannot handle cases where only part of a sentence belongs to a plagiarized passage. Second, measuring sentential similarity without considering the context of the surrounding sentences decreases accuracy. To solve these problems, this paper proposes a two-phase plagiarism detection system based on a multi-layer long short-term memory (LSTM) network model and feature extraction techniques: (i) a passage-phase to recognize plagiarism passages, and (ii) a word-phase to determine the exact plagiarism strings. Our experimental results on the PAN 2014 corpus reached a 94.26% F-measure, higher than existing research in this field.


INTRODUCTION
Plagiarism is defined as the reuse of another person's ideas, processes, results, or words without explicitly acknowledging the source [1]. Plagiarism detection is the task of automatically retrieving strings in a suspicious document that are reused from another document. Plagiarism methods are divided into two main types based on the plagiarist's behavior: literal plagiarism and intelligent plagiarism [2]. Literal plagiarism is a common and popular case in which plagiarists do not spend much time hiding the academic crime they committed; for example, they copy and paste text from the internet. Intelligent plagiarism is severe academic dishonesty wherein plagiarists try to deceive readers by presenting others' contributions as their own. Intelligent plagiarists try to hide, obfuscate, and change the original work in various ways, including text manipulation, translation, and idea adoption.
Over the past two decades, automatic plagiarism detection has received significant attention from the research community. Its two main tasks are source retrieval and text alignment. In the source retrieval task, given a suspicious document and a web search engine, the goal is to retrieve all source documents from which text has been reused. In the text alignment task, given a pair of documents (a suspicious document d and a source document d'), the goal is to identify contiguous maximal-length passages of reused text, i.e., all pairs (p_i^d, p_j^d') in which p_i^d is a string from d, p_j^d' is a string from d', and sim(p_i^d, p_j^d') ≥ θ, where sim(p_i^d, p_j^d') indicates the similarity between p_i^d and p_j^d', and θ is a threshold used to determine whether two strings are similar enough to be considered plagiarism.
The series of competition shared tasks for plagiarism detection named plagiarism analysis, authorship identification, and near-duplicate detection (PAN) has defined four types of plagiarism.
a. None obfuscation: Create plagiarism cases by copying a paragraph from the source document and inserting it into the suspicious one.
b. Random obfuscation: Create plagiarism cases by inserting, deleting, and changing the order of words from a paragraph of the source, and inserting it into the suspicious document.
c. Translation obfuscation: Create plagiarism cases by translating a paragraph more than once through several languages and back to the original language using different machine translation tools, and then inserting the translated paragraph into the suspicious document.
d. Summary obfuscation: Create plagiarism cases by summarizing the source paragraph and inserting it into the suspicious document.
This paper aims at solving plagiarism cases belonging to all four types above. Our proposed system's workflow is shown in Figure 1 and includes three steps.
− Pre-processing: This step splits the input documents into sentences, removes stopwords and special characters, and combines short sentences into one.
− Passage-phase: After the pre-processing step, we use a context window sliding over the source and suspicious documents to create candidate passages. We extract features from these passages and generate an input feature matrix corresponding to these features. This matrix is fed into a binary classifier of the candidate selection module to obtain pairs of plagiarism passages.
− Word-phase: The pairs of plagiarism passages are used as the input for the word-phase. The purpose of this phase is to determine the exact plagiarism strings from the input passages. A binary classifier at the word level is used to perform this task.

Pre-processing
The input documents are split into sentences using the sent_tokenize tool from the NLTK library. Then stopwords are removed from these sentences. Some specific cases can affect the accuracy of plagiarism selection:
− The input documents contain numbers that are written incorrectly, such as '8. 39' or '7 p. m'. In this case, the sentence splitter incorrectly segments the text into sentences at the dot ('.') character.
− After removing stopwords, some short sentences contain none or only one or two tokens. For example, the two sentences "Can you feel the burn?" and "Who we are?" are left with two words and no words, respectively, after cleaning stopwords and punctuation characters.
Since the similarities of short sentences do not carry much meaning, we combine short sentences with their surrounding sentences and compare the similarity between the resulting passages. Therefore, to deal with the problems mentioned above, we first apply the sentence splitter and then remove stopwords, numbers, and special characters from the sentences. After cleaning the text, sentences with fewer than three words are combined with the next sentence to create extended sentences. To the best of our knowledge, this combination step allows us to efficiently manage the passage length after pairing and avoid creating too-long passages. We then use a window of size w (sentences) sliding over both the suspicious and source documents. The optimal window size for the PAN datasets is three sentences.
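The pre-processing steps above can be sketched as follows. This is a minimal sketch: the real system uses NLTK's sent_tokenize and its English stopword list, while here a simple regex splitter and a tiny illustrative stopword set stand in for them.

```python
import re

# Tiny stand-in stopword set; the real system uses NLTK's English stopword list.
STOPWORDS = {"the", "a", "an", "is", "are", "can", "you", "we", "who", "of", "to", "and"}

def split_sentences(text):
    # Stand-in for nltk.sent_tokenize: split after '.', '?', or '!' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.?!])\s+", text) if s.strip()]

def clean(sentence):
    # Remove stopwords, numbers, and special characters, keeping only word tokens.
    tokens = re.findall(r"[A-Za-z]+", sentence.lower())
    return [t for t in tokens if t not in STOPWORDS]

def preprocess(text, min_len=3):
    """Split, clean, and merge short sentences (< min_len tokens) into the next one."""
    cleaned = [clean(s) for s in split_sentences(text)]
    merged, buffer = [], []
    for tokens in cleaned:
        buffer.extend(tokens)
        if len(buffer) >= min_len:
            merged.append(buffer)
            buffer = []
    if buffer:                      # a trailing short sentence joins the last one
        if merged:
            merged[-1].extend(buffer)
        else:
            merged.append(buffer)
    return merged

def sliding_passages(sentences, w=3):
    # Candidate passages: a window of w consecutive (extended) sentences.
    return [sentences[i:i + w] for i in range(len(sentences) - w + 1)]
```

With w = 3, each candidate passage covers three consecutive extended sentences, matching the optimal window size reported for the PAN datasets.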

Passage-phase
The input of this phase is candidate plagiarism passages, each consisting of three consecutive sentences from the suspicious or source documents. In this phase, each passage is encoded as a semantic embedding vector, and the semantic similarity between two passages is calculated based on the distance between these vectors. We use SBERT to encode passages, since it is shown in [14] that SBERT outperforms other methods (e.g., Word2Vec [15], GloVe [16], fastText [17], InferSent [18], and Universal Sentence Encoder [19]) in various domains. Features representing each passage are derived from these passage vectors. They are then used as inputs for the binary classifier at the passage level to detect whether two passages are similar or not.

Passage-phase feature extraction
Given the set of all candidate passages in the suspicious document U = (u1, u2, …, un) and the set of all candidate passages in the source document V = (v1, v2, …, vm), where each passage ui and vj is represented as a passage embedding vector, we propose the following features for this phase:
− Maximize passage similarity: This feature determines the maximum similarity of a passage vector ui against a set of passage vectors V. Let sim(ui, vj) be the similarity between two passage vectors ui ∈ U and vj ∈ V, and let psim(ui, V) be the maximum passage similarity of the passage vector ui against the set of passage vectors V. It is calculated as:

psim(ui, V) = max{sim(ui, vj) : vj ∈ V} (2)

The maximize passage similarity feature vector of all passage vectors in the pair of suspicious and source documents is determined by (3):

psim(U, V) = (psim(u1, V), …, psim(un, V), psim(v1, U), …, psim(vm, U)) (3)
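The maximize passage similarity feature of (2) and (3) can be sketched with NumPy, assuming the passage embedding vectors have already been produced by SBERT (here the matrices U and V simply hold one embedding per row):

```python
import numpy as np

def cosine_sim_matrix(U, V):
    """Pairwise cosine similarity between rows of U (n x d) and rows of V (m x d)."""
    Un = U / np.linalg.norm(U, axis=1, keepdims=True)
    Vn = V / np.linalg.norm(V, axis=1, keepdims=True)
    return Un @ Vn.T

def max_passage_similarity(U, V):
    """Feature vector of eq. (3): max similarity of each suspicious passage
    against V, followed by max similarity of each source passage against U."""
    S = cosine_sim_matrix(U, V)             # S[i, j] = sim(u_i, v_j)
    return np.concatenate([S.max(axis=1),   # psim(u_i, V) for each u_i, eq. (2)
                           S.max(axis=0)])  # psim(v_j, U) for each v_j
```

The resulting vector has length n + m, one entry per passage on either side.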

− Maximize passage intersection: To determine the maximum intersection value of a passage ui with a set of passages V, we split passages into words, find the intersection words of each passage pair (ui, vj), with ui ∈ U and vj ∈ V, and take the maximum length of this intersection. This value is calculated as in (4):

pinter(ui, V) = max{|words(ui) ∩ words(vj)| : vj ∈ V} (4)

The maximize passage intersection feature vector of all passages in the pair of suspicious and source documents is determined by (5):

pinter(U, V) = (pinter(u1, V), …, pinter(un, V), pinter(v1, U), …, pinter(vm, U)) (5)

− Passage importance: Term frequency-inverse document frequency (TF-IDF) is the most widely used and one of the most appropriate term weighting schemes. TF-IDF is employed to remove terms with lower weights from documents and helps to increase retrieval effectiveness. It is a numerical statistic that tells us how important a word is to a document in a collection or corpus, and it is mostly used as a weighting factor in various information retrieval and text mining processes. To determine similar passages, we put forward the idea of term frequency-inverse sentence frequency (TF-ISF) [20]. We treat each passage as a document and each document as a corpus, then calculate the values of TF(w, U), TF(ui, U), and ISF(ui, U), in which w is a term in a passage ui and U is the document containing ui. Given |ui| as the total number of words in the passage ui, TF(ui, U) is computed as:

TF(ui, U) = (1/|ui|) Σ_{w ∈ ui} TF(w, U) (6)

ISF(ui, U) is computed by (7):

ISF(ui, U) = Σ_{w ∈ ui} log(n / |{uk ∈ U : w ∈ uk}|) (7)

The passage importance of the passage ui in the document U is determined by (8):

pimp(ui, U) = TF(ui, U) × ISF(ui, U) (8)

The passage importance feature vector of all passages in the pair of suspicious and source documents is determined by (9):

pimp(U, V) = (pimp(u1, U), …, pimp(un, U), pimp(v1, V), …, pimp(vm, V)) (9)

− The feature matrix for the passage-phase: After extracting the three feature vectors psim(U, V), pinter(U, V), and pimp(U, V), we combine them into a two-dimensional matrix of size (n+m) × 3, where n+m is the total number of passages from the suspicious and source documents. The feature matrix for all passages in the pair of suspicious and source documents is determined as in (10):

fpassage(U, V) = [psim(U, V); pinter(U, V); pimp(U, V)] (10)
It is used as the input for the multi-layer LSTM network model, described in section 2.2.2.
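The two remaining features and the stacking into the (n+m) × 3 matrix can be sketched as follows. The TF-ISF computation here follows the standard form (term frequency within the passage times log inverse sentence frequency over the document), which is our reading of equations (6)-(8); passages are represented as plain word lists, and the psim vector is taken as given.

```python
import math
import numpy as np

def max_passage_intersection(U_words, V_words):
    """Eqs. (4)-(5): for each passage (a word list), the maximum number of
    shared words with any passage on the other side."""
    feats_u = [max(len(set(u) & set(v)) for v in V_words) for u in U_words]
    feats_v = [max(len(set(v) & set(u)) for u in U_words) for v in V_words]
    return np.array(feats_u + feats_v, dtype=float)

def passage_importance(passages):
    """TF-ISF passage importance over one document (a list of word-list
    passages); standard TF-ISF form, our reading of eqs. (6)-(8)."""
    n = len(passages)
    sf = {}                                   # number of passages containing each word
    for p in passages:
        for w in set(p):
            sf[w] = sf.get(w, 0) + 1
    scores = []
    for p in passages:
        tf = {w: p.count(w) / len(p) for w in set(p)}
        scores.append(sum(tf[w] * math.log(n / sf[w]) for w in set(p)))
    return np.array(scores)

def feature_matrix(psim, pinter, pimp):
    """Eq. (10): stack the three feature vectors into an (n+m) x 3 matrix."""
    return np.stack([psim, pinter, pimp], axis=1)
```

The matrix rows line up with the concatenated suspicious-then-source passage order used by all three feature vectors.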

Plagiarism passage selection
We build our binary classifier using a multi-layer LSTM network model, which predicts the probability that a pair of passages from the suspicious and source documents is a plagiarism pair. Figure 2 shows the structure of our model at the passage-phase. At this phase, we generate the input vectors by reshaping the feature matrix fpassage into a three-dimensional matrix of batch_size, time_steps, and seq_len and feed them into the model. The parameters used in the LSTM model are: (i) batch_size equals the number of passages; (ii) time_steps equals 1; (iii) seq_len equals the number of features (seq_len = 3). The output of the sigmoid activation function is always in the range (0, 1). This function is applied to the output of all units in the last hidden LSTM layer. Let y = (y1, y2, …, y_{n+m}) be the output of the binary classification model (0 < yi < 1), where n+m is the number of passages in the pair of suspicious and source documents. Figure 3 shows that the output of the model is a vector of 0s and 1s, in which the value is 1 for all yi higher than a threshold θ and 0 for the rest. Plagiarism passages are generated by selecting sentences corresponding to the longest runs of 1s in the output of the model. When observing and analyzing the plagiarism passages obtained, we found that most plagiarism passages contain entire sentences. However, a plagiarism paragraph may contain several redundant words at its two ends, as in the following example from the PAN 2014 corpus, in which the underlined text is inside the plagiarism paragraph and the rest is redundant. The suspicious plagiarism paragraph: The capsule was designed for entry into the Martian atmosphere, descent to the surface, impact survival, and surface lifetimes of as much as six months and contained the power, guidance, control, communications, and data handling systems necessary to complete its mission.
is perhaps the most productive space probe yet deployed, visiting four planets and their moons, including two primary visits to previously unexplored planets, with powerful cameras and a multitude of scientific instruments, at a fraction of the money later spent on specialized probes such as the and the probe. Along with, and Voyager 2 is an .Voyager 2 Galileo spacecraft Cassini-Huygens [2] [3] Pioneer 10 Pioneer 11 Voyager 1 New Horizons interstellar probe resident per year, or roughly half the cost of one candy bar each year since project inception.

The source plagiarism paragraph:
Voyager 2 unmanned interplanetary space probe Voyager program Voyager 1 Voyager 2 ecliptic Solar System Uranus Neptune gravity assist Saturn Voyager 2 Titan Planetary Grand Tour [1] is perhaps the most productive space probe yet deployed, visiting four planets and their moons, including two primary visits to previously unexplored planets, with powerful cameras and a multitude of scientific instruments, at a fraction of the money later spent on specialized probes such as the and the probe. Along with, , and Voyager 2 is an .Voyager 2 Galileo spacecraft Cassini-Huygens [2] [3] Pioneer 10 Pioneer 11 Voyager 1 New Horizons interstellar probe Contents Titan 3E Centaur was originally planned to be, part of the.
To solve this problem, we extend each pair of plagiarism passages from the suspicious and source documents by adding k sentences to the left and right of both passages. The extended passages are used as the input for the word-phase to find the exact plagiarism strings, which is done by removing redundant text from the extended plagiarism passages. The word-phase is introduced next.
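The selection-and-extension step described above can be sketched as follows; the LSTM outputs yi are taken as given, and θ and k are the threshold and extension width just introduced.

```python
def longest_run_of_ones(bits):
    """Return (start, end) indices of the longest contiguous run of 1s, or None."""
    best, cur_start, best_span = 0, None, None
    for i, b in enumerate(bits + [0]):        # sentinel 0 flushes the final run
        if b == 1 and cur_start is None:
            cur_start = i
        elif b != 1 and cur_start is not None:
            if i - cur_start > best:
                best, best_span = i - cur_start, (cur_start, i - 1)
            cur_start = None
    return best_span

def select_and_extend(y, theta, k, num_sentences):
    """Threshold the model outputs, pick the longest run of 1s as the plagiarism
    passage, and extend it by k sentences on each side (clipped to the document)."""
    bits = [1 if yi > theta else 0 for yi in y]
    span = longest_run_of_ones(bits)
    if span is None:
        return None                            # no passage exceeded the threshold
    start, end = span
    return max(0, start - k), min(num_sentences - 1, end + k)
```

The returned (start, end) sentence span is what the word-phase then trims back down to the exact plagiarism string.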

Word-phase
To remove the redundant text at the two ends of the extended plagiarism passages, we need to identify semantically related segments based on consecutive words of high similarity. To capture the meaning of a word, we place that word in a window of size 3, with one word on its left and one word on its right. The text inside this window is used as the input of SBERT to create word feature vectors.
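Building these word contexts can be sketched as follows; the SBERT encoding call itself is omitted, and only the window construction is shown, with windows clipped at the passage boundaries.

```python
def word_contexts(words, radius=1):
    """For each word, build the text inside a window of size 2*radius + 1
    (one word on the left and one on the right when radius=1)."""
    contexts = []
    for i in range(len(words)):
        lo, hi = max(0, i - radius), min(len(words), i + radius + 1)
        contexts.append(" ".join(words[lo:hi]))
    return contexts
```

Each context string is then fed to SBERT in place of the bare word, so the resulting word vector reflects its immediate neighbors.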

Word-level feature extraction
In this phase, three features are proposed based on the cosine similarity between a word and the sentence containing that word. The word similarity feature is a vector containing the maximum similarity value of each word: the maximum similarity of a word in the suspicious passage is the maximum similarity of that word with each word in the source passage, and vice versa. The average word similarity and sentence based similarity features are used to handle cases where the similarity value of a word differs greatly from that of the surrounding words. The average word similarity feature is a vector in which each item is the average of the word similarity values within the sentence. The sentence based similarity feature is a vector in which each item is the maximum of the sentence similarities of the sentence containing that word. The word-phase features are detailed as follows. Given the extended suspicious passage P = (p1, p2, …, pn) and the extended source passage Q = (q1, q2, …, qm), with each word pi and qj represented by a word embedding vector:
− Word similarity: Let sim(pi, qj) be the cosine similarity between two word vectors pi and qj. The word similarity feature between P and Q is a vector computed as (11):

wsim(P, Q) = (wsim(p1), …, wsim(pn), wsim(q1), …, wsim(qm)), where wsim(pi) = max{sim(pi, qj) : qj ∈ Q} and wsim(qj) = max{sim(qj, pi) : pi ∈ P} (11)

− Average word similarity: Given wi (with i = 1 ÷ n+m) as the i-th word in the pair of suspicious and source passages, d as the sentence with wi ∈ d, and |d| as the total number of words in the sentence d, let avg(wi) be the average similarity of the word wi in the sentence d, and wsim(i) the value of the i-th item in the word similarity feature vector. Then avg(wi) is computed as:

avg(wi) = (1/|d|) Σ_{wk ∈ d} wsim(k) (12)

The average word similarity feature between two passages P and Q is a vector determined by the following formula:

wavg(P, Q) = (avg(p1), avg(p2), …, avg(pn), avg(q1), avg(q2), …, avg(qm)) (13)

− Sentence based similarity: We reuse the maximize passage similarity feature (as described in the passage-phase), with the passage now being a sentence. Given the sets of sentences U = (u1, u2, …, uk) and V = (v1, v2, …, vs) in the suspicious and source passages, respectively, let sim_sent(pi) be the sentence based similarity of a word pi in the sentence uj. It is computed as:

sim_sent(pi) = max{sim(uj, vt) : vt ∈ V}, where uj is the sentence containing pi (14)

The sentence based similarity feature between two passages P and Q is a vector determined by the following formula:

wsent(P, Q) = (sim_sent(p1), …, sim_sent(pn), sim_sent(q1), …, sim_sent(qm)) (15)
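The averaging step of (12)-(13) can be sketched as follows, taking the word similarity vector and the sentence segmentation (as start/end word-index spans) as given:

```python
def average_word_similarity(wsim, sentence_spans):
    """Eqs. (12)-(13): each word's feature is the mean word-similarity value
    over the sentence containing it; sentence_spans lists (start, end) word
    indices with end exclusive."""
    wavg = [0.0] * len(wsim)
    for start, end in sentence_spans:
        mean = sum(wsim[start:end]) / (end - start)
        for i in range(start, end):
            wavg[i] = mean
    return wavg
```

Because every word in a sentence receives the same mean, an isolated high or low word-similarity value is smoothed out, which is exactly what this feature is meant to achieve.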
The feature matrix of all the extended plagiarism passages is determined by (16):

fword(P, Q) = [wsim(P, Q); wavg(P, Q); wsent(P, Q)] (16)

This feature matrix is used as the input for the multi-layer LSTM model, described in section 2.3.2.

Plagiarism string selection
In this section, we conduct two processing steps: (i) select plagiarism sentences and (ii) remove redundant text. The details of each step are described as follows.
− Select plagiarism sentences: To select the exact plagiarism sentences from the extended plagiarism passages, we use a multi-layer LSTM model whose input is taken from the feature matrix fword, as shown in Figure 4. The parameters used in this model are: (i) batch_size equals the number of words; (ii) time_steps equals 1; (iii) seq_len equals the number of features (seq_len = 3).
In Figure 4, pi and qj denote the i-th and j-th words in the pair of extended plagiarism passages, y_pred = (y1, y2, …, y_{n+m}) is the output of the binary classification model (0 < yi < 1), and n+m is the total number of words in the pair of these passages. The predicted mean value of a sentence u is computed as in (17):

y_pred_sent(u) = (1/|u|) Σ_{wi ∈ u} y_pred(wi) (17)

where wi is a word in the sentence u and |u| is the number of words in u.
After computing y_pred_sent for all sentences, we create a vector whose size corresponds to the total number of sentences in the pair of plagiarism passages. If the y_pred_sent value of a sentence is higher than a threshold β, the value corresponding to that sentence is 1; otherwise, it is 0. We select the longest strings with the value of 1 as the plagiarism sentences.
− Remove redundant text: To obtain the exact plagiarism strings, we consider the leftmost and the rightmost plagiarism sentences whose difference between max_threshold and min_threshold is higher than t1 (t1 = 0.4). The max_threshold and min_threshold of a sentence u are determined by (18) and (19):

max_threshold(u) = max{y_pred(wi) : wi ∈ u} (18)
min_threshold(u) = min{y_pred(wi) : wi ∈ u} (19)

where wi is a word in the sentence u. These sentences have one part inside and the remaining part outside the plagiarism passage. The outside part is on the left (orient = 1) if the sentence is on the left of the plagiarism sentences, or on the right (orient = 2) if the sentence is on the right of the plagiarism sentences. If the result of the previous step contains only one sentence, the outside parts belong to the two ends (orient = 3) of the sentence. Analyzing the output vector y_pred of the LSTM model, we observe that the predicted value yi corresponding to the inside words is much higher than the predicted value yj corresponding to the outside ones.
Algorithm 1 is used to cut off the redundant text from these sentences. The idea of this algorithm is: given a threshold α, find the longest text in the leftmost and rightmost sentences all of whose words have a predicted value y_pred < α. We define the left and right positions as the first and last words of the exact plagiarism strings, respectively. The algorithm receives the following parameters as inputs:
− y_d: the predicted vector of the sentence, y_d = (y_d1, y_d2, …, y_dt), with t the number of words in the sentence.
− a binary array over the sentences, whose element value is 1 if the y_pred_sent of the corresponding sentence is higher than β, and 0 otherwise. We then select a continuous string with the highest predicted value.
Table 2 shows the accuracy and loss values in the LSTM training phase with the four datasets in PAN 2013.
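Our reading of Algorithm 1's trimming step can be sketched as follows: in a boundary sentence, drop the prefix and/or suffix (depending on orient) whose words all have predicted values below α.

```python
def trim_redundant(y_d, alpha, orient):
    """Cut off redundant words from a boundary sentence.
    y_d: predicted values of the sentence's words; orient: 1 = redundant part
    on the left, 2 = on the right, 3 = both ends. Returns the (first, last)
    word indices kept, or None if every word is below alpha."""
    first, last = 0, len(y_d) - 1
    if orient in (1, 3):                       # strip low-score prefix
        while first <= last and y_d[first] < alpha:
            first += 1
    if orient in (2, 3):                       # strip low-score suffix
        while last >= first and y_d[last] < alpha:
            last -= 1
    return (first, last) if first <= last else None
```

This mirrors the observation above: inside words score well above the outside ones, so a single threshold α separates the kept text from the redundant ends.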
To evaluate the effectiveness of our proposed features, we carried out experiments using each feature instead of all features, with pairs of documents from the PAN 2014 test corpus as input. Figure 5 shows the effect of these features at the word-phase on the system output. The three pairs of Figures 5(a) to 5(f) show the prediction results of y_pred and the final results using 1, 2, and 3 features, respectively. In these figures, the blue line shows the predicted result; the red line shows the average predicted value by sentences. The green line separates the suspicious and source passages; the black line shows the range of the selected plagiarism passages. The evaluation results show that all the proposed features are useful, handling both literal plagiarism and intelligent plagiarism well. Table 3 compares our system's performance with existing research on this task using PAN 2014 as the test set. It shows that our system achieves a remarkable improvement over other research and can detect most plagiarism cases. The results suggest that our proposed feature extraction techniques combined with our LSTM models are a promising solution for detecting intelligent plagiarism, in which the same content can be expressed in different ways and with different words, using a small training corpus. When analyzing our system's output, we found that most of the incorrect results are due to the following situations:
− Sentential redundancy: This situation occurs when a sentence near the plagiarism passage is semantically related to it. In that case, the system often includes it in the plagiarism passage.
− Word missing or redundancy: This situation occurs when only a part of a sentence is in the plagiarism passage. The pre-processing step has removed stopwords from the input documents.
Therefore, when restoring the original text from the output of the word-phase, we need to recover these stopwords from the original documents. Redundant or missing stopwords may occur at the beginning and the end of the recovered passage. These problems will be considered in our future work.

CONCLUSION
This paper has proposed an approach using feature extraction techniques and a two-phase plagiarism detection system based on multi-layer LSTM networks to determine plagiarism strings between two documents. The key to the paper's success is selecting appropriate features for both word-matching and semantic-based plagiarism. Besides, inheriting research results on measuring the similarity between two sentences is also an essential factor in distinguishing sentences inside plagiarism passages from outside sentences. The proposed method was evaluated using the PAN 2014 text alignment corpus and widely accepted evaluation metrics: precision, recall, and plagdet. The solution achieves the best recall and plagdet, and the second-best precision, compared with state-of-the-art systems. In our future work, we plan to find a method to automatically choose optimal parameters for our system. We will also investigate methods to solve the redundancy problem in the system's output, mentioned in section 3.2.