A deep learning based technique for plagiarism detection: a comparative study

ABSTRACT


INTRODUCTION
The advancement of information technology (IT), and particularly of the Web, has impressively expanded the accessibility of data and has thus led to a rise in plagiarism. Plagiarism is the practice of taking someone else's work or ideas and passing them off as one's own. Several plagiarism techniques are employed by dishonest authors; some of them are listed below [1], [2]:

RESEARCH METHOD
In this section we describe the different techniques used by plagiarism detection approaches, both for representing texts and for computing similarity.

a. Neural network based models

Word embeddings are a type of word representation that stores contextual information in a low-dimensional vector. This approach gained extreme popularity with the introduction of Word2Vec in 2013, a family of models that learn word embeddings in a computationally efficient way. Doc2Vec can be seen as an extension of Word2Vec whose goal is to create a representational vector for a document or paragraph.

Word2vec: a neural-network-based model used to produce distributed representations of words. Some researchers argue that it is not a deep learning technique, because it relies on a simple two-layer architecture: the model is a shallow, two-layer neural network trained to reconstruct the linguistic contexts of words. Word2vec takes a large corpus of text as input and produces a vector space, typically of several hundred dimensions, with one vector for each unique word in the corpus [19].

Doc2vec: an unsupervised algorithm that generates vector representations of sentences, paragraphs, and documents [20]. The model is based on Word2Vec, with only one additional vector (the paragraph ID) added to the input. The architecture of the Doc2Vec model is shown in Figure 1. Instead of using only nearby words to predict a word, a document-unique feature vector is also added.
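A minimal sketch of both models with the gensim library (version 4 or later) is shown below; the toy corpus, vector sizes, and epoch counts are illustrative choices, not those of the cited works:

```python
from gensim.models import Word2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus: each document is a list of tokens (illustrative only).
corpus = [
    ["plagiarism", "is", "the", "reuse", "of", "someone", "else", "text"],
    ["word", "embeddings", "store", "contextual", "information", "in", "vectors"],
]

# Word2Vec: a shallow two-layer network that learns one vector per word.
w2v = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1, epochs=50)
print(w2v.wv["plagiarism"][:5])        # first components of a word vector

# Doc2Vec: same principle, plus a paragraph-ID vector trained jointly with the words.
tagged = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(corpus)]
d2v = Doc2Vec(documents=tagged, vector_size=100, window=5, min_count=1, epochs=50)
print(d2v.dv[0][:5])                   # vector of the first document
```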
b. Deep learning based models

Deep learning is a set of learning methods that attempt to model data with complex architectures combining different non-linear transformations. The elementary building blocks of deep learning are neural networks, which are combined to form deep neural networks. Several types of architectures exist:

Recursive neural networks (RNN): have been successful, for instance, in learning sequence and tree structures in natural language processing, mainly continuous representations of phrases and sentences based on word embeddings [21].

Siamese LSTM for learning document similarity: LSTM is a kind of recurrent neural network that is well suited to processing an entire sequence of words or sentences, because recurrent networks can model and remember the relationships between different words and sentences. Manhattan LSTM models have two networks, LSTMleft and LSTMright, each of which processes one of the sentences of a given pair independently. The Siamese LSTM is a version of the Manhattan LSTM in which LSTMleft and LSTMright share tied weights, such that LSTMleft = LSTMright. Such a model is useful for tasks like duplicate query detection and query ranking. Here, the duplicate detection task is performed to decide whether two documents are similar or not. A similar model can be trained for query ranking, using hit data for a given query and its matching results as a proxy for similarity [21].

Convolutional neural network (CNN): a class of deep, feed-forward artificial neural networks that uses a variation of multilayer perceptrons designed to require minimal preprocessing. CNNs are inspired by the animal visual cortex. They are generally used in computer vision; however, they have recently been applied to various NLP tasks such as text classification [21].

Deep Structured Semantic Model (DSSM): DSSM stands for Deep Structured Semantic Model or, more generally, Deep Semantic Similarity Model. It is a deep neural network (DNN) modelling technique for representing text strings (sentences, queries, predicates, entity mentions, etc.) in a continuous semantic space and for modelling the semantic similarity between two text strings.

c. Other models

Other methods used to construct a vector representation of a given text include:

GloVe: an unsupervised learning algorithm for obtaining vector representations of words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space [22].

InferSent: a sentence embedding method that provides semantic representations of English sentences. It is trained on natural language inference data and generalizes well to many different tasks [22].

d. Similarity methods

Finding the similarity between elements is the core of sentence similarity. In the literature there are many metrics for calculating similarity; this section presents the main approaches used to calculate the similarity between elements.

Cosine similarity: a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them. The cosine of 0° is 1, and it is less than 1 for any angle in the interval (0, π] radians [23].

Jaccard index: also known as Intersection over Union and as the Jaccard similarity coefficient (originally given the French name coefficient de communauté by Paul Jaccard), is a statistic used for gauging the similarity and diversity of sample sets. The Jaccard coefficient measures similarity between finite sample sets and is defined as the size of the intersection divided by the size of the union of the sample sets [23].

Euclidean distance: when data is dense or continuous, this is often the best proximity measure. The Euclidean distance between two points is the length of the path connecting them and is obtained with the Pythagorean theorem [23].

Longest common subsequence (LCS): consists of finding the longest subsequence common to all sequences in a set of sequences. The LCS problem is a classic computer science problem, is the basis of data comparison programs such as the diff utility, and has applications in computational linguistics and bioinformatics [24].

Word Mover's Distance (WMD): uses word embeddings to calculate similarities; more precisely, it uses a normalized bag-of-words representation and word embeddings to calculate the distance between documents [25].
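As an illustration, the classic metrics above can be written in a few lines of plain Python/NumPy (a minimal sketch; real systems would normally rely on library implementations):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def jaccard_index(a, b):
    """Size of the intersection divided by the size of the union of two token sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 1.0

def euclidean_distance(u, v):
    """Length of the straight path between two points (Pythagorean theorem)."""
    return float(np.linalg.norm(np.asarray(u) - np.asarray(v)))

def lcs_length(x, y):
    """Dynamic-programming length of the longest common subsequence of x and y."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if xi == yj else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]
```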

RELATED WORK
Our study focuses on the detection of semantic plagiarism, more precisely on the identification of plagiarism of ideas between two given texts. As illustrated below, we examined methods that detect this type of plagiarism.

The authors of [26] proposed a plagiarism detection system that relies on sentence comparison in two phases. They first extract word vectors with the word2vec algorithm and remove Persian stop words during text pre-processing. Then, for each sentence, the average of all its word vectors is computed. After feature extraction, in phase 1, each sentence of a suspicious document is compared with all the sentences of the source documents, using cosine similarity as the comparison metric. After this step, which helps to find the nearest sentences in real time, in phase 2 the lexical similarity of two sentences is evaluated with the Jaccard similarity measure. Two sentences that pass the Jaccard similarity threshold are considered plagiarism in the final step.

The authors of [27] proposed using the word2vec model to compute a feature vector for every word. They chose documents from the corpus itself; the documents used for testing were preprocessed only by stop-word removal. The similarity between vectors was computed using cosine similarity.

The aim of the approach in [24] is to evaluate the validity of using distributed representations to define word similarity. The authors introduce three methods based on the following three document similarities for two documents: the length of the longest common subsequence (LCS) divided by the length of the shorter document, the local maximal value of the LCS length, and the local maximal value of the weighted LCS length. The distributed representation was obtained with word2vec from no particular data.
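A minimal sketch in the spirit of the two-phase pipeline of [26] is given below; it assumes a trained gensim Word2Vec model and tokenized, stop-word-filtered sentences, and the thresholds are placeholders rather than the values used by the authors:

```python
import numpy as np

def sentence_vector(tokens, w2v):
    """Average of the word2vec vectors of the tokens (after stop-word removal)."""
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

def detect_plagiarism(suspicious_sents, source_sents, w2v, cos_thr=0.8, jac_thr=0.5):
    """Phase 1: cosine pre-filter on averaged vectors; phase 2: Jaccard verification."""
    hits = []
    for i, s in enumerate(suspicious_sents):
        vs = sentence_vector(s, w2v)
        for j, t in enumerate(source_sents):
            vt = sentence_vector(t, w2v)
            cos = float(np.dot(vs, vt) / (np.linalg.norm(vs) * np.linalg.norm(vt) + 1e-9))
            if cos >= cos_thr:                                   # phase 1: semantic filter
                jac = len(set(s) & set(t)) / len(set(s) | set(t))
                if jac >= jac_thr:                               # phase 2: lexical check
                    hits.append((i, j, cos, jac))
    return hits
```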
Another approach uses the principle of the Deep Structured Semantic Model (DSSM) proposed by [28]. DSSM is a deep-learning-based technique proposed for the semantic understanding of textual data. It maps short textual strings, such as sentences, to feature vectors in a low-dimensional semantic space. The vector representations are then used for document retrieval by comparing the similarity between documents and queries. After obtaining the semantic feature vectors for each pair of text snippets, cosine similarity is used to measure the semantic similarity of the pair. Similarly to the previous methods, in [29] documents or texts are represented as vectors using the document-to-vector technique (doc2vec), and plagiarism is detected by a simple comparison between all the sentences of each pair of analysed documents.
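A hedged sketch of this kind of doc2vec-based, all-pairs sentence comparison follows (it assumes a trained gensim Doc2Vec model and tokenized sentences; the function name and threshold are illustrative, not taken from [29]):

```python
import numpy as np
from gensim.models.doc2vec import Doc2Vec

def flag_similar_sentences(model: Doc2Vec, sents_a, sents_b, threshold=0.9):
    """Compare every sentence of document A with every sentence of document B."""
    flagged = []
    for i, sa in enumerate(sents_a):
        va = model.infer_vector(sa)                 # vector for one tokenized sentence
        for j, sb in enumerate(sents_b):
            vb = model.infer_vector(sb)
            cos = float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))
            if cos >= threshold:                    # near-parallel vectors -> suspicious pair
                flagged.append((i, j, cos))
    return flagged
```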
The approach proposed in [30] is based on converting a paragraph into a vector and is inspired by the methods for learning word vectors. The inspiration is that word vectors are asked to contribute to a prediction task about the next word in the sentence: even though the word vectors are initialized randomly, they eventually capture semantics as an indirect result of the prediction task. The same idea is applied to paragraph vectors in a similar manner: the paragraph vectors are also asked to contribute to the prediction task of the next word, given many contexts sampled from the paragraph.
These approaches [29], [30] perform similarity detection between document vectors, and both use cosine similarity to compare the vectors. In [31], each word w is represented by a vector; these word vectors are constructed with GloVe. The approach uses a recursive neural network to obtain a vector representation of a sentence and uses cosine similarity to calculate the similarity. In [32], two input sentences are processed in parallel by identical neural networks that output sentence representations. The sentence representations are compared by a structured similarity measurement layer, and the similarity features are then passed to a fully connected layer that computes the similarity score. Cosine distance measures the distance between two vectors according to the angle between them; using cosine similarity alone to detect similarity between sentences remains a solution that carries many risks. InferSent [22] is an NLP technique for universal sentence representation developed by Facebook that uses supervised training to produce highly transferable representations. The authors used a bi-directional LSTM with attention that consistently surpassed many unsupervised training methods, such as SkipThought vectors. They also provide a PyTorch implementation that can be used to generate sentence embeddings. This approach still needs a similarity measure to compare two vectors, and for that purpose cosine similarity is used.
The authors in [33] used word embeddings, vector representations of terms computed from unlabelled data, which represent terms in a semantic space in which the proximity of vectors can be interpreted as semantic similarity. They propose to go from word-level to text-level semantics by combining insights from methods based on external sources of semantic knowledge with word embeddings. They derive multiple types of meta-features from the comparison of the word vectors of short text pairs and from the vector means of their respective word embeddings. The features representing labelled short text pairs are used to train a supervised learning algorithm. The authors of [25] present the Word Mover's Distance (WMD), a novel distance function between text documents. This work is based on recent results in word embedding that learn semantically meaningful representations for words from local co-occurrences in sentences. The WMD measures the dissimilarity between two text documents as the minimum amount of distance that the embedded words of one document need to "travel" to reach the embedded words of the other document. The article [34] proposed an innovative word-embedding-based system devoted to calculating the semantic similarity of Arabic sentences. The main idea is to exploit vectors as word representations in a multidimensional space in order to capture the semantic and syntactic properties of words. IDF weighting and part-of-speech tagging are applied to the examined sentences to support the identification of words that are highly descriptive in each sentence. In [35], the authors address the issue of finding an effective vector representation for a very short text fragment, where effective means that the representation should grasp most of the semantic information in the fragment. For this, they use semantic word embeddings to represent individual words and learn how to weigh every word in the text through tf-idf (term frequency-inverse document frequency) information to arrive at an overall representation of the fragment; comparing two tf-idf vectors is done with standard cosine similarity. The paper [36] investigates the effectiveness of several such naive techniques, as well as traditional tf-idf similarity, for fragments of different lengths. Its main contribution is a first step towards a hybrid method that combines the strength of dense distributed representations, as opposed to sparse term matching, with the strength of tf-idf-based methods, to automatically reduce the impact of less informative terms. This approach outperforms the existing techniques in a toy experimental set-up, leading to the conclusion that the combination of word embeddings and tf-idf information might lead to a better model of the semantic content of very short text fragments. Between two such representations the cosine similarity is then calculated.
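For illustration, WMD is available directly in gensim once a set of word vectors is loaded; the sketch below uses an arbitrary small pre-trained GloVe model and toy sentences, and assumes an optimal-transport backend (POT or pyemd, depending on the gensim version) is installed:

```python
import gensim.downloader as api

# Any pre-trained KeyedVectors will do; this small GloVe model is just an example.
wv = api.load("glove-wiki-gigaword-50")

doc1 = "the student copied the article".split()
doc2 = "the pupil plagiarised the paper".split()

distance = wv.wmdistance(doc1, doc2)   # smaller distance = more similar documents
print(f"WMD = {distance:.4f}")
```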
In the architecture proposed in [37], word embeddings are first trained on API documents, tutorials, and reference documents, and then aggregated in order to estimate semantic similarities between documents, where the similarity between vectors is usually defined as cosine similarity. In [38], the authors propose to combine explicit semantic analysis (ESA) representations and word2vec representations as a way to generate denser representations and, consequently, a better similarity measure between short texts. In [39], a semantic similarity approach for paraphrase identification in Arabic texts is proposed that combines different natural language processing (NLP) techniques, such as term frequency-inverse document frequency (TF-IDF). The goal is to represent each word as a vector using word2vec, to generate a sentence vector representation, and then to apply a similarity measurement based on different comparison metrics, such as cosine similarity and Euclidean distance. This approach was evaluated on the Open Source Arabic Corpus (OSAC) and obtained a promising rate.
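As an illustration of tf-idf-weighted word-embedding sentence representations of the kind used in [35] and [39], here is a minimal sketch (it assumes a gensim KeyedVectors object wv whose dimensionality equals dim; the tokenization and weighting details are simplifications, not the authors' exact procedures):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_weighted_sentence_vectors(sentences, wv, dim):
    """Represent each sentence as the tf-idf-weighted mean of its word vectors."""
    tfidf = TfidfVectorizer()
    matrix = tfidf.fit_transform(sentences)        # one tf-idf row per sentence
    vocab = tfidf.vocabulary_
    vectors = []
    for sent, row in zip(sentences, matrix):
        weights = row.toarray().ravel()
        vec, total = np.zeros(dim), 0.0
        for word in sent.lower().split():
            if word in vocab and word in wv:
                w = weights[vocab[word]]            # descriptive words get higher weight
                vec += w * wv[word]
                total += w
        vectors.append(vec / total if total else vec)
    return vectors

def cosine(u, v):
    """Cosine similarity between two sentence vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))
```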
The paper [40] proposes a novel deep-neural-network-based approach that relies on coarse-grained sentence modelling using a convolutional neural network (CNN) and a long short-term memory (LSTM) model, combined with a specific fine-grained word-level similarity matching model. In this component, every sentence is represented with a joint CNN and LSTM architecture: the CNN learns local features from words to phrases in the text, while the LSTM learns the long-term dependencies of the text. More specifically, the word embeddings are first fed to the CNN model, in which various types of convolutions and pooling techniques are applied to capture the maximum information from the text; next, the encoded features are used as input to the LSTM network; finally, the long-term dependencies learned by the LSTM become the semantic sentence representation. The approach in [41] proposes to explicitly model pairwise word interactions and presents a novel similarity focus mechanism to identify important correspondences for better similarity measurement. The authors use GloVe word embeddings for the vector representation of words, and their model contains four major components: 1. Bidirectional Long Short-Term Memory networks (Bi-LSTMs) are used for context modelling of the input sentences. 2. A novel pairwise word interaction modelling technique encourages direct comparisons between word contexts across sentences; cosine distance (cos) measures the distance between two vectors by the angle between them, while L2 Euclidean distance (L2Euclid) and dot-product distance (DotProduct) measure magnitude differences, and the three similarity functions are used together for richer measurement. 3. A novel similarity focus layer helps the model identify important pairwise word interactions across sentences. 4. A deep convolutional neural network (ConvNet) converts the similarity measurement problem into a pattern recognition problem for the final classification.
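As an illustration of the coarse-grained CNN + LSTM sentence modelling described for [40] above, a hedged Keras sketch follows (layer sizes, vocabulary size, and sequence length are placeholder assumptions, not the configuration reported in the paper):

```python
from tensorflow.keras import layers, Model

max_len, vocab_size, emb_dim = 50, 20000, 100                         # placeholder sizes

inp = layers.Input(shape=(max_len,))
x = layers.Embedding(vocab_size, emb_dim)(inp)                        # word embeddings
x = layers.Conv1D(filters=64, kernel_size=3, activation="relu")(x)    # local (n-gram) features
x = layers.MaxPooling1D(pool_size=2)(x)                               # keep the strongest signals
x = layers.LSTM(128)(x)                                               # long-term dependencies
sentence_encoder = Model(inp, x, name="cnn_lstm_sentence_encoder")
sentence_encoder.summary()
```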
The model of [42] is applied to assess the semantic similarity between sentences. For these applications, word2vec word-embedding vectors are provided to the LSTMs, which use a fixed-size vector to encode the underlying meaning expressed in a sentence (irrespective of the particular wording/syntax). By restricting subsequent operations to rely on a simple Manhattan metric, the sentence representations learned by the model are compelled to form a highly structured space whose geometry reflects complex semantic relationships. The paper [43] proposes a model for comparing sentences that uses a multiplicity of perspectives: each sentence is first modelled with a convolutional neural network that extracts features at multiple levels of granularity and uses multiple types of pooling; the sentence representations are then compared at several granularities using multiple similarity metrics (cosine, L2 Euclidean). The model is applied to three tasks, including the Microsoft Research paraphrase identification task and two SemEval semantic textual similarity tasks.
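A hedged Keras sketch of the Manhattan-metric (MaLSTM) similarity used in [42], namely exp(-||h_left - h_right||_1) over the final states of two weight-tied LSTM encoders, is given below (vocabulary size, embedding size, and sequence length are illustrative assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

max_len, vocab_size, emb_dim, hidden = 20, 10000, 100, 50   # placeholders

shared_emb = layers.Embedding(vocab_size, emb_dim)
shared_lstm = layers.LSTM(hidden)                 # a single LSTM => tied weights

left_in = layers.Input(shape=(max_len,))
right_in = layers.Input(shape=(max_len,))
h_left = shared_lstm(shared_emb(left_in))
h_right = shared_lstm(shared_emb(right_in))

# Similarity = exp(-L1 distance) between the two final hidden states, in (0, 1].
similarity = layers.Lambda(
    lambda t: tf.exp(-tf.reduce_sum(tf.abs(t[0] - t[1]), axis=1, keepdims=True))
)([h_left, h_right])

malstm = Model([left_in, right_in], similarity)
malstm.compile(optimizer="adam", loss="mse")      # trained against 0/1 similarity labels
```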
In [44], the authors present a convolutional neural network architecture for reranking pairs of short texts, in which the optimal representation of text pairs and a similarity function relating them are learned in a supervised way from the available training data. The network takes only words as input, thus requiring minimal preprocessing. In particular, they consider the task of reranking short text pairs whose elements are sentences, and they test their deep learning system on two popular retrieval tasks from TREC: question answering and microblog retrieval. The system in [45] combines convolutional and recurrent neural networks to measure the semantic similarity of sentences: it uses a convolutional network to take into account the local context of words and an LSTM to consider the global context of sentences. This combination of networks helps to preserve the relevant information of sentences and improves the calculation of the similarity between sentences.

Based on this state of the art, we were able to identify the strengths and weaknesses of each approach, which helped us to build our own approach. Table 1 summarizes the methods described above; for example, it notes that using doc2vec is preferable to using an RNN alone, that the semantic aspect of a paragraph can be lost, and that for the sentence-level word2vec/InferSent representations compared with cosine similarity [22], the use of cosine to detect similarity between sentences remains a solution that carries many risks: since the comparison is done at the sentence level, we always encounter the problem of losing the semantic aspect of the analysed paragraph or text. According to the studies in [31], [32], the use of doc2vec gives promising results. In addition, we could identify the most powerful methods used for the representation of a text: the use of the doc2vec principle remains the most relevant solution according to the studies [29], [30], and we took inspiration from it to build our learning system that detects plagiarism between documents.

RESULTS AND DISCUSSION
In this part we analyse the results of the study carried out above. First, we describe the most important comparison criteria. Vector representation: a treatment performed on a text that transforms it into a list of vectors preserving the semantic and syntactic aspects, as offered by the use of deep learning algorithms.
Treatment level: this criterion defines the level at which a text is treated, that is, whether the text is processed word by word or sentence by sentence.
Similarity method: this criterion covers the approaches used for calculating the similarity between the vectors that represent the texts, which gives us a global view for identifying the strengths and weaknesses of each method. In addition, we discuss the critical points of each approach illustrated above. Starting with the methods used for the vector representation of a text, the analysis shows that most approaches use either word2vec or doc2vec for the vector transformation, so we conclude that the Mikolov representations are the best methods for preserving the semantic aspect of a given text. In contrast, each approach treats the text in its own way: some transform it into a list of words and others into a list of sentences. These representations yield results that differ from one approach to another, but the transformation of a text into a list of sentences remains, in our opinion, the most relevant, since the meaning of the treated text is still taken into consideration. Regarding the methods used for the similarity calculation, the preceding paragraphs mention the different ways used to detect whether or not there is similarity between the analysed texts. Many approaches also rely on CNNs and RNNs in their plagiarism detection architectures, but most of them use the word level for the vector representation, so they are only suited to detecting similarity between sentences, not between whole texts.
In conclusion, we found that almost all of these approaches use cosine similarity to calculate the similarity between documents, and that these methods perform their similarity analyses word by word or sentence by sentence, which later poses a reliability problem for the results: two documents may share the same words or the same sentences without being semantically similar, and the semantic aspect can also be lost when documents are processed as a list of sentences or words. One therefore has to think of a method that handles this problem by representing a text as a list of sentences that is eventually transformed into a list of vectors, combined with a treatment that preserves the semantic aspect of this list of sentences; in other words, a process that analyses a list of sentences to detect similarity using an algorithm such as an RNN, which preserves the semantic aspect of a text.
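To make this suggestion concrete, here is a hedged Keras sketch (not the authors' final system; all names and shapes are illustrative placeholders): each document is a sequence of pre-computed sentence vectors, a recurrent layer encodes the sequence into a single document vector, and documents are compared with cosine similarity.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

max_sents, sent_dim = 30, 100                       # placeholder shapes

doc_in = layers.Input(shape=(max_sents, sent_dim))  # pre-computed sentence vectors
doc_vec = layers.GRU(128)(doc_in)                   # keeps cross-sentence context
doc_encoder = Model(doc_in, doc_vec, name="sentence_sequence_encoder")

def document_similarity(doc_a, doc_b):
    """Cosine similarity between the RNN encodings of two documents."""
    va = tf.math.l2_normalize(doc_encoder(doc_a), axis=1)
    vb = tf.math.l2_normalize(doc_encoder(doc_b), axis=1)
    return tf.reduce_sum(va * vb, axis=1)

# Toy usage with random "sentence vectors" for two documents.
doc_a = tf.random.normal((1, max_sents, sent_dim))
doc_b = tf.random.normal((1, max_sents, sent_dim))
print(document_similarity(doc_a, doc_b).numpy())
```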

CONCLUSION
In this paper, we have reviewed many different methods used for the detection of plagiarism of ideas that rest on the principles of deep learning, and through this study we have built a critical base of the weaknesses observed in previous work. This helped us to get a general idea of the different deep learning methods used for plagiarism detection, and especially for semantic plagiarism detection. In addition, this study has given us the paths to follow for the construction of our approach, by benefiting from the strengths of each method and bypassing its weak points. Future work consists of constructing and putting into practice our approach and comparing it with the other methods discussed in the related work.