A two-phase plagiarism detection system based on multi-layer LSTM Networks

Nguyen Van Son, Le Thanh Huong, Nguyen Chi Thanh


Finding the plagiarized strings shared between two given documents is the main task of the plagiarism detection problem. Traditional approaches based on string matching are of limited use when the plagiarism is semantic, that is, when the meaning is preserved but the wording changes. Deep learning approaches address this by measuring the semantic similarity between pairs of sentences. However, these approaches still face two challenges. First, they cannot handle cases where only part of a sentence belongs to a plagiarized passage. Second, measuring sentence similarity without considering the context of the surrounding sentences reduces accuracy. To address these problems, this paper proposes a two-phase plagiarism detection system based on a multi-layer LSTM network model and a feature extraction technique: (i) a passage phase that recognizes plagiarized passages, and (ii) a word phase that determines the exact plagiarized strings. In our experiments on the PAN 2014 corpus, the system reached an F-measure of 94.26%, higher than the results reported in existing research in this field.
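The two-phase flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: the multi-layer LSTM scorer and the feature extraction step are replaced by a simple token-overlap (Jaccard) stand-in, and every name here (`jaccard`, `passage_phase`, `word_phase`, `window`, `threshold`) is a hypothetical choice made for illustration. The sketch only shows how a coarse passage-level decision can be followed by a finer word-level one.

```python
from typing import Callable, List, Optional, Tuple

Sentence = List[str]  # a sentence is a list of tokens


def jaccard(a: List[str], b: List[str]) -> float:
    """Stand-in similarity: token-set overlap. NOT the paper's LSTM model."""
    sa, sb = set(a), set(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def windows(sents: List[Sentence], size: int) -> List[List[str]]:
    """Flatten each run of `size` neighbouring sentences into one token list,
    so a sentence is scored together with its surrounding context."""
    count = max(1, len(sents) - size + 1)
    return [[tok for s in sents[i:i + size] for tok in s] for i in range(count)]


def passage_phase(
    src: List[Sentence],
    sus: List[Sentence],
    window: int = 1,
    threshold: float = 0.5,
    score: Callable[[List[str], List[str]], float] = jaccard,
) -> List[Tuple[int, int]]:
    """Phase 1 (assumed form): return (suspicious, source) window indices
    whose similarity score reaches the threshold."""
    src_w, sus_w = windows(src, window), windows(sus, window)
    return [
        (i, j)
        for i, s_tokens in enumerate(sus_w)
        for j, r_tokens in enumerate(src_w)
        if score(s_tokens, r_tokens) >= threshold
    ]


def word_phase(sus_tokens: List[str], src_tokens: List[str]) -> Optional[Tuple[int, int]]:
    """Phase 2 (assumed form): trim a flagged passage to the half-open token
    span running from the first to the last token also found in the source."""
    vocab = set(src_tokens)
    shared = [k for k, tok in enumerate(sus_tokens) if tok in vocab]
    if not shared:
        return None
    return shared[0], shared[-1] + 1


if __name__ == "__main__":
    source = [["the", "cat", "sat", "on", "the", "mat"], ["it", "was", "warm"]]
    suspicious = [
        ["hello", "there"],
        ["yesterday", "the", "cat", "sat", "on", "the", "mat", "quietly"],
    ]
    for i, j in passage_phase(source, suspicious):
        span = word_phase(suspicious[i], source[j])
        if span is not None:
            print(i, j, suspicious[i][span[0]:span[1]])
```

In this toy setting the word phase handles the first challenge named in the abstract (only part of a sentence is copied), while a `window` larger than 1 gestures at the second (scoring a sentence together with its neighbours). A trained model would replace `jaccard` in both roles.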


Plagiarism detection; Feature extraction; Deep learning; Multi-layer LSTM model; Two-phase


DOI: http://doi.org/10.11591/ijai.v10.i3.pp%25p


This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.