Word length n-grams for text re-use detection

Alberto Barron, Chiara Basile, Mirko Degli Esposti, Paolo Rosso

Research output: Chapter in Book/Report/Conference proceedingConference contribution

22 Citations (Scopus)

Abstract

The automatic detection of shared content in written documents -which includes text reuse and its unacknowledged commitment, plagiarism- has become an important problem in Information Retrieval. This task requires exhaustive comparison of texts in order to determine how similar they are. However, such comparison is impossible in those cases where the amount of documents is too high. Therefore, we have designed a model for the proper pre-selection of closely related documents in order to perform the exhaustive comparison afterwards. We use a similarity measure based on word-level n-grams, which proved to be quite effective in many applications As this approach becomes normally impracticable for real-world large datasets, we propose a method based on a preliminary word-length encoding of texts, substituting a word by its length, providing three important advantages: (i) being the alphabet of the documents reduced to nine symbols, the space needed to store n-gram lists is reduced; (ii) computation times are decreased; and (iii) length n-grams can be represented in a trie, allowing a more flexible and fast comparison. We experimentally show, on the basis of the perplexity measure, that the noise introduced by the length encoding does not decrease importantly the expressiveness of the text. The method is then tested on two large datasets of co-derivatives and simulated plagiarism.

Original languageEnglish
Title of host publicationLecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Pages687-699
Number of pages13
Volume6008 LNCS
DOIs
Publication statusPublished - 2010
Externally publishedYes
Event11th International Conference on Computational Linguistics and Intelligent Text Processing, CICLing 2010 - Iasi
Duration: 21 Mar 201027 Mar 2010

Publication series

NameLecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume6008 LNCS
ISSN (Print)03029743
ISSN (Electronic)16113349

Other

Other11th International Conference on Computational Linguistics and Intelligent Text Processing, CICLing 2010
CityIasi
Period21/3/1027/3/10

    Fingerprint

Keywords

  • Information retrieval
  • Plagiarism detection
  • Text reuse analysis
  • Text similarity analysis
  • Word length encoding

ASJC Scopus subject areas

  • Computer Science(all)
  • Theoretical Computer Science

Cite this

Barron, A., Basile, C., Esposti, M. D., & Rosso, P. (2010). Word length n-grams for text re-use detection. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 6008 LNCS, pp. 687-699). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 6008 LNCS). https://doi.org/10.1007/978-3-642-12116-6_58