Word length n-grams for text re-use detection

Alberto Barron, Chiara Basile, Mirko Degli Esposti, Paolo Rosso

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

22 Citations (Scopus)

Abstract

The automatic detection of shared content in written documents, which includes text reuse and its unacknowledged commission, plagiarism, has become an important problem in Information Retrieval. The task requires an exhaustive comparison of texts in order to determine how similar they are. However, such a comparison is infeasible when the number of documents is too large. We have therefore designed a model for the pre-selection of closely related documents, so that the exhaustive comparison can be performed afterwards on this reduced set. We use a similarity measure based on word-level n-grams, which has proved quite effective in many applications. As this approach normally becomes impracticable for real-world large datasets, we propose a method based on a preliminary word-length encoding of the texts, substituting each word by its length. This provides three important advantages: (i) since the alphabet of the documents is reduced to nine symbols, the space needed to store n-gram lists is reduced; (ii) computation times are decreased; and (iii) length n-grams can be represented in a trie, allowing more flexible and faster comparison. We show experimentally, on the basis of the perplexity measure, that the noise introduced by the length encoding does not significantly decrease the expressiveness of the text. The method is then tested on two large datasets of co-derivatives and simulated plagiarism.
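The core idea, replacing each word with its length (capped so the alphabet stays at nine symbols) and comparing the resulting length n-grams, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the cap at nine, the choice of n, and the containment similarity used here are assumptions, and the paper stores the n-grams in a trie for faster comparison, which this sketch replaces with plain Python sets.

```python
import re

def length_encode(text, max_len=9):
    """Replace each word with its length, capping long words at
    max_len so the encoded alphabet stays within nine symbols (1..9)."""
    words = re.findall(r"\w+", text.lower())
    return [min(len(w), max_len) for w in words]

def length_ngrams(codes, n):
    """Set of length n-grams of an encoded document."""
    return {tuple(codes[i:i + n]) for i in range(len(codes) - n + 1)}

def containment(suspicious, source, n=3):
    """Fraction of the suspicious document's length n-grams that also
    occur in the source document; 1.0 means full containment."""
    gs = length_ngrams(suspicious, n)
    gt = length_ngrams(source, n)
    return len(gs & gt) / len(gs) if gs else 0.0
```

In a pre-selection setting, only document pairs whose containment exceeds some threshold would be passed on to the expensive word-level comparison; the cheap nine-symbol encoding is what makes scanning a large collection affordable.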

Original language: English
Title of host publication: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Pages: 687-699
Number of pages: 13
Volume: 6008 LNCS
ISBN (Print): 3642121152, 9783642121159
DOI: 10.1007/978-3-642-12116-6_58
Publication status: Published - 2010
Externally published: Yes
Event: 11th International Conference on Computational Linguistics and Intelligent Text Processing, CICLing 2010 - Iasi
Duration: 21 Mar 2010 – 27 Mar 2010

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 6008 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 11th International Conference on Computational Linguistics and Intelligent Text Processing, CICLing 2010
City: Iasi
Period: 21/3/10 – 27/3/10

Keywords

  • Information retrieval
  • Plagiarism detection
  • Text reuse analysis
  • Text similarity analysis
  • Word length encoding

ASJC Scopus subject areas

  • Computer Science(all)
  • Theoretical Computer Science

Cite this

Barron, A., Basile, C., Esposti, M. D., & Rosso, P. (2010). Word length n-grams for text re-use detection. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 6008 LNCS, pp. 687-699). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 6008 LNCS). https://doi.org/10.1007/978-3-642-12116-6_58

