Reducing the plagiarism detection search space on the basis of the Kullback-Leibler distance

Alberto Barron, Paolo Rosso, José Miguel Benedí

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

29 Citations (Scopus)

Abstract

Automatic plagiarism detection against a reference corpus compares a suspicious text to a set of original documents in order to relate the plagiarised fragments to their potential sources. Publications on this task often assume that the search space (the set of reference documents) is small enough that any search strategy will produce good output in a short time. However, this is not always true. Reference corpora are often composed of a large set of original documents, for which a simple exhaustive search strategy becomes practically impossible. Before carrying out an exhaustive search, it is necessary to reduce the search space, represented by the documents in the reference corpus, as much as possible. Our experiments with the METER corpus show that a prior search space reduction stage, based on the symmetric Kullback-Leibler distance, reduces the search time dramatically. Additionally, it improves the Precision and Recall obtained by a search strategy based on the exhaustive comparison of word n-grams.
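The two-stage idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the whitespace tokenisation, the `epsilon` value used for unseen words, the `keep` cut-off, the containment score, and all function names are assumptions made for illustration. The symmetric distance is computed in one common form, the sum over the vocabulary of (P(x) - Q(x)) * log(P(x) / Q(x)), which equals KL(P||Q) + KL(Q||P).

```python
import math
from collections import Counter


def word_distribution(text):
    """Relative frequency of each lowercased, whitespace-separated token."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def symmetric_kl(p, q, epsilon=1e-6):
    """Symmetric Kullback-Leibler distance between two word distributions.

    Words missing from one distribution get the small probability `epsilon`
    (an assumed, simplistic smoothing; the distributions are not renormalised).
    """
    distance = 0.0
    for word in set(p) | set(q):
        pw = p.get(word, epsilon)
        qw = q.get(word, epsilon)
        distance += (pw - qw) * math.log(pw / qw)
    return distance


def reduce_search_space(suspicious, references, keep=10):
    """Stage 1: keep the `keep` reference documents closest to the suspicious text.

    `references` maps a document id to its text.
    """
    p = word_distribution(suspicious)
    ranked = sorted(
        references,
        key=lambda doc_id: symmetric_kl(p, word_distribution(references[doc_id])),
    )
    return ranked[:keep]


def word_ngrams(text, n=3):
    """Set of word n-grams (as tuples) in the text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def containment(suspicious, reference, n=3):
    """Stage 2: fraction of the suspicious text's n-grams found in the reference."""
    suspicious_ngrams = word_ngrams(suspicious, n)
    if not suspicious_ngrams:
        return 0.0
    return len(suspicious_ngrams & word_ngrams(reference, n)) / len(suspicious_ngrams)
```

In use, `reduce_search_space` would be called first on the whole reference corpus, and `containment` would then be computed only against the documents it returns, so the expensive n-gram comparison runs on a small candidate set rather than on every reference document.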

Original language: English
Title of host publication: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Pages: 523-534
Number of pages: 12
Volume: 5449 LNCS
DOIs: https://doi.org/10.1007/978-3-642-00382-0_42
Publication status: Published - 2009
Externally published: Yes
Event: 10th International Conference on Computational Linguistics and Intelligent Text Processing, CICLing 2009 - Mexico City
Duration: 1 Mar 2009 → 7 Mar 2009

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 5449 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 10th International Conference on Computational Linguistics and Intelligent Text Processing, CICLing 2009
City: Mexico City
Period: 1/3/09 → 7/3/09

Fingerprint

Kullback-Leibler Distance
Search Space
Search Strategy
Exhaustive Search
Experiments
N-gram
Fragment
Necessary
Corpus
Output
Experiment

ASJC Scopus subject areas

  • Computer Science(all)
  • Theoretical Computer Science

Cite this

Barron, A., Rosso, P., & Benedí, J. M. (2009). Reducing the plagiarism detection search space on the basis of the Kullback-Leibler distance. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 5449 LNCS, pp. 523-534). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 5449 LNCS). https://doi.org/10.1007/978-3-642-00382-0_42

Reducing the plagiarism detection search space on the basis of the Kullback-Leibler distance. / Barron, Alberto; Rosso, Paolo; Benedí, José Miguel.

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Vol. 5449 LNCS 2009. p. 523-534 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 5449 LNCS).

Barron, A, Rosso, P & Benedí, JM 2009, Reducing the plagiarism detection search space on the basis of the Kullback-Leibler distance. in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). vol. 5449 LNCS, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 5449 LNCS, pp. 523-534, 10th International Conference on Computational Linguistics and Intelligent Text Processing, CICLing 2009, Mexico City, 1/3/09. https://doi.org/10.1007/978-3-642-00382-0_42
Barron A, Rosso P, Benedí JM. Reducing the plagiarism detection search space on the basis of the Kullback-Leibler distance. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Vol. 5449 LNCS. 2009. p. 523-534. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). https://doi.org/10.1007/978-3-642-00382-0_42
Barron, Alberto ; Rosso, Paolo ; Benedí, José Miguel. / Reducing the plagiarism detection search space on the basis of the Kullback-Leibler distance. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Vol. 5449 LNCS 2009. pp. 523-534 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).
@inproceedings{7885c95b44324f80a2374bcb3cccb8f1,
title = "Reducing the plagiarism detection search space on the basis of the Kullback-Leibler distance",
abstract = "Automatic plagiarism detection considering a reference corpus compares a suspicious text to a set of original documents in order to relate the plagiarised fragments to their potential source. Publications on this task often assume that the search space (the set of reference documents) is a narrow set where any search strategy will produce a good output in a short time. However, this is not always true. Reference corpora are often composed of a big set of original documents where a simple exhaustive search strategy becomes practically impossible. Before carrying out an exhaustive search, it is necessary to reduce the search space, represented by the documents in the reference corpus, as much as possible. Our experiments with the METER corpus show that a previous search space reduction stage, based on the Kullback-Leibler symmetric distance, reduces the search process time dramatically. Additionally, it improves the Precision and Recall obtained by a search strategy based on the exhaustive comparison of word n-grams.",
author = "Alberto Barron and Paolo Rosso and Bened{\'i}, {Jos{\'e} Miguel}",
year = "2009",
doi = "10.1007/978-3-642-00382-0_42",
language = "English",
isbn = "3642003818",
volume = "5449 LNCS",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
pages = "523--534",
booktitle = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

}

TY - GEN

T1 - Reducing the plagiarism detection search space on the basis of the Kullback-Leibler distance

AU - Barron, Alberto

AU - Rosso, Paolo

AU - Benedí, José Miguel

PY - 2009

Y1 - 2009

N2 - Automatic plagiarism detection considering a reference corpus compares a suspicious text to a set of original documents in order to relate the plagiarised fragments to their potential source. Publications on this task often assume that the search space (the set of reference documents) is a narrow set where any search strategy will produce a good output in a short time. However, this is not always true. Reference corpora are often composed of a big set of original documents where a simple exhaustive search strategy becomes practically impossible. Before carrying out an exhaustive search, it is necessary to reduce the search space, represented by the documents in the reference corpus, as much as possible. Our experiments with the METER corpus show that a previous search space reduction stage, based on the Kullback-Leibler symmetric distance, reduces the search process time dramatically. Additionally, it improves the Precision and Recall obtained by a search strategy based on the exhaustive comparison of word n-grams.

UR - http://www.scopus.com/inward/record.url?scp=67650535503&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=67650535503&partnerID=8YFLogxK

U2 - 10.1007/978-3-642-00382-0_42

DO - 10.1007/978-3-642-00382-0_42

M3 - Conference contribution

SN - 3642003818

SN - 9783642003813

VL - 5449 LNCS

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 523

EP - 534

BT - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

ER -