On automatic plagiarism detection based on n-grams comparison

Alberto Barron, Paolo Rosso

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

60 Citations (Scopus)

Abstract

When automatic plagiarism detection is carried out considering a reference corpus, a suspicious text is compared to a set of original documents in order to relate the plagiarised text fragments to their potential source. One of the biggest difficulties in this task is to locate plagiarised fragments that have been modified (by rewording, insertion or deletion, for example) from the source text. The definition of proper text chunks as comparison units of the suspicious and original texts is crucial for the success of this kind of application. Our experiments with the METER corpus show that the best results are obtained when considering low-level word n-gram comparisons (n = {2, 3}).
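
As a rough illustration of the comparison the abstract describes, the following minimal Python sketch splits a suspicious fragment and a source text into word n-grams and measures how many of the fragment's n-grams also occur in the source. The abstract does not say which similarity measure the authors use, so the containment ratio here is an assumption, as are the function names `word_ngrams` and `containment`, the regex tokenisation, and the toy sentences.

```python
import re


def word_ngrams(text, n):
    """Return the set of word n-grams (as tuples) found in lower-cased text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def containment(suspicious, source, n=2):
    """Fraction of the suspicious fragment's n-grams that also occur in source.

    Hypothetical measure chosen for illustration; not taken from the paper.
    """
    susp = word_ngrams(suspicious, n)
    if not susp:
        return 0.0
    return len(susp & word_ngrams(source, n)) / len(susp)


# Toy data (invented): a reworded copy and an unrelated sentence.
source = "the quick brown fox jumps over the lazy dog"
reworded = "a quick brown fox leaps over the lazy dog"
unrelated = "stock markets closed higher on friday"

for n in (2, 3):
    print(n, containment(reworded, source, n), containment(unrelated, source, n))
```

In this toy case two substituted words still leave most bigrams shared with the source, while the share of common trigrams drops; longer n-grams are broken more easily by small modifications, which is consistent with the abstract's preference for low values of n.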

Original language: English
Title of host publication: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Pages: 696-700
Number of pages: 5
Volume: 5478 LNCS
DOIs: https://doi.org/10.1007/978-3-642-00958-7_69
Publication status: Published - 2009
Externally published: Yes
Event: 31st European Conference on Information Retrieval, ECIR 2009 - Toulouse
Duration: 6 Apr 2009 to 9 Apr 2009

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 5478 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 31st European Conference on Information Retrieval, ECIR 2009
City: Toulouse
Period: 6/4/09 to 9/4/09

Keywords

  • Information extraction
  • N-grams
  • Plagiarism detection
  • Reference corpus
  • Text reuse

ASJC Scopus subject areas

  • Computer Science (all)
  • Theoretical Computer Science

Cite this

Barron, A., & Rosso, P. (2009). On automatic plagiarism detection based on n-grams comparison. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 5478 LNCS, pp. 696-700). https://doi.org/10.1007/978-3-642-00958-7_69

@inproceedings{324bfaf8e8dd401792d525cf5956a354,
title = "On automatic plagiarism detection based on n-grams comparison",
abstract = "When automatic plagiarism detection is carried out considering a reference corpus, a suspicious text is compared to a set of original documents in order to relate the plagiarised text fragments to their potential source. One of the biggest difficulties in this task is to locate plagiarised fragments that have been modified (by rewording, insertion or deletion, for example) from the source text. The definition of proper text chunks as comparison units of the suspicious and original texts is crucial for the success of this kind of application. Our experiments with the METER corpus show that the best results are obtained when considering low-level word n-gram comparisons (n = {2, 3}).",
keywords = "Information extraction, N-grams, Plagiarism detection, Reference corpus, Text reuse",
author = "Alberto Barron and Paolo Rosso",
year = "2009",
doi = "10.1007/978-3-642-00958-7_69",
language = "English",
isbn = "3642009573",
volume = "5478 LNCS",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
pages = "696--700",
booktitle = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

}

TY - GEN
T1 - On automatic plagiarism detection based on n-grams comparison
AU - Barron, Alberto
AU - Rosso, Paolo
PY - 2009
Y1 - 2009
AB - When automatic plagiarism detection is carried out considering a reference corpus, a suspicious text is compared to a set of original documents in order to relate the plagiarised text fragments to their potential source. One of the biggest difficulties in this task is to locate plagiarised fragments that have been modified (by rewording, insertion or deletion, for example) from the source text. The definition of proper text chunks as comparison units of the suspicious and original texts is crucial for the success of this kind of application. Our experiments with the METER corpus show that the best results are obtained when considering low-level word n-gram comparisons (n = {2, 3}).
KW - Information extraction
KW - N-grams
KW - Plagiarism detection
KW - Reference corpus
KW - Text reuse
UR - http://www.scopus.com/inward/record.url?scp=67650705687&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=67650705687&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-00958-7_69
DO - 10.1007/978-3-642-00958-7_69
M3 - Conference contribution
SN - 3642009573
SN - 9783642009570
VL - 5478 LNCS
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 696
EP - 700
BT - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
ER -