Plagiarism Meets Paraphrasing: Insights for the Next Generation in Automatic Plagiarism Detection

Alberto Barron, Marta Vila, M. Antònia Martí, Paolo Rosso

Research output: Contribution to journal (Article)

71 Citations (Scopus)

Abstract

Although paraphrasing is the linguistic mechanism underlying many plagiarism cases, little attention has been paid to its analysis in the framework of automatic plagiarism detection. Therefore, state-of-the-art plagiarism detectors find it difficult to detect cases of paraphrase plagiarism. In this article, we analyze the relationship between paraphrasing and plagiarism, paying special attention to which paraphrase phenomena underlie acts of plagiarism and which of them are detected by plagiarism detection systems. With this aim in mind, we created the P4P corpus, a new resource that uses a paraphrase typology to annotate a subset of the PAN-PC-10 corpus for automatic plagiarism detection. The results of the Second International Competition on Plagiarism Detection were analyzed in the light of this annotation. The presented experiments show that (i) more complex paraphrase phenomena and a high density of paraphrase mechanisms make plagiarism detection more difficult, (ii) lexical substitutions are the paraphrase mechanisms used the most when plagiarizing, and (iii) paraphrase mechanisms tend to shorten the plagiarized text. For the first time, the paraphrase mechanisms behind plagiarism have been analyzed, providing critical insights for the improvement of automatic plagiarism detection systems.

Original language: English
Pages (from-to): 917-947
Number of pages: 31
Journal: Computational Linguistics
Volume: 39
Issue number: 4
DOI: 10.1162/COLI_a_00153
Publication status: Published - Dec 2013
Externally published: Yes

ASJC Scopus subject areas

  • Computer Science Applications
  • Artificial Intelligence
  • Linguistics and Language
  • Language and Linguistics

Cite this

Plagiarism Meets Paraphrasing: Insights for the Next Generation in Automatic Plagiarism Detection. / Barron, Alberto; Vila, Marta; Martí, M. Antònia; Rosso, Paolo.

In: Computational Linguistics, Vol. 39, No. 4, 12.2013, p. 917-947.

@article{9aad43c0f42342598cc7880dd4f36def,
title = "Plagiarism Meets Paraphrasing: Insights for the Next Generation in Automatic Plagiarism Detection",
author = "Alberto Barron and Marta Vila and Mart{\'i}, M. Ant{\`o}nia and Paolo Rosso",
year = "2013",
month = "12",
doi = "10.1162/COLI_a_00153",
language = "English",
volume = "39",
pages = "917--947",
journal = "Computational Linguistics",
issn = "0891-2017",
publisher = "MIT Press Journals",
number = "4"
}

TY - JOUR

T1 - Plagiarism Meets Paraphrasing

T2 - Insights for the Next Generation in Automatic Plagiarism Detection

AU - Barron, Alberto

AU - Vila, Marta

AU - Martí, M. Antònia

AU - Rosso, Paolo

PY - 2013/12

Y1 - 2013/12

UR - http://www.scopus.com/inward/record.url?scp=84877274650&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84877274650&partnerID=8YFLogxK

U2 - 10.1162/COLI_a_00153

DO - 10.1162/COLI_a_00153

M3 - Article

AN - SCOPUS:84877274650

VL - 39

SP - 917

EP - 947

JO - Computational Linguistics

JF - Computational Linguistics

SN - 0891-2017

IS - 4

ER -