Cross-language plagiarism detection

Martin Potthast, Alberto Barron, Benno Stein, Paolo Rosso

Research output: Contribution to journal › Article

95 Citations (Scopus)

Abstract

Cross-language plagiarism detection deals with the automatic identification and extraction of plagiarism in a multilingual setting. In this setting, a suspicious document is given, and the task is to retrieve all sections from the document that originate from a large, multilingual document collection. Our contributions in this field are as follows: (1) a comprehensive retrieval process for cross-language plagiarism detection is introduced, highlighting the differences from monolingual plagiarism detection, (2) state-of-the-art solutions for two important subtasks are reviewed, (3) retrieval models for the assessment of cross-language similarity are surveyed, and (4) the three models CL-CNG, CL-ESA, and CL-ASA are compared. Our evaluation is of realistic scale: it relies on 120,000 test documents selected from the corpora JRC-Acquis and Wikipedia, so that for each test document highly similar documents are available in all of the six languages English, German, Spanish, French, Dutch, and Polish. The models are employed in a series of ranking tasks, and more than 100 million similarities are computed with each model. The results of our evaluation indicate that CL-CNG, despite its simple approach, is the best choice to rank and compare texts across languages if they are syntactically related. CL-ESA almost matches the performance of CL-CNG, but on arbitrary pairs of languages. CL-ASA works best on "exact" translations but does not generalize well.
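To make the CL-CNG idea concrete: it compares texts across languages by their overlapping character n-grams, which works well for syntactically related languages that share cognates and spelling conventions. The following is an illustrative sketch only (not the authors' implementation), using character 3-grams and cosine similarity; the sample sentences are invented for demonstration.

```python
# Sketch of the CL-CNG idea: represent each text as a bag of character
# n-grams and compare the bags with cosine similarity. Related languages
# share many n-grams (e.g., "plagiarism" vs. German "Plagiarismus").
from collections import Counter
from math import sqrt


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Lowercase the text, keep only letters and spaces, count char n-grams."""
    cleaned = "".join(c for c in text.lower() if c.isalpha() or c.isspace())
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# English vs. German: shared n-grams yield a nonzero similarity.
en = char_ngrams("plagiarism detection in documents")
de = char_ngrams("plagiarismus erkennung in dokumenten")
print(cosine(en, de))
```

In a ranking task like the one evaluated in the paper, such a score would be computed between the suspicious document and every candidate, and candidates would be ranked by descending similarity.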

Original language: English
Pages (from-to): 45-62
Number of pages: 18
Journal: Language Resources and Evaluation
Volume: 45
Issue number: 1
DOI: 10.1007/s10579-009-9114-z
Publication status: Published - Mar 2011
Externally published: Yes

Keywords

  • Cross-language
  • Evaluation
  • Plagiarism detection
  • Retrieval model
  • Similarity

ASJC Scopus subject areas

  • Linguistics and Language
  • Library and Information Sciences

Cite this

Potthast, Martin; Barron, Alberto; Stein, Benno; Rosso, Paolo. Cross-language plagiarism detection. In: Language Resources and Evaluation, Vol. 45, No. 1, March 2011, pp. 45-62.

@article{209b41d8b592401884a8b8a362f34377,
title = "Cross-language plagiarism detection",
keywords = "Cross-language, Evaluation, Plagiarism detection, Retrieval model, Similarity",
author = "Martin Potthast and Alberto Barron and Benno Stein and Paolo Rosso",
year = "2011",
month = "3",
doi = "10.1007/s10579-009-9114-z",
language = "English",
volume = "45",
pages = "45--62",
journal = "Language Resources and Evaluation",
issn = "1574-020X",
publisher = "Springer Netherlands",
number = "1",

}

TY - JOUR

T1 - Cross-language plagiarism detection

AU - Potthast, Martin

AU - Barron, Alberto

AU - Stein, Benno

AU - Rosso, Paolo

PY - 2011/3

Y1 - 2011/3


KW - Cross-language

KW - Evaluation

KW - Plagiarism detection

KW - Retrieval model

KW - Similarity

UR - http://www.scopus.com/inward/record.url?scp=79952244075&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=79952244075&partnerID=8YFLogxK

U2 - 10.1007/s10579-009-9114-z

DO - 10.1007/s10579-009-9114-z

M3 - Article

VL - 45

SP - 45

EP - 62

JO - Language Resources and Evaluation

JF - Language Resources and Evaluation

SN - 1574-020X

IS - 1

ER -