Methods for cross-language plagiarism detection

Alberto Barron, Parth Gupta, Paolo Rosso

Research output: Contribution to journal (Article)

40 Citations (Scopus)

Abstract

Three factors are driving a rise in cross-language plagiarism: (i) speakers of under-resourced languages often consult documentation in a foreign language, (ii) people immersed in a foreign country can still consult material written in their native language, and (iii) people are often interested in writing in a language other than their native one. Most efforts to detect cross-language plagiarism automatically depend on a preliminary translation, which is not always available. In this paper we propose a freely available architecture for plagiarism detection across languages that covers the entire process: heuristic retrieval, detailed analysis, and post-processing. On top of this architecture we explore the suitability of three cross-language similarity estimation models: Cross-Language Alignment-based Similarity Analysis (CL-ASA), Cross-Language Character n-Grams (CL-CNG), and Translation plus Monolingual Analysis (T + MA). The three models differ inherently in nature and in the resources they require. They are tested extensively under the same conditions on the different plagiarism detection sub-tasks, something never done before. The experiments show that T + MA produces the best results, closely followed by CL-ASA. Still, CL-ASA obtains higher precision, an important factor in plagiarism detection when less user intervention is desired.
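To make the character n-gram idea behind CL-CNG concrete, the following is a minimal sketch, not the authors' implementation. It assumes character 3-grams, lowercasing with diacritics removed, raw term-frequency weights, and cosine similarity; none of these settings are stated in the abstract, and the function names are hypothetical. The approach only makes sense for language pairs that share an alphabet and some vocabulary.

```python
import math
import re
import unicodedata
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop diacritics, and collapse non-word characters to one space."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[\W_]+", " ", no_marks).strip()


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams (n=3 is an assumption of this sketch)."""
    s = normalize(text)
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty or too-short text has no n-grams
    return dot / (norm_a * norm_b)


def cl_cng_similarity(x: str, y: str, n: int = 3) -> float:
    """Hypothetical helper: character n-gram similarity of two texts."""
    return cosine(char_ngrams(x, n), char_ngrams(y, n))


if __name__ == "__main__":
    english = "The international organization published information"
    spanish = "La organización internacional publicó información"
    print(cl_cng_similarity(english, spanish))
```

Because cognates such as "international" and "internacional" share many character 3-grams, related English and Spanish sentences receive a non-zero score without any translation step. A real detector would still need a decision threshold and the retrieval and post-processing stages described in the abstract.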

Original language: English
Pages (from-to): 211-217
Number of pages: 7
Journal: Knowledge-Based Systems
Volume: 50
DOIs: 10.1016/j.knosys.2013.06.018
Publication status: Published - Sep 2013
Externally published: Yes

Fingerprint

Processing
Plagiarism
Language
Experiments
Alignment
Heuristics
Resources
Documentation
Factors
Experiment

Keywords

  • Automatic plagiarism detection
  • Cross-language plagiarism
  • Cross-language similarity
  • Plagiarism detection architecture
  • Text re-use analysis

ASJC Scopus subject areas

  • Software
  • Artificial Intelligence
  • Management Information Systems
  • Information Systems and Management

Cite this

Methods for cross-language plagiarism detection. / Barron, Alberto; Gupta, Parth; Rosso, Paolo.

In: Knowledge-Based Systems, Vol. 50, 09.2013, p. 211-217.

Barron, Alberto ; Gupta, Parth ; Rosso, Paolo. / Methods for cross-language plagiarism detection. In: Knowledge-Based Systems. 2013 ; Vol. 50. pp. 211-217.
@article{62e63c5431c04ac38d8735e2060425be,
title = "Methods for cross-language plagiarism detection",
abstract = "Three factors are driving a rise in cross-language plagiarism: (i) speakers of under-resourced languages often consult documentation in a foreign language, (ii) people immersed in a foreign country can still consult material written in their native language, and (iii) people are often interested in writing in a language other than their native one. Most efforts to detect cross-language plagiarism automatically depend on a preliminary translation, which is not always available. In this paper we propose a freely available architecture for plagiarism detection across languages that covers the entire process: heuristic retrieval, detailed analysis, and post-processing. On top of this architecture we explore the suitability of three cross-language similarity estimation models: Cross-Language Alignment-based Similarity Analysis (CL-ASA), Cross-Language Character n-Grams (CL-CNG), and Translation plus Monolingual Analysis (T + MA). The three models differ inherently in nature and in the resources they require. They are tested extensively under the same conditions on the different plagiarism detection sub-tasks, something never done before. The experiments show that T + MA produces the best results, closely followed by CL-ASA. Still, CL-ASA obtains higher precision, an important factor in plagiarism detection when less user intervention is desired.",
keywords = "Automatic plagiarism detection, Cross-language plagiarism, Cross-language similarity, Plagiarism detection architecture, Text re-use analysis",
author = "Alberto Barron and Parth Gupta and Paolo Rosso",
year = "2013",
month = "9",
doi = "10.1016/j.knosys.2013.06.018",
language = "English",
volume = "50",
pages = "211--217",
journal = "Knowledge-Based Systems",
issn = "0950-7051",
publisher = "Elsevier",

}

TY - JOUR

T1 - Methods for cross-language plagiarism detection

AU - Barron, Alberto

AU - Gupta, Parth

AU - Rosso, Paolo

PY - 2013/9

Y1 - 2013/9

AB - Three factors are driving a rise in cross-language plagiarism: (i) speakers of under-resourced languages often consult documentation in a foreign language, (ii) people immersed in a foreign country can still consult material written in their native language, and (iii) people are often interested in writing in a language other than their native one. Most efforts to detect cross-language plagiarism automatically depend on a preliminary translation, which is not always available. In this paper we propose a freely available architecture for plagiarism detection across languages that covers the entire process: heuristic retrieval, detailed analysis, and post-processing. On top of this architecture we explore the suitability of three cross-language similarity estimation models: Cross-Language Alignment-based Similarity Analysis (CL-ASA), Cross-Language Character n-Grams (CL-CNG), and Translation plus Monolingual Analysis (T + MA). The three models differ inherently in nature and in the resources they require. They are tested extensively under the same conditions on the different plagiarism detection sub-tasks, something never done before. The experiments show that T + MA produces the best results, closely followed by CL-ASA. Still, CL-ASA obtains higher precision, an important factor in plagiarism detection when less user intervention is desired.

KW - Automatic plagiarism detection

KW - Cross-language plagiarism

KW - Cross-language similarity

KW - Plagiarism detection architecture

KW - Text re-use analysis

UR - http://www.scopus.com/inward/record.url?scp=84881315849&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84881315849&partnerID=8YFLogxK

U2 - 10.1016/j.knosys.2013.06.018

DO - 10.1016/j.knosys.2013.06.018

M3 - Article

AN - SCOPUS:84881315849

VL - 50

SP - 211

EP - 217

JO - Knowledge-Based Systems

JF - Knowledge-Based Systems

SN - 0950-7051

ER -