Methods for cross-language plagiarism detection

Alberto Barron, Parth Gupta, Paolo Rosso

Research output: Contribution to journal › Article

42 Citations (Scopus)


Three factors are driving a rise in plagiarism across languages: (i) speakers of under-resourced languages often consult documentation in a foreign language, (ii) people living in a foreign country can still consult material written in their native language, and (iii) people are often interested in writing in a language other than their native one. Most approaches to automatic cross-language plagiarism detection depend on a preliminary translation, which is not always available. In this paper we propose a freely available architecture for plagiarism detection across languages that covers the entire process: heuristic retrieval, detailed analysis, and post-processing. On top of this architecture we explore the suitability of three cross-language similarity estimation models: Cross-Language Alignment-based Similarity Analysis (CL-ASA), Cross-Language Character n-Grams (CL-CNG), and Translation plus Monolingual Analysis (T + MA). The three models differ inherently in nature and in the resources they require. They are tested extensively under the same conditions on the different plagiarism detection sub-tasks, something never done before. The experiments show that T + MA produces the best results, closely followed by CL-ASA. Still, CL-ASA obtains higher precision, an important factor in plagiarism detection when less user intervention is desired.
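
As a rough illustration of the simplest of the three models named above, CL-CNG, the following Python sketch compares two texts by the cosine similarity of their character n-gram counts. It is only a sketch under stated assumptions: the function names, the normalisation steps (lowercasing, stripping diacritics and punctuation), and the default n = 3 are choices made for this example and are not taken from the paper.

```python
import math
import re
import unicodedata
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count character n-grams after a simple normalisation (assumed for this sketch)."""
    # Lowercase and strip diacritics so that "detección" and "deteccion" match.
    decomposed = unicodedata.normalize("NFKD", text.lower())
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Drop punctuation and collapse runs of whitespace into single spaces.
    cleaned = re.sub(r"[^\w\s]", "", stripped)
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # at least one text is too short to yield any n-gram
    return dot / (norm_a * norm_b)


def cl_cng_similarity(text_a: str, text_b: str, n: int = 3) -> float:
    """Hypothetical helper: character n-gram similarity of two texts."""
    return cosine_similarity(char_ngrams(text_a, n), char_ngrams(text_b, n))


if __name__ == "__main__":
    spanish = "detección automática de plagio"
    related = "automatic plagiarism detection"
    unrelated = "the weather is sunny today"
    print(cl_cng_similarity(spanish, related))
    print(cl_cng_similarity(spanish, unrelated))
```

A model of this kind needs no translation resources, but it can only work for language pairs that share an alphabet and a fair amount of surface vocabulary, as in the Spanish and English phrases used here.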

Original language: English
Pages (from-to): 211-217
Number of pages: 7
Journal: Knowledge-Based Systems
Publication status: Published - Sep 2013
Externally published: Yes

Keywords

  • Automatic plagiarism detection
  • Cross-language plagiarism
  • Cross-language similarity
  • Plagiarism detection architecture
  • Text re-use analysis

ASJC Scopus subject areas

  • Software
  • Artificial Intelligence
  • Management Information Systems
  • Information Systems and Management
