Plagiarism detection across distant language pairs

Alberto Barron, Paolo Rosso, Eneko Agirre, Gorka Labaka

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

32 Citations (Scopus)

Abstract

Plagiarism, the unacknowledged reuse of text, does not end at language boundaries. Cross-language plagiarism occurs when a text is translated from a fragment written in a different language and no proper citation is provided. Regardless of the change of language, the contents and, in particular, the ideas remain the same. Whereas different methods for the detection of monolingual plagiarism have been developed, less attention has been paid to the cross-language case. In this paper we compare two recently proposed cross-language plagiarism detection methods (CL-CNG, based on character n-grams, and CL-ASA, based on statistical translation) with a novel approach to this problem based on machine translation and monolingual similarity analysis (T+MA). We explore the effectiveness of the three approaches for less related languages. CL-CNG proves not to be appropriate for this kind of language pair, whereas T+MA performs better than the previously proposed models.
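
To make the character n-gram idea behind CL-CNG concrete, here is a minimal illustrative sketch. It is not the authors' implementation: the n-gram length (3), the text normalisation, the use of cosine similarity over raw counts, and the function names `char_ngrams`, `cosine` and `cl_cng_similarity` are all assumptions chosen for illustration, since this record does not state the paper's settings.

```python
from collections import Counter
from math import sqrt
import re


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams after a simple normalisation."""
    # Assumed normalisation: lowercase, drop punctuation, collapse whitespace.
    cleaned = re.sub(r"[^\w\s]", "", text.lower())
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no n-grams to compare
    return dot / (norm_a * norm_b)


def cl_cng_similarity(x: str, y: str, n: int = 3) -> float:
    """Hypothetical helper: compare two texts by their character n-gram profiles."""
    return cosine(char_ngrams(x, n), char_ngrams(y, n))


if __name__ == "__main__":
    english = "international conference on computational linguistics"
    spanish = "conferencia internacional de lingüística computacional"
    # Related languages share cognates, so some n-grams overlap.
    print(cl_cng_similarity(english, spanish))
```

A measure of this kind depends on shared character sequences between the two texts, which is consistent with the abstract's finding that CL-CNG is not appropriate for less related language pairs, where few such sequences are expected.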

Original language: English
Title of host publication: Coling 2010 - 23rd International Conference on Computational Linguistics, Proceedings of the Conference
Pages: 37-45
Number of pages: 9
Volume: 2
Publication status: Published - 2010
Externally published: Yes
Event: 23rd International Conference on Computational Linguistics, Coling 2010 - Beijing
Duration: 23 Aug 2010 to 27 Aug 2010

Fingerprint

Plagiarism
Language
Cross-language
Citations
N-gram
Reuse
Machine Translation

ASJC Scopus subject areas

  • Language and Linguistics
  • Computational Theory and Mathematics
  • Linguistics and Language

Cite this

Barron, A., Rosso, P., Agirre, E., & Labaka, G. (2010). Plagiarism detection across distant language pairs. In Coling 2010 - 23rd International Conference on Computational Linguistics, Proceedings of the Conference (Vol. 2, pp. 37-45)

@inproceedings{291ba35cf46d407fbc46fcaef18fcfa8,
title = "Plagiarism detection across distant language pairs",
author = "Alberto Barron and Paolo Rosso and Eneko Agirre and Gorka Labaka",
year = "2010",
language = "English",
volume = "2",
pages = "37--45",
booktitle = "Coling 2010 - 23rd International Conference on Computational Linguistics, Proceedings of the Conference",

}

TY - GEN

T1 - Plagiarism detection across distant language pairs

AU - Barron, Alberto

AU - Rosso, Paolo

AU - Agirre, Eneko

AU - Labaka, Gorka

PY - 2010

Y1 - 2010

UR - http://www.scopus.com/inward/record.url?scp=80053424869&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=80053424869&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:80053424869

VL - 2

SP - 37

EP - 45

BT - Coling 2010 - 23rd International Conference on Computational Linguistics, Proceedings of the Conference

ER -