Towards the detection of cross-language source code reuse

Enrique Flores, Alberto Barron, Paolo Rosso, Lidia Moreno

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

17 Citations (Scopus)

Abstract

The Internet has made huge amounts of information available, including source code. Source code repositories and, more generally, programming-related websites facilitate its reuse. In this work, we propose a simple approach to the detection of cross-language source code reuse, a barely investigated problem. Our preliminary experiments, based on character n-gram comparison, show that considering different sections of the code (e.g., comments, code, reserved words) leads to different results. For three programming languages (C++, Java, and Python), the best result is obtained when comments are discarded and the entire source code is considered.
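The character n-gram comparison mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the n-gram length, the weighting scheme, or the similarity measure, so the choices here (trigrams, raw counts, cosine similarity) are assumptions, and the helper names `strip_comments`, `char_ngrams`, `cosine_similarity`, and `reuse_score` are hypothetical rather than taken from the paper.

```python
import math
import re
from collections import Counter


def strip_comments(source: str, language: str) -> str:
    """Crudely remove comments; string literals are not handled (assumption)."""
    if language == "python":
        return re.sub(r"#[^\n]*", " ", source)
    # C++ and Java: block comments first, then line comments.
    source = re.sub(r"/\*.*?\*/", " ", source, flags=re.DOTALL)
    return re.sub(r"//[^\n]*", " ", source)


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams after lowercasing and
    collapsing whitespace. n = 3 is an illustrative choice."""
    text = re.sub(r"\s+", " ", text.lower()).strip()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def reuse_score(src_a: str, lang_a: str, src_b: str, lang_b: str, n: int = 3) -> float:
    """Similarity of two programs with comments discarded and the
    remaining source compared as a whole."""
    grams_a = char_ngrams(strip_comments(src_a, lang_a), n)
    grams_b = char_ngrams(strip_comments(src_b, lang_b), n)
    return cosine_similarity(grams_a, grams_b)


java_code = """
// sum of a list
int total = 0;
for (int i = 0; i < values.length; i++) { total += values[i]; }
"""
python_code = """
# sum of a list
total = 0
for i in range(len(values)): total += values[i]
"""
print(reuse_score(java_code, "java", python_code, "python"))
```

A score near 1 suggests heavily shared character sequences and a score near 0 suggests little overlap. Any decision threshold would have to be tuned on labelled data, which this sketch does not attempt.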

Original language: English
Title of host publication: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Pages: 250-253
Number of pages: 4
Volume: 6716 LNCS
DOIs: https://doi.org/10.1007/978-3-642-22327-3_31
Publication status: Published - 2011
Externally published: Yes
Event: 16th International Conference on Applications of Natural Language to Information Systems, NLDB 2011 - Alicante
Duration: 28 Jun 2011 to 30 Jun 2011

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 6716 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 16th International Conference on Applications of Natural Language to Information Systems, NLDB 2011
City: Alicante
Period: 28/6/11 to 30/6/11

Fingerprint

Computer programming languages
Reuse
Websites
Internet
Experiments
Python
N-gram
C++
Repository
Java
Programming Languages
Language
Programming
Entire
Experiment

Keywords

  • cross-language source code reuse analysis
  • plagiarism detection
  • Source code reuse

ASJC Scopus subject areas

  • Computer Science (all)
  • Theoretical Computer Science

Cite this

Flores, E., Barron, A., Rosso, P., & Moreno, L. (2011). Towards the detection of cross-language source code reuse. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 6716 LNCS, pp. 250-253). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 6716 LNCS). https://doi.org/10.1007/978-3-642-22327-3_31

@inproceedings{fbb9433ae32244e59d3f2d5a9c0d72eb,
title = "Towards the detection of cross-language source code reuse",
abstract = "The Internet has made huge amounts of information available, including source code. Source code repositories and, more generally, programming-related websites facilitate its reuse. In this work, we propose a simple approach to the detection of cross-language source code reuse, a barely investigated problem. Our preliminary experiments, based on character n-gram comparison, show that considering different sections of the code (e.g., comments, code, reserved words) leads to different results. For three programming languages (C++, Java, and Python), the best result is obtained when comments are discarded and the entire source code is considered.",
keywords = "cross-language source code reuse analysis, plagiarism detection, Source code reuse",
author = "Enrique Flores and Alberto Barron and Paolo Rosso and Lidia Moreno",
year = "2011",
doi = "10.1007/978-3-642-22327-3_31",
language = "English",
isbn = "9783642223266",
volume = "6716 LNCS",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
pages = "250--253",
booktitle = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

}

TY - GEN
T1 - Towards the detection of cross-language source code reuse
AU - Flores, Enrique
AU - Barron, Alberto
AU - Rosso, Paolo
AU - Moreno, Lidia
PY - 2011
Y1 - 2011
AB - The Internet has made huge amounts of information available, including source code. Source code repositories and, more generally, programming-related websites facilitate its reuse. In this work, we propose a simple approach to the detection of cross-language source code reuse, a barely investigated problem. Our preliminary experiments, based on character n-gram comparison, show that considering different sections of the code (e.g., comments, code, reserved words) leads to different results. For three programming languages (C++, Java, and Python), the best result is obtained when comments are discarded and the entire source code is considered.
KW - cross-language source code reuse analysis
KW - plagiarism detection
KW - Source code reuse
UR - http://www.scopus.com/inward/record.url?scp=79959683657&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=79959683657&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-22327-3_31
DO - 10.1007/978-3-642-22327-3_31
M3 - Conference contribution
AN - SCOPUS:79959683657
SN - 9783642223266
VL - 6716 LNCS
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 250
EP - 253
BT - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
ER -