Cross-language source code re-use detection using latent semantic analysis

Enrique Flores, Alberto Barron, Lidia Moreno, Paolo Rosso

Research output: Contribution to journal > Article

7 Citations (Scopus)

Abstract

Nowadays, the Internet is the main source of information: blogs, encyclopedias, discussion forums, source code repositories, and many other resources are just one click away. The temptation to re-use these materials is very high. Even source code is easily available through a simple Web search, so there is a need to detect potential instances of source code re-use. Source code re-use detection has usually been approached by comparing source codes in their compiled versions. For cross-language source code re-use, such traditional approaches can handle only the programming languages supported by the compiler. We assume that source code is a piece of text with its own syntax and structure, and therefore aim to apply models for free-text re-use detection to source code. In this paper we compare a Latent Semantic Analysis (LSA) approach with previously used text re-use detection models for measuring cross-language similarity in source code. The LSA-based approach shows slightly better results than the other models and is able to distinguish between re-used and related source codes with high performance.
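As a rough illustration of the general idea of treating source code as text and comparing it in an LSA space, a minimal sketch follows. It is not the authors' implementation: the tokenizer, the raw-count weighting, the choice of k, and the names tokenize and lsa_similarity are all assumptions made for this example only.

```python
import re
import numpy as np


def tokenize(code: str) -> list[str]:
    """Lowercase the code and split it into identifiers, numbers and single symbols.

    Hypothetical tokenizer chosen for this sketch only.
    """
    return re.findall(r"[a-z_]\w*|\d+|\S", code.lower())


def lsa_similarity(docs: list[str], k: int = 2) -> np.ndarray:
    """Return pairwise cosine similarities of docs in a k-dimensional latent space."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    index = {t: i for i, t in enumerate(vocab)}

    # Term-document matrix of raw counts (rows: terms, columns: documents).
    # Raw counts are an assumption; other weightings such as tf-idf are common.
    A = np.zeros((len(vocab), len(docs)))
    for j, toks in enumerate(tokenized):
        for t in toks:
            A[index[t], j] += 1

    # Truncated SVD: keep only the k largest singular values.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = min(k, len(s))
    doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T  # one row per document

    # Cosine similarity between the latent document vectors.
    norms = np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # guard against division by zero
    unit = doc_vecs / norms
    return unit @ unit.T


if __name__ == "__main__":
    snippets = [
        "int add(int a, int b) { return a + b; }",  # C
        "def add(a, b):\n    return a + b",         # Python
        "print('hello world')",                     # unrelated Python
    ]
    print(np.round(lsa_similarity(snippets, k=2), 3))
```

The result is a symmetric matrix whose entry (i, j) is the latent-space similarity between snippets i and j. A detector along these lines would presumably flag pairs whose similarity exceeds some threshold, which would have to be tuned on labelled data.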

Original language: English
Pages (from-to): 1708-1725
Number of pages: 18
Journal: Journal of Universal Computer Science
Volume: 21
Issue number: 13
Publication status: Published - 2015

Fingerprint

Latent Semantic Analysis
Reuse
Semantics
Blogs
Computer programming languages
Internet
Language
Compiler
Repository
Programming Languages
High Performance

Keywords

  • Cross-language re-use detection
  • Latent semantic analysis
  • Plagiarism
  • Source code

ASJC Scopus subject areas

  • Computer Science(all)
  • Theoretical Computer Science

Cite this

Cross-language source code re-use detection using latent semantic analysis. / Flores, Enrique; Barron, Alberto; Moreno, Lidia; Rosso, Paolo.

In: Journal of Universal Computer Science, Vol. 21, No. 13, 2015, p. 1708-1725.

Flores, Enrique ; Barron, Alberto ; Moreno, Lidia ; Rosso, Paolo. / Cross-language source code re-use detection using latent semantic analysis. In: Journal of Universal Computer Science. 2015 ; Vol. 21, No. 13. pp. 1708-1725.
@article{bd2bfd5ef9954f858912b909e39cdf7a,
title = "Cross-language source code re-use detection using latent semantic analysis",
keywords = "Cross-language re-use detection, Latent semantic analysis, Plagiarism, Source code",
author = "Enrique Flores and Alberto Barron and Lidia Moreno and Paolo Rosso",
year = "2015",
language = "English",
volume = "21",
pages = "1708--1725",
journal = "Journal of Universal Computer Science",
issn = "0948-6968",
publisher = "Springer Verlag",
number = "13",
}

TY  - JOUR
T1  - Cross-language source code re-use detection using latent semantic analysis
AU  - Flores, Enrique
AU  - Barron, Alberto
AU  - Moreno, Lidia
AU  - Rosso, Paolo
PY  - 2015
Y1  - 2015
KW  - Cross-language re-use detection
KW  - Latent semantic analysis
KW  - Plagiarism
KW  - Source code
UR  - http://www.scopus.com/inward/record.url?scp=84959461104&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=84959461104&partnerID=8YFLogxK
M3  - Article
VL  - 21
SP  - 1708
EP  - 1725
JO  - Journal of Universal Computer Science
JF  - Journal of Universal Computer Science
SN  - 0948-6968
IS  - 13
ER  -