Irish: A Hidden Markov Model to detect coded information islands in free text

Luigi Cerulo, Massimiliano Di Penta, Alberto Bacchelli, Michele Ceccarelli, Gerardo Canfora

Research output: Contribution to journal (Article)

3 Citations (Scopus)

Abstract

Developers' communication, as contained in emails, issue trackers, and forums, is a valuable source of information to support the development process. For example, it can be used to capture knowledge about development practice or about a software project itself. Thus, extracting the content of developers' communication can support several software engineering tasks, such as program comprehension, source code analysis, and software analytics. However, automating the extraction process is challenging because of the unstructured nature of free text, which mixes different coding languages (e.g., source code, stack dumps, and log traces) with natural language parts. We conduct an extensive evaluation of Irish (InfoRmation ISlands Hmm), an approach we proposed to extract islands of coded information from free text at token granularity, comparing it with state-of-the-art approaches based on island parsing or on island parsing combined with machine learners. The evaluation considers a wide set of natural language documents (e.g., textbooks, forum discussions, and development emails) taken from different contexts and encompassing different coding languages. Results indicate an F-measure for Irish between 74% and 99%; this is in line with existing approaches, which, unlike Irish, require specific expertise to define regular expressions or grammars.
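
To illustrate the general idea of labelling free text at token granularity with a Hidden Markov Model, here is a minimal sketch in Python. It is not the authors' implementation: the two states (`text`, `code`), the three coarse token classes, the tokenizer, and all probabilities are hypothetical values chosen by hand for illustration, whereas a real model would estimate its parameters from labelled data. Decoding uses the standard Viterbi algorithm in log space.

```python
import math
import re

STATES = ("text", "code")

# Hypothetical, hand-picked parameters (NOT taken from the paper).
START = {"text": 0.8, "code": 0.2}
TRANS = {
    "text": {"text": 0.9, "code": 0.1},
    "code": {"text": 0.1, "code": 0.9},
}
# Emission probabilities over three coarse token classes (an assumption).
EMIT = {
    "text": {"word": 0.85, "symbol": 0.10, "number": 0.05},
    "code": {"word": 0.40, "symbol": 0.45, "number": 0.15},
}


def tokenize(s):
    """Split into identifier-like words, digit runs, and single symbols."""
    return re.findall(r"[A-Za-z_]+|\d+|[^\sA-Za-z_\d]", s)


def token_class(tok):
    """Map a token to the coarse observation class the toy model emits."""
    if tok[0].isalpha() or tok[0] == "_":
        return "word"
    if tok[0].isdigit():
        return "number"
    return "symbol"


def viterbi(tokens):
    """Return the most likely state label for each token."""
    if not tokens:
        return []
    obs = [token_class(t) for t in tokens]
    # Log-probability of the best path ending in each state.
    score = {s: math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in STATES}
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for o in obs[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            pointers[s] = prev
            new_score[s] = (
                score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][o])
            )
        score = new_score
        back.append(pointers)
    # Follow the back-pointers from the best final state.
    path = [max(STATES, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


if __name__ == "__main__":
    toks = tokenize("The call foo(x, 1); fails on startup")
    for tok, label in zip(toks, viterbi(toks)):
        print(f"{label}\t{tok}")
```

Runs of consecutive `code` labels in the output would correspond to candidate islands. The high self-transition probabilities are the design choice that favours contiguous islands over isolated single-token flips.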

Original language: English
Pages (from-to): 26-43
Number of pages: 18
Journal: Science of Computer Programming
Volume: 105
DOI: 10.1016/j.scico.2014.11.017
Publication status: Published - 1 Jul 2015

Fingerprint

Electronic mail
Hidden Markov models
Textbooks
Communication
Software engineering

Keywords

  • Developers' communication
  • Hidden Markov Models
  • Mining unstructured data

ASJC Scopus subject areas

  • Software

Cite this

Irish: A Hidden Markov Model to detect coded information islands in free text. / Cerulo, Luigi; Di Penta, Massimiliano; Bacchelli, Alberto; Ceccarelli, Michele; Canfora, Gerardo.

In: Science of Computer Programming, Vol. 105, 01.07.2015, p. 26-43.

Cerulo, Luigi; Di Penta, Massimiliano; Bacchelli, Alberto; Ceccarelli, Michele; Canfora, Gerardo. / Irish: A Hidden Markov Model to detect coded information islands in free text. In: Science of Computer Programming. 2015; Vol. 105. pp. 26-43.

@article{1280d8bfa8f44eb5a5bba30fc7f263f3,
title = "Irish: A Hidden Markov Model to detect coded information islands in free text",
abstract = "Developers' communication, as contained in emails, issue trackers, and forums, is a precious source of information to support the development process. For example, it can be used to capture knowledge about development practice or about a software project itself. Thus, extracting the content of developers' communication can be useful to support several software engineering tasks, such as program comprehension, source code analysis, and software analytics. However, automating the extraction process is challenging, due to the unstructured nature of free text, which mixes different coding languages (e.g., source code, stack dumps, and log traces) with natural language parts. We conduct an extensive evaluation of Irish (InfoRmation ISlands Hmm), an approach we proposed to extract islands of coded information from free text at token granularity, with respect to the state of art approaches based on island parsing or island parsing combined with machine learners. The evaluation considers a wide set of natural language documents (e.g., textbooks, forum discussions, and development emails) taken from different contexts and encompassing different coding languages. Results indicate an F-measure of Irish between 74{\%} and 99{\%}; this is in line with existing approaches which, differently from Irish, require specific expertise for the definition of regular expressions or grammars.",
keywords = "Developers' communication, Hidden Markov Models, Mining unstructured data",
author = "Luigi Cerulo and {Di Penta}, Massimiliano and Alberto Bacchelli and Michele Ceccarelli and Gerardo Canfora",
year = "2015",
month = "7",
day = "1",
doi = "10.1016/j.scico.2014.11.017",
language = "English",
volume = "105",
pages = "26--43",
journal = "Science of Computer Programming",
issn = "0167-6423",
publisher = "Elsevier",

}

TY - JOUR

T1 - Irish

T2 - A Hidden Markov Model to detect coded information islands in free text

AU - Cerulo, Luigi

AU - Di Penta, Massimiliano

AU - Bacchelli, Alberto

AU - Ceccarelli, Michele

AU - Canfora, Gerardo

PY - 2015/7/1

Y1 - 2015/7/1

N2 - Developers' communication, as contained in emails, issue trackers, and forums, is a precious source of information to support the development process. For example, it can be used to capture knowledge about development practice or about a software project itself. Thus, extracting the content of developers' communication can be useful to support several software engineering tasks, such as program comprehension, source code analysis, and software analytics. However, automating the extraction process is challenging, due to the unstructured nature of free text, which mixes different coding languages (e.g., source code, stack dumps, and log traces) with natural language parts. We conduct an extensive evaluation of Irish (InfoRmation ISlands Hmm), an approach we proposed to extract islands of coded information from free text at token granularity, with respect to the state of art approaches based on island parsing or island parsing combined with machine learners. The evaluation considers a wide set of natural language documents (e.g., textbooks, forum discussions, and development emails) taken from different contexts and encompassing different coding languages. Results indicate an F-measure of Irish between 74% and 99%; this is in line with existing approaches which, differently from Irish, require specific expertise for the definition of regular expressions or grammars.

KW - Developers' communication

KW - Hidden Markov Models

KW - Mining unstructured data

UR - http://www.scopus.com/inward/record.url?scp=84929711948&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84929711948&partnerID=8YFLogxK

U2 - 10.1016/j.scico.2014.11.017

DO - 10.1016/j.scico.2014.11.017

M3 - Article

VL - 105

SP - 26

EP - 43

JO - Science of Computer Programming

JF - Science of Computer Programming

SN - 0167-6423

ER -