A hidden Markov model to detect coded information islands in free text

Luigi Cerulo, Michele Ceccarelli, Massimiliano Di Penta, Gerardo Canfora

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

8 Citations (Scopus)

Abstract

Emails and issue reports capture useful knowledge about development practices, bug fixing, and change activities. Extracting such content is challenging because source code is mixed with unstructured natural-language text. In this paper we introduce an approach, based on Hidden Markov Models (HMMs), to extract coded information islands, such as source code, stack traces, and patches, from free text at a token level of granularity. We train an HMM for each category of information contained in the text, and adopt the Viterbi algorithm to recognize whether the sequence of tokens (e.g., words, language keywords, numbers, parentheses, punctuation marks) observed in a text switches among those HMMs. Although our implementation focuses on extracting source code from emails, the approach could easily be extended, in principle, to any text-interleaved language. We evaluated our approach against the state of the art on a set of development emails and bug reports drawn from the software repositories of well-known open source systems. Results indicate an accuracy between 82% and 99%, in line with existing approaches that, unlike ours, require the manual definition of regular expressions or parsers.
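
The token-level labeling described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it collapses the per-category HMMs into a single two-state model ("text" and "code"), and every probability, the keyword list, the tokenizer, and the function names are hypothetical values chosen by hand for illustration, whereas the paper trains its models on data. It only shows how Viterbi decoding assigns each token to a category and where the sequence switches.

```python
import math
import re

STATES = ("text", "code")

# Hypothetical, hand-picked parameters (NOT taken from the paper).
START = {"text": 0.8, "code": 0.2}
TRANS = {
    "text": {"text": 0.9, "code": 0.1},
    "code": {"text": 0.1, "code": 0.9},
}
EMIT = {
    "text": {"word": 0.80, "keyword": 0.05, "number": 0.05, "punct": 0.10},
    "code": {"word": 0.30, "keyword": 0.20, "number": 0.10, "punct": 0.40},
}
# Hypothetical keyword list standing in for "language keywords".
KEYWORDS = {"int", "return", "if", "else", "for", "while", "void", "class"}


def tokenize(text):
    """Split text into identifiers/words, integers, and single punctuation marks."""
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\w\s]", text)


def symbol(token):
    """Map a token to one of the four observation symbols used in EMIT."""
    if token in KEYWORDS:
        return "keyword"
    if token.isdigit():
        return "number"
    if token[0].isalpha() or token[0] == "_":
        return "word"
    return "punct"


def viterbi(symbols):
    """Return the most likely state sequence for a list of observation symbols."""
    if not symbols:
        return []
    # Log-probabilities avoid underflow on long token sequences.
    score = [{s: math.log(START[s]) + math.log(EMIT[s][symbols[0]]) for s in STATES}]
    back = []
    for sym in symbols[1:]:
        row, ptr = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[-1][p] + math.log(TRANS[p][s]))
            row[s] = score[-1][prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][sym])
            ptr[s] = prev
        score.append(row)
        back.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


def label_tokens(text):
    """Pair each token of the text with its decoded state."""
    tokens = tokenize(text)
    return list(zip(tokens, viterbi([symbol(t) for t in tokens])))
```

Calling `label_tokens` on an email body returns one `(token, state)` pair per token; runs of consecutive "code" labels are the candidate islands. With trained transition and emission probabilities in place of the hand-picked ones, the same decoding step applies unchanged.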

Original language: English
Title of host publication: IEEE 13th International Working Conference on Source Code Analysis and Manipulation, SCAM 2013
Publisher: IEEE Computer Society
Pages: 157-166
Number of pages: 10
DOIs: https://doi.org/10.1109/SCAM.2013.6648197
Publication status: Published - 1 Jan 2013
Externally published: Yes
Event: 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation, SCAM 2013 - Eindhoven, Netherlands
Duration: 22 Sep 2013 - 23 Sep 2013

Other

Other: 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation, SCAM 2013
Country: Netherlands
City: Eindhoven
Period: 22/9/13 - 23/9/13

Fingerprint

Electronic mail
Hidden Markov models
Viterbi algorithm
Switches

Keywords

  • HMM
  • Mailing list mining
  • Natural language parsing

ASJC Scopus subject areas

  • Software

Cite this

Cerulo, L., Ceccarelli, M., Di Penta, M., & Canfora, G. (2013). A hidden Markov model to detect coded information islands in free text. In IEEE 13th International Working Conference on Source Code Analysis and Manipulation, SCAM 2013 (pp. 157-166). [6648197] IEEE Computer Society. https://doi.org/10.1109/SCAM.2013.6648197

@inproceedings{149a8e1d2041441685809004480b4ca3,
title = "A hidden {M}arkov model to detect coded information islands in free text",
abstract = "Emails and issue reports capture useful knowledge about development practices, bug fixing, and change activities. Extracting such content is challenging because source code is mixed with unstructured natural-language text. In this paper we introduce an approach, based on Hidden Markov Models (HMMs), to extract coded information islands, such as source code, stack traces, and patches, from free text at a token level of granularity. We train an HMM for each category of information contained in the text, and adopt the Viterbi algorithm to recognize whether the sequence of tokens (e.g., words, language keywords, numbers, parentheses, punctuation marks) observed in a text switches among those HMMs. Although our implementation focuses on extracting source code from emails, the approach could easily be extended, in principle, to any text-interleaved language. We evaluated our approach against the state of the art on a set of development emails and bug reports drawn from the software repositories of well-known open source systems. Results indicate an accuracy between 82{\%} and 99{\%}, in line with existing approaches that, unlike ours, require the manual definition of regular expressions or parsers.",
keywords = "HMM, Mailing list mining, Natural language parsing",
author = "Luigi Cerulo and Michele Ceccarelli and {Di Penta}, Massimiliano and Gerardo Canfora",
year = "2013",
month = "1",
day = "1",
doi = "10.1109/SCAM.2013.6648197",
language = "English",
pages = "157--166",
booktitle = "IEEE 13th International Working Conference on Source Code Analysis and Manipulation, SCAM 2013",
publisher = "IEEE Computer Society",

}

TY - GEN
T1 - A hidden Markov model to detect coded information islands in free text
AU - Cerulo, Luigi
AU - Ceccarelli, Michele
AU - Di Penta, Massimiliano
AU - Canfora, Gerardo
PY - 2013/1/1
Y1 - 2013/1/1
AB - Emails and issue reports capture useful knowledge about development practices, bug fixing, and change activities. Extracting such content is challenging because source code is mixed with unstructured natural-language text. In this paper we introduce an approach, based on Hidden Markov Models (HMMs), to extract coded information islands, such as source code, stack traces, and patches, from free text at a token level of granularity. We train an HMM for each category of information contained in the text, and adopt the Viterbi algorithm to recognize whether the sequence of tokens (e.g., words, language keywords, numbers, parentheses, punctuation marks) observed in a text switches among those HMMs. Although our implementation focuses on extracting source code from emails, the approach could easily be extended, in principle, to any text-interleaved language. We evaluated our approach against the state of the art on a set of development emails and bug reports drawn from the software repositories of well-known open source systems. Results indicate an accuracy between 82% and 99%, in line with existing approaches that, unlike ours, require the manual definition of regular expressions or parsers.
KW - HMM
KW - Mailing list mining
KW - Natural language parsing
UR - http://www.scopus.com/inward/record.url?scp=84891053714&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84891053714&partnerID=8YFLogxK
U2 - 10.1109/SCAM.2013.6648197
DO - 10.1109/SCAM.2013.6648197
M3 - Conference contribution
SP - 157
EP - 166
BT - IEEE 13th International Working Conference on Source Code Analysis and Manipulation, SCAM 2013
PB - IEEE Computer Society
ER -