Learning sequential classifiers from long and noisy discrete-event sequences efficiently

Gessé Dafé, Adriano Veloso, Mohammed Zaki, Wagner Meira

Research output: Contribution to journal, Article

6 Citations (Scopus)

Abstract

A variety of applications, such as information extraction, intrusion detection and protein fold recognition, can be expressed as sequences of discrete events or elements rather than unordered sets of features; that is, there is an order dependence among the elements composing each data instance. These applications may be modeled as classification problems, in which case the classifier should exploit sequential interactions among the elements so that the ordering relationship among them is properly captured. Dominant approaches to this problem include (i) learning Hidden Markov Models, (ii) exploiting frequent sequences extracted from the data and (iii) computing string kernels. Such approaches, however, are computationally hard and vulnerable to noise, especially if the data shows long-range dependencies (i.e., long subsequences are necessary to model the data). In this paper we provide simple algorithms that build highly effective sequential classifiers. Our algorithms enumerate approximately contiguous subsequences from the training set on a demand-driven basis, using a lightweight and flexible subsequence matching function and an innovative subsequence enumeration strategy called pattern silhouettes; this makes our learning algorithms fast and the corresponding classifiers robust to noisy data. Our empirical results on a variety of datasets indicate that the best trade-off between accuracy and learning time is usually obtained by limiting the length of the subsequences by a factor of log n, which leads to an O(n log n) learning cost (where n is the length of the sequence being classified). Finally, we show that in most cases our classifiers are faster than existing solutions (sometimes by orders of magnitude), while also providing significant accuracy improvements in most of the evaluated cases.
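
To make two ideas from the abstract concrete, the sketch below shows (a) why capping subsequence length at log n keeps enumeration near n log n, and (b) what a gap-tolerant ("approximately contiguous") match could look like. It is an illustration under stated assumptions, not the paper's algorithm: the function names, the base-2 logarithm, and the `max_gap` parameter are all hypothetical, and pattern silhouettes are not modeled here.

```python
import math


def max_pattern_length(n):
    """Hypothetical cap on pattern length: about log2(n), never below 1."""
    return max(1, int(math.log2(n))) if n > 1 else 1


def candidate_windows(sequence):
    """Yield contiguous windows of the sequence with length <= log2(n).

    At most n * log2(n) windows are produced, which is one way an
    O(n log n) enumeration bound can arise.
    """
    n = len(sequence)
    cap = max_pattern_length(n)
    for start in range(n):
        for length in range(1, cap + 1):
            if start + length <= n:
                yield sequence[start:start + length]


def matches_with_gaps(pattern, sequence, max_gap):
    """Return True if pattern occurs in order within sequence, with at most
    max_gap skipped elements between consecutive matched elements.

    This is an assumed, simplified notion of approximate contiguity.
    """
    if not pattern:
        return True
    # Positions where the first pattern element can be matched.
    positions = {i for i, s in enumerate(sequence) if s == pattern[0]}
    for symbol in pattern[1:]:
        next_positions = set()
        for i in positions:
            # Look ahead at most max_gap + 1 steps from the previous match.
            for j in range(i + 1, min(i + max_gap + 2, len(sequence))):
                if sequence[j] == symbol:
                    next_positions.add(j)
        positions = next_positions
        if not positions:
            return False
    return bool(positions)
```

For example, under these assumptions `matches_with_gaps("abc", "axbxc", max_gap=1)` accepts the noisy sequence, while `max_gap=0` requires an exact contiguous occurrence and rejects it.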

Original language: English
Pages (from-to): 1685-1708
Number of pages: 24
Journal: Data Mining and Knowledge Discovery
Volume: 29
Issue number: 6
Publication status: Published - 4 Nov 2014
Externally published: Yes

Keywords

  • Approximately contiguous sequences
  • Efficient learning
  • Long range sequences
  • Partial matching
  • Sequential classifiers

ASJC Scopus subject areas

  • Information Systems
  • Computer Science Applications
  • Computer Networks and Communications
