Learning sequential classifiers from long and noisy discrete-event sequences efficiently

Gessé Dafé, Adriano Veloso, Mohammed Zaki, Wagner Meira

Research output: Contribution to journal › Article

6 Citations (Scopus)

Abstract

A variety of applications, such as information extraction, intrusion detection and protein fold recognition, can be expressed as sequences of discrete events or elements (rather than unordered sets of features); that is, there is an order dependence among the elements composing each data instance. These applications may be modeled as classification problems, and in this case the classifier should exploit sequential interactions among the elements, so that the ordering relationship among them is properly captured. Dominant approaches to this problem include: (i) learning Hidden Markov Models, (ii) exploiting frequent sequences extracted from the data and (iii) computing string kernels. Such approaches, however, are computationally hard and vulnerable to noise, especially if the data shows long-range dependencies (i.e., long subsequences are necessary in order to model the data). In this paper we provide simple algorithms that build highly effective sequential classifiers. Our algorithms are based on enumerating approximately contiguous subsequences from the training set on a demand-driven basis, exploiting a lightweight and flexible subsequence matching function and an innovative subsequence enumeration strategy called pattern silhouettes, making our learning algorithms fast and the corresponding classifiers robust to noisy data. Our empirical results on a variety of datasets indicate that the best trade-off between accuracy and learning time is usually obtained by limiting the length of the subsequences by a factor of log n, which leads to an O(n log n) learning cost (where n is the length of the sequence being classified). Finally, we show that, in most cases, our classifiers are faster than existing solutions (sometimes by orders of magnitude), while also providing significant accuracy improvements in most of the evaluated cases.
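The paper's pattern-silhouette strategy is not reproduced here, but the two ideas the abstract names can be sketched minimally in Python: enumerating contiguous subsequences whose length is capped at log n (which gives the stated O(n log n) enumeration cost per sequence of length n), and a lightweight approximate matcher that tolerates a bounded number of interleaved noise elements. The function names and the `max_gaps` parameter are illustrative assumptions, not the authors' API.

```python
import math

def enumerate_subsequences(seq, max_len=None):
    """Yield contiguous subsequences of `seq`, capping their length at
    log2(n) as in the abstract's O(n log n) bound. (Illustrative sketch,
    not the paper's pattern-silhouette strategy.)"""
    n = len(seq)
    if max_len is None:
        max_len = max(1, int(math.log2(n))) if n > 1 else 1
    for start in range(n):
        for length in range(1, min(max_len, n - start) + 1):
            yield tuple(seq[start:start + length])

def matches_approximately(pattern, seq, max_gaps=1):
    """Lightweight approximate matching: `pattern` must occur in order
    within some window of `seq`, with at most `max_gaps` extra (noise)
    elements interleaved. An assumed stand-in for the paper's flexible
    subsequence matching function."""
    m = len(pattern)
    if m == 0:
        return True
    for start in range(len(seq) - m + 1):
        window = seq[start:start + m + max_gaps]
        i = 0
        for item in window:  # greedy in-order match inside the window
            if i < m and item == pattern[i]:
                i += 1
        if i == m:
            return True
    return False
```

With building blocks like these, a demand-driven learner would enumerate only the length-capped subsequences relevant to the test sequence being classified, rather than mining all frequent patterns up front.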

Original language: English
Pages (from-to): 1685-1708
Number of pages: 24
Journal: Data Mining and Knowledge Discovery
Volume: 29
Issue number: 6
DOI: 10.1007/s10618-014-0391-9
Publication status: Published - 4 Nov 2014
Externally published: Yes

Keywords

  • Approximately contiguous sequences
  • Efficient learning
  • Long range sequences
  • Partial matching
  • Sequential classifiers

ASJC Scopus subject areas

  • Information Systems
  • Computer Science Applications
  • Computer Networks and Communications

Cite this

Learning sequential classifiers from long and noisy discrete-event sequences efficiently. / Dafé, Gessé; Veloso, Adriano; Zaki, Mohammed; Meira, Wagner.

In: Data Mining and Knowledge Discovery, Vol. 29, No. 6, 04.11.2014, p. 1685-1708.

@article{2decbdff50734eaf871c03e789a35609,
title = "Learning sequential classifiers from long and noisy discrete-event sequences efficiently",
abstract = "A variety of applications, such as information extraction, intrusion detection and protein fold recognition, can be expressed as sequences of discrete events or elements (rather than unordered sets of features), that is, there is an order dependence among the elements composing each data instance. These applications may be modeled as classification problems, and in this case the classifier should exploit sequential interactions among the elements, so that the ordering relationship among them is properly captured. Dominant approaches to this problem include: (i) learning Hidden Markov Models, (ii) exploiting frequent sequences extracted from the data and (iii) computing string kernels. Such approaches, however, are computationally hard and vulnerable to noise, especially if the data shows long range dependencies (i.e., long subsequences are necessary in order to model the data). In this paper we provide simple algorithms that build highly effective sequential classifiers. Our algorithms are based on enumerating approximately contiguous subsequences from the training set on a demand-driven basis, exploiting a lightweight and flexible subsequence matching function and an innovative subsequence enumeration strategy called pattern silhouettes, making our learning algorithms fast and the corresponding classifiers robust to noisy data. Our empirical results on a variety of datasets indicate that the best trade-off between accuracy and learning time is usually obtained by limiting the length of the subsequences by a factor of log n, which leads to a O(n log n) learning cost (where n is the length of the sequence being classified). Finally, we show that, in most of the cases, our classifiers are faster than existing solutions (sometimes, by orders of magnitude), also providing significant accuracy improvements in most of the evaluated cases.",
keywords = "Approximately contiguous sequences, Efficient learning, Long range sequences, Partial matching, Sequential classifiers",
author = "Gess{\'e} Daf{\'e} and Adriano Veloso and Mohammed Zaki and Wagner Meira",
year = "2014",
month = "11",
day = "4",
doi = "10.1007/s10618-014-0391-9",
language = "English",
volume = "29",
pages = "1685--1708",
journal = "Data Mining and Knowledge Discovery",
issn = "1384-5810",
publisher = "Springer Netherlands",
number = "6",

}

TY - JOUR

T1 - Learning sequential classifiers from long and noisy discrete-event sequences efficiently

AU - Dafé, Gessé

AU - Veloso, Adriano

AU - Zaki, Mohammed

AU - Meira, Wagner

PY - 2014/11/4

Y1 - 2014/11/4

AB - A variety of applications, such as information extraction, intrusion detection and protein fold recognition, can be expressed as sequences of discrete events or elements (rather than unordered sets of features), that is, there is an order dependence among the elements composing each data instance. These applications may be modeled as classification problems, and in this case the classifier should exploit sequential interactions among the elements, so that the ordering relationship among them is properly captured. Dominant approaches to this problem include: (i) learning Hidden Markov Models, (ii) exploiting frequent sequences extracted from the data and (iii) computing string kernels. Such approaches, however, are computationally hard and vulnerable to noise, especially if the data shows long range dependencies (i.e., long subsequences are necessary in order to model the data). In this paper we provide simple algorithms that build highly effective sequential classifiers. Our algorithms are based on enumerating approximately contiguous subsequences from the training set on a demand-driven basis, exploiting a lightweight and flexible subsequence matching function and an innovative subsequence enumeration strategy called pattern silhouettes, making our learning algorithms fast and the corresponding classifiers robust to noisy data. Our empirical results on a variety of datasets indicate that the best trade-off between accuracy and learning time is usually obtained by limiting the length of the subsequences by a factor of log n, which leads to a O(n log n) learning cost (where n is the length of the sequence being classified). Finally, we show that, in most of the cases, our classifiers are faster than existing solutions (sometimes, by orders of magnitude), also providing significant accuracy improvements in most of the evaluated cases.

KW - Approximately contiguous sequences

KW - Efficient learning

KW - Long range sequences

KW - Partial matching

KW - Sequential classifiers

UR - http://www.scopus.com/inward/record.url?scp=84942504545&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84942504545&partnerID=8YFLogxK

U2 - 10.1007/s10618-014-0391-9

DO - 10.1007/s10618-014-0391-9

M3 - Article

VL - 29

SP - 1685

EP - 1708

JO - Data Mining and Knowledge Discovery

JF - Data Mining and Knowledge Discovery

SN - 1384-5810

IS - 6

ER -