### Abstract

A variety of applications, such as information extraction, intrusion detection and protein fold recognition, can be expressed as sequences of discrete events or elements (rather than unordered sets of features); that is, there is an order dependence among the elements composing each data instance. These applications may be modeled as classification problems, and in this case the classifier should exploit sequential interactions among the elements, so that the ordering relationship among them is properly captured. Dominant approaches to this problem include: (i) learning Hidden Markov Models, (ii) exploiting frequent sequences extracted from the data and (iii) computing string kernels. Such approaches, however, are computationally hard and vulnerable to noise, especially if the data exhibits long-range dependencies (i.e., long subsequences are necessary to model the data). In this paper we provide simple algorithms that build highly effective sequential classifiers. Our algorithms are based on enumerating approximately contiguous subsequences from the training set on a demand-driven basis, exploiting a lightweight and flexible subsequence matching function and an innovative subsequence enumeration strategy called pattern silhouettes, making our learning algorithms fast and the corresponding classifiers robust to noisy data. Our empirical results on a variety of datasets indicate that the best trade-off between accuracy and learning time is usually obtained by limiting the length of the subsequences by a factor of log n, which leads to an O(n log n) learning cost (where n is the length of the sequence being classified). Finally, we show that our classifiers are usually faster than existing solutions (sometimes by orders of magnitude), while also providing significant accuracy improvements in most of the evaluated cases.
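The paper's pattern-silhouette algorithm is not reproduced in this abstract, but its two central ideas can be illustrated with a minimal sketch: enumerating only contiguous subsequences of length at most log n (which bounds the number of candidates by O(n log n)), and matching subsequences approximately, tolerating a few interleaved noise events. The function names (`bounded_subsequences`, `approx_match`) and the `max_gap` parameter below are illustrative assumptions, not the authors' actual implementation.

```python
import math

def bounded_subsequences(seq):
    """Enumerate all contiguous subsequences of length at most
    floor(log2(n)) + 1 from a sequence of n discrete events.
    Illustrates the O(n log n) candidate bound discussed in the
    abstract; NOT the paper's pattern-silhouette strategy."""
    n = len(seq)
    if n == 0:
        return
    max_len = int(math.log2(n)) + 1
    for start in range(n):
        for length in range(1, min(max_len, n - start) + 1):
            yield tuple(seq[start:start + length])

def approx_match(pattern, seq, max_gap=1):
    """Approximate-contiguity check (an assumed stand-in for the
    paper's flexible matching function): pattern elements must occur
    in order in seq, with at most `max_gap` skipped (noise) events
    between consecutive matched elements."""
    for start in range(len(seq)):
        if seq[start] != pattern[0]:
            continue
        i, gaps, pos = 1, 0, start + 1
        while i < len(pattern) and pos < len(seq):
            if seq[pos] == pattern[i]:
                i += 1
                gaps = 0  # reset the noise budget after each match
            else:
                gaps += 1
                if gaps > max_gap:
                    break  # too much interleaved noise at this start
            pos += 1
        if i == len(pattern):
            return True
    return False
```

For a sequence of length n, each of the n start positions contributes at most log2(n) + 1 candidates, which is where the O(n log n) learning cost in the abstract comes from; the gap-tolerant match is what makes such short patterns usable on noisy data.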

| Original language | English |
|---|---|
| Pages (from-to) | 1685-1708 |
| Number of pages | 24 |
| Journal | Data Mining and Knowledge Discovery |
| Volume | 29 |
| Issue number | 6 |
| DOIs | https://doi.org/10.1007/s10618-014-0391-9 |
| Publication status | Published - 4 Nov 2014 |
| Externally published | Yes |

### Keywords

- Approximately contiguous sequences
- Efficient learning
- Long range sequences
- Partial matching
- Sequential classifiers

### ASJC Scopus subject areas

- Information Systems
- Computer Science Applications
- Computer Networks and Communications

### Cite this

Dafé, G., Veloso, A., Zaki, M., & Meira, W. (2014). Learning sequential classifiers from long and noisy discrete-event sequences efficiently. *Data Mining and Knowledge Discovery*, *29*(6), 1685-1708. https://doi.org/10.1007/s10618-014-0391-9

Research output: Contribution to journal › Article

TY - JOUR

T1 - Learning sequential classifiers from long and noisy discrete-event sequences efficiently

AU - Dafé, Gessé

AU - Veloso, Adriano

AU - Zaki, Mohammed

AU - Meira, Wagner

PY - 2014/11/4

Y1 - 2014/11/4

AB - A variety of applications, such as information extraction, intrusion detection and protein fold recognition, can be expressed as sequences of discrete events or elements (rather than unordered sets of features); that is, there is an order dependence among the elements composing each data instance. These applications may be modeled as classification problems, and in this case the classifier should exploit sequential interactions among the elements, so that the ordering relationship among them is properly captured. Dominant approaches to this problem include: (i) learning Hidden Markov Models, (ii) exploiting frequent sequences extracted from the data and (iii) computing string kernels. Such approaches, however, are computationally hard and vulnerable to noise, especially if the data exhibits long-range dependencies (i.e., long subsequences are necessary to model the data). In this paper we provide simple algorithms that build highly effective sequential classifiers. Our algorithms are based on enumerating approximately contiguous subsequences from the training set on a demand-driven basis, exploiting a lightweight and flexible subsequence matching function and an innovative subsequence enumeration strategy called pattern silhouettes, making our learning algorithms fast and the corresponding classifiers robust to noisy data. Our empirical results on a variety of datasets indicate that the best trade-off between accuracy and learning time is usually obtained by limiting the length of the subsequences by a factor of log n, which leads to an O(n log n) learning cost (where n is the length of the sequence being classified). Finally, we show that our classifiers are usually faster than existing solutions (sometimes by orders of magnitude), while also providing significant accuracy improvements in most of the evaluated cases.

KW - Approximately contiguous sequences

KW - Efficient learning

KW - Long range sequences

KW - Partial matching

KW - Sequential classifiers

UR - http://www.scopus.com/inward/record.url?scp=84942504545&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84942504545&partnerID=8YFLogxK

U2 - 10.1007/s10618-014-0391-9

DO - 10.1007/s10618-014-0391-9

M3 - Article

VL - 29

SP - 1685

EP - 1708

JO - Data Mining and Knowledge Discovery

JF - Data Mining and Knowledge Discovery

SN - 1384-5810

IS - 6

ER -