Real-time data pre-processing technique for efficient feature extraction in large scale datasets

Ying Liu, Lucian V. Lita, R. Stefan Niculescu, Kun Bai, Prasenjit Mitra, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

4 Citations (Scopus)

Abstract

Due to the continuous and rapid growth of domain-specific data sources, there is a sustained need for fast processing in time-sensitive applications such as medical record information extraction at the point of care and genetic feature extraction for personalized treatment, as well as in off-line knowledge discovery such as creating evidence-based medicine. Since parallel multi-string matching is at the core of most data mining tasks in these applications, faster on-line matching in static and streaming data is needed to improve the overall efficiency of such knowledge discovery. To address this need, which traditional information extraction and retrieval techniques do not handle efficiently, we propose a Block Suffix Shifting-based approach that improves on state-of-the-art multi-string matching algorithms such as Aho-Corasick, Commentz-Walter, and Wu-Manber. The strength of our approach is its ability to exploit the different block structures of domain-specific data for off-line and on-line parallel matching. Experiments on several real-world datasets show that our approach yields significant performance improvements.
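
The abstract does not describe the Block Suffix Shifting algorithm itself, so the following is only a rough illustration of the general idea behind shift-based multi-string matching, in the style of the Wu-Manber baseline the abstract names. It is a simplified sketch, not the authors' method: the function names `build_tables` and `find_all`, the block size of 2, and the sample text are all assumptions made for this example.

```python
from collections import defaultdict


def build_tables(patterns, block=2):
    """Build a Wu-Manber-style shift table over character blocks.

    Only the first m characters of each pattern are indexed, where m is
    the length of the shortest pattern (hypothetical helper, for
    illustration only).
    """
    m = min(len(p) for p in patterns)
    if m < block:
        raise ValueError("shortest pattern must be at least `block` long")
    default_shift = m - block + 1   # safe jump for blocks seen in no pattern
    shift = {}
    buckets = defaultdict(list)     # block at the window end -> candidate patterns
    for p in patterns:
        for end in range(block, m + 1):
            key = p[end - block:end]
            # Distance from this block to the end of the m-character prefix.
            shift[key] = min(shift.get(key, default_shift), m - end)
        buckets[p[m - block:m]].append(p)
    return m, default_shift, shift, buckets


def find_all(text, patterns, block=2):
    """Return (start_index, pattern) for every occurrence of any pattern."""
    m, default_shift, shift, buckets = build_tables(patterns, block)
    matches = []
    pos = m                          # exclusive end of the current window
    while pos <= len(text):
        key = text[pos - block:pos]
        step = shift.get(key, default_shift)
        if step == 0:
            # The block ends some pattern prefix here: verify the candidates.
            start = pos - m
            for p in buckets.get(key, ()):
                if text.startswith(p, start):
                    matches.append((start, p))
            pos += 1
        else:
            pos += step              # skip positions that cannot match
    return matches


if __name__ == "__main__":
    record = "patient has diabetes and hypertension"
    terms = ["diabetes", "hypertension", "asthma"]
    print(find_all(record, terms))
```

The shift table lets the scan skip over stretches of text whose trailing block cannot end any pattern prefix, which is the property that block-based approaches build on. Verification is only triggered when the shift is zero.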

Original language: English
Title of host publication: International Conference on Information and Knowledge Management, Proceedings
Pages: 981-990
Number of pages: 10
DOIs: https://doi.org/10.1145/1458082.1458211
Publication status: Published - 2008
Externally published: Yes
Event: 17th ACM Conference on Information and Knowledge Management, CIKM'08 - Napa Valley, CA, United States
Duration: 26 Oct 2008 to 30 Oct 2008

Other

Other: 17th ACM Conference on Information and Knowledge Management, CIKM'08
Country: United States
City: Napa Valley, CA
Period: 26/10/08 to 30/10/08

Fingerprint

Feature extraction
Data mining
Knowledge discovery
Information extraction
Performance improvement
Information retrieval
Evidence-based medicine
Medical records
Experiment
Data sources

Keywords

  • Block suffix shift
  • Feature extraction
  • Multiple-pattern matching
  • Pre-processing

ASJC Scopus subject areas

  • Business, Management and Accounting (all)
  • Decision Sciences (all)

Cite this

Liu, Y., Lita, L. V., Niculescu, R. S., Bai, K., Mitra, P., & Giles, C. L. (2008). Real-time data pre-processing technique for efficient feature extraction in large scale datasets. In International Conference on Information and Knowledge Management, Proceedings (pp. 981-990) https://doi.org/10.1145/1458082.1458211

@inproceedings{de1a24c4ecc04a4f9dfddb9f58f80aa6,
title = "Real-time data pre-processing technique for efficient feature extraction in large scale datasets",
keywords = "Block suffix shift, Feature extraction, Multiple-pattern matching, Pre-processing",
author = "Ying Liu and Lita, {Lucian V.} and Niculescu, {R. Stefan} and Kun Bai and Prasenjit Mitra and Giles, {C. Lee}",
year = "2008",
doi = "10.1145/1458082.1458211",
language = "English",
isbn = "9781595939913",
pages = "981--990",
booktitle = "International Conference on Information and Knowledge Management, Proceedings",

}
