Finding a haystack in haystacks - Simultaneous identification of concepts in large bio-medical corpora

Ying Liu, Lucian V. Lita, R. Stefan Niculescu, Prasenjit Mitra, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Since nearly all information is now created digitally, large text databases have become more prevalent than ever. Automatically mining information from these databases proves to be a challenge due to slow pattern/string matching techniques. In this paper we introduce a new, fast multi-string pattern matching method called the Block Suffix Shifting (BSS) algorithm, which is based on the well-known Aho-Corasick algorithm. The advantages of our algorithm include the ability to exploit the natural structure of text, perform significant character shifting, avoid useless backtracking jumps, and avoid the typical "sub-string" false positive errors, together with efficient matching time. Our algorithm is applicable to many fields with free text, such as the health care domain and the scientific document field. In this paper, we apply the BSS algorithm to health care data and mine hundreds of thousands of medical concepts from a large Electronic Medical Record (EMR) corpus simultaneously and efficiently. Experimental results show the superiority of our algorithm when compared with the top-of-the-line multi-string matching algorithms (the Aho-Corasick and Wu-Manber algorithms).
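
The abstract does not spell out how BSS works, so the following is not the authors' algorithm. It is only a minimal sketch of the Aho-Corasick baseline the abstract names, with a simple word-boundary check added as one assumed way to suppress "sub-string" false positives. The function names `build_automaton` and `find_concepts` are hypothetical and chosen for illustration.

```python
from collections import deque


def build_automaton(patterns):
    """Build a classic Aho-Corasick automaton: a trie plus failure links."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link for each state
    out = [[]]    # patterns recognised on reaching each state
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)
    # Breadth-first pass: depth-1 states keep failure link 0,
    # deeper states inherit from their parent's failure chain.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_concepts(text, patterns):
    """Return (start, pattern) pairs for whole-word matches only.

    The boundary test (an assumption of this sketch, not taken from the
    paper) rejects matches that sit inside a longer alphanumeric token.
    """
    goto, fail, out = build_automaton(patterns)
    hits = []
    state = 0
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            start = i - len(pat) + 1
            before_ok = start == 0 or not text[start - 1].isalnum()
            after_ok = i + 1 == len(text) or not text[i + 1].isalnum()
            if before_ok and after_ok:
                hits.append((start, pat))
    return hits
```

Called on the text "the patient has chest pain and a painful rash" with the patterns "pain", "chest pain", "rash" and "ash", this sketch reports "chest pain", "pain" and "rash", and skips both the "pain" inside "painful" and the "ash" inside "rash".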

Original language: English
Title of host publication: Society for Industrial and Applied Mathematics - 8th SIAM International Conference on Data Mining 2008, Proceedings in Applied Mathematics 130
Pages: 668-679
Number of pages: 12
Volume: 2
Publication status: Published - 2008
Externally published: Yes
Event: 8th SIAM International Conference on Data Mining 2008, Applied Mathematics 130 - Atlanta, GA
Duration: 24 Apr 2008 to 26 Apr 2008

Other

Other: 8th SIAM International Conference on Data Mining 2008, Applied Mathematics 130
City: Atlanta, GA
Period: 24/4/08 to 26/4/08

Fingerprint

String Matching
Suffix
Health care
Healthcare
Strings
String searching algorithms
String Algorithms
Electronic medical equipment
Pattern matching
Backtracking
Pattern Matching
Matching Algorithm
Corpus
Concepts
False Positive
Mining
Jump
Electronics
Line
Experimental Results

ASJC Scopus subject areas

  • Information Systems
  • Software
  • Signal Processing
  • Theoretical Computer Science

Cite this

Liu, Y., Lita, L. V., Niculescu, R. S., Mitra, P., & Giles, C. L. (2008). Finding a haystack in haystacks - Simultaneous identification of concepts in large bio-medical corpora. In Society for Industrial and Applied Mathematics - 8th SIAM International Conference on Data Mining 2008, Proceedings in Applied Mathematics 130 (Vol. 2, pp. 668-679)

@inproceedings{ebe7226c9f1d4bf19c844ec9f670a4bc,
title = "Finding a haystack in haystacks - Simultaneous identification of concepts in large bio-medical corpora",
abstract = "Since nearly all information is now created digitally, large text databases have become more prevalent than ever. Automatically mining information from these databases proves to be a challenge due to slow pattern/string matching techniques. In this paper we introduce a new, fast multi-string pattern matching method called the Block Suffix Shifting (BSS) algorithm, which is based on the well-known Aho-Corasick algorithm. The advantages of our algorithm include the ability to exploit the natural structure of text, perform significant character shifting, avoid useless backtracking jumps, and avoid the typical {"}sub-string{"} false positive errors, together with efficient matching time. Our algorithm is applicable to many fields with free text, such as the health care domain and the scientific document field. In this paper, we apply the BSS algorithm to health care data and mine hundreds of thousands of medical concepts from a large Electronic Medical Record (EMR) corpus simultaneously and efficiently. Experimental results show the superiority of our algorithm when compared with the top-of-the-line multi-string matching algorithms (the Aho-Corasick and Wu-Manber algorithms).",
author = "Ying Liu and Lita, {Lucian V.} and Niculescu, {R. Stefan} and Prasenjit Mitra and Giles, {C. Lee}",
year = "2008",
language = "English",
isbn = "9781605603179",
volume = "2",
pages = "668--679",
booktitle = "Society for Industrial and Applied Mathematics - 8th SIAM International Conference on Data Mining 2008, Proceedings in Applied Mathematics 130",

}

TY - GEN

T1 - Finding a haystack in haystacks - Simultaneous identification of concepts in large bio-medical corpora

AU - Liu, Ying

AU - Lita, Lucian V.

AU - Niculescu, R. Stefan

AU - Mitra, Prasenjit

AU - Giles, C. Lee

PY - 2008

Y1 - 2008

N2 - Since nearly all information is now created digitally, large text databases have become more prevalent than ever. Automatically mining information from these databases proves to be a challenge due to slow pattern/string matching techniques. In this paper we introduce a new, fast multi-string pattern matching method called the Block Suffix Shifting (BSS) algorithm, which is based on the well-known Aho-Corasick algorithm. The advantages of our algorithm include the ability to exploit the natural structure of text, perform significant character shifting, avoid useless backtracking jumps, and avoid the typical "sub-string" false positive errors, together with efficient matching time. Our algorithm is applicable to many fields with free text, such as the health care domain and the scientific document field. In this paper, we apply the BSS algorithm to health care data and mine hundreds of thousands of medical concepts from a large Electronic Medical Record (EMR) corpus simultaneously and efficiently. Experimental results show the superiority of our algorithm when compared with the top-of-the-line multi-string matching algorithms (the Aho-Corasick and Wu-Manber algorithms).

UR - http://www.scopus.com/inward/record.url?scp=52649134400&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=52649134400&partnerID=8YFLogxK

M3 - Conference contribution

SN - 9781605603179

VL - 2

SP - 668

EP - 679

BT - Society for Industrial and Applied Mathematics - 8th SIAM International Conference on Data Mining 2008, Proceedings in Applied Mathematics 130

ER -