Identifying, indexing, and ranking chemical formulae and chemical names in digital documents

Bingjun Sun, Prasenjit Mitra, C. Lee Giles, Karl T. Mueller

Research output: Contribution to journal (Article)

7 Citations (Scopus)

Abstract

End-users utilize chemical search engines to search for chemical formulae and chemical names. Chemical search engines identify and index chemical formulae and chemical names appearing in text documents to support efficient search and retrieval in the future. Identifying chemical formulae and chemical names in text automatically has been a hard problem that has met with varying degrees of success in the past. We propose algorithms for chemical formula and chemical name tagging using Conditional Random Fields (CRFs) and Support Vector Machines (SVMs) that achieve higher accuracy than existing (published) methods. After chemical entities have been identified in text documents, they must be indexed. In order to support user-provided search queries that require a partial match between the chemical name segment used as a keyword or a partial chemical formula, all possible (or a significant number of) subformulae of formulae that appear in any document and all possible subterms (e.g., "methyl") of chemical names (e.g., "methylethyl ketone") must be indexed. Indexing all possible subformulae and subterms results in an exponential increase in the storage and memory requirements as well as the time taken to process the indices. We propose techniques to prune the indices significantly without reducing the quality of the returned results significantly. Finally, we propose multiple query semantics to allow users to pose different types of partial search queries for chemical entities. We demonstrate empirically that our search engines improve the relevance of the returned results for search queries involving chemical entities.
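
To make the index-growth claim in the abstract concrete, the sketch below enumerates candidate subformulae of a single formula. It is an illustration only, not the authors' algorithm: it assumes a flat formula without parentheses, and it assumes a "subformula" is any non-empty choice of per-element counts no larger than those in the original formula. The names parse_formula, subformulae, and count_subformulae are hypothetical.

```python
import re
from itertools import product
from math import prod
from typing import Dict, List

# Matches an element symbol followed by an optional count, e.g. "C2" or "O".
FORMULA_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(formula: str) -> Dict[str, int]:
    """Parse a flat formula such as 'C2H6O' into {'C': 2, 'H': 6, 'O': 1}."""
    counts: Dict[str, int] = {}
    for element, digits in FORMULA_TOKEN.findall(formula):
        counts[element] = counts.get(element, 0) + (int(digits) if digits else 1)
    return counts


def subformulae(formula: str) -> List[Dict[str, int]]:
    """Enumerate every non-empty subformula under the assumption stated above."""
    counts = parse_formula(formula)
    elements = list(counts)
    # For each element, any count from 0 up to its count in the full formula.
    ranges = [range(counts[e] + 1) for e in elements]
    result = []
    for combo in product(*ranges):
        if any(combo):  # skip the empty selection (all counts zero)
            result.append({e: n for e, n in zip(elements, combo) if n})
    return result


def count_subformulae(formula: str) -> int:
    """Closed-form count: product of (count + 1) over elements, minus the empty one."""
    return prod(n + 1 for n in parse_formula(formula).values()) - 1


if __name__ == "__main__":
    for f in ("H2O", "C2H6O", "C6H12O6"):
        print(f, count_subformulae(f))
```

Under this assumption the number of index entries per formula is the product of (count + 1) over its elements, minus one, so it multiplies with every additional element. That multiplicative growth is the kind of blow-up that motivates pruning the index rather than storing every candidate.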

Original language: English
Article number: 12
Journal: ACM Transactions on Information Systems
Volume: 29
Issue number: 2
DOI: 10.1145/1961209.1961215
Publication status: Published - Apr 2011
Externally published: Yes

Fingerprint

Search engines
Indexing
Ranking
Ketones
Support vector machines
Semantics
Data storage equipment
Query
Search engine

Keywords

  • Chemical formula
  • Chemical name
  • Conditional Random Fields
  • Entity extraction
  • Hierarchical text segmentation
  • Independent frequent subsequence
  • Index pruning
  • Query models
  • Ranking
  • Similarity search
  • Support Vector Machines

ASJC Scopus subject areas

  • Information Systems
  • Business, Management and Accounting (all)
  • Computer Science Applications

Cite this

Identifying, indexing, and ranking chemical formulae and chemical names in digital documents. / Sun, Bingjun; Mitra, Prasenjit; Giles, C. Lee; Mueller, Karl T.

In: ACM Transactions on Information Systems, Vol. 29, No. 2, 12, 04.2011.

Sun, Bingjun ; Mitra, Prasenjit ; Giles, C. Lee ; Mueller, Karl T. / Identifying, indexing, and ranking chemical formulae and chemical names in digital documents. In: ACM Transactions on Information Systems. 2011 ; Vol. 29, No. 2.
@article{f9e5b5956fc546a6be8b0464c67139e5,
title = "Identifying, indexing, and ranking chemical formulae and chemical names in digital documents",
keywords = "Chemical formula, Chemical name, Conditional Random Fields, Entity extraction, Hierarchical text segmentation, Independent frequent subsequence, Index pruning, Query models, Ranking, Similarity search, Support Vector Machines",
author = "Bingjun Sun and Prasenjit Mitra and Giles, {C. Lee} and Mueller, {Karl T.}",
year = "2011",
month = "4",
doi = "10.1145/1961209.1961215",
language = "English",
volume = "29",
journal = "ACM Transactions on Information Systems",
issn = "1046-8188",
publisher = "Association for Computing Machinery (ACM)",
number = "2",

}

TY - JOUR

T1 - Identifying, indexing, and ranking chemical formulae and chemical names in digital documents

AU - Sun, Bingjun

AU - Mitra, Prasenjit

AU - Giles, C. Lee

AU - Mueller, Karl T.

PY - 2011/4

Y1 - 2011/4

KW - Chemical formula

KW - Chemical name

KW - Conditional Random Fields

KW - Entity extraction

KW - Hierarchical text segmentation

KW - Independent frequent subsequence

KW - Index pruning

KW - Query models

KW - Ranking

KW - Similarity search

KW - Support Vector Machines

UR - http://www.scopus.com/inward/record.url?scp=80051508034&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=80051508034&partnerID=8YFLogxK

U2 - 10.1145/1961209.1961215

DO - 10.1145/1961209.1961215

M3 - Article

AN - SCOPUS:80051508034

VL - 29

JO - ACM Transactions on Information Systems

JF - ACM Transactions on Information Systems

SN - 1046-8188

IS - 2

M1 - 12

ER -