Mining, indexing, and searching for textual chemical molecule information on the web

Bingjun Sun, Prasenjit Mitra, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

19 Citations (Scopus)

Abstract

Current search engines do not support user searches for chemical entities (chemical names and formulae) beyond simple keyword searches. A chemical molecule can usually be represented in multiple textual ways, but a simple keyword search retrieves only the exact match and not the others. We show how to build a search engine that enables searches for chemical entities and demonstrate empirically that it improves the relevance of returned documents. Our search engine first extracts chemical entities from text, performs novel indexing suitable for chemical names and formulae, and supports different query models that a scientist may require. We propose a model of hierarchical conditional random fields for chemical formula tagging that considers long-term dependencies at the sentence level. To support substring searches of chemical names, a search engine must index substrings of chemical names. Indexing all possible subsequences is not feasible in practice. We propose an algorithm for independent frequent subsequence mining to discover sub-terms of chemical names with their probabilities. We then propose an unsupervised hierarchical text segmentation (HTS) method to represent a sequence with a tree structure based on discovered independent frequent subsequences, so that the sub-terms on the HTS tree can be indexed. Query models with corresponding ranking functions are introduced for chemical name searches. Experiments show that our approaches to chemical entity tagging perform well. Furthermore, we show that index pruning can reduce the index size and query time without changing the returned ranked results significantly. Finally, experiments show that our approaches outperform traditional methods for document search with ambiguous chemical terms.
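To make the substring-indexing idea in the abstract concrete, here is a minimal sketch that builds an inverted index over substrings shared by several chemical names. It is a naive stand-in and not the paper's independent frequent subsequence mining or HTS method. The function name `frequent_substring_index`, the parameters `min_len` and `min_support`, and the sample names are all assumptions made for illustration.

```python
from collections import defaultdict


def frequent_substring_index(names, min_len=3, min_support=2):
    """Map each substring found in at least `min_support` names to those names.

    Naive illustration only: it enumerates every substring of every name,
    which is exactly the cost the abstract calls infeasible at scale.
    """
    postings = defaultdict(set)
    for name in names:
        n = len(name)
        for i in range(n):
            # Substrings starting at i with length >= min_len.
            for j in range(i + min_len, n + 1):
                postings[name[i:j]].add(name)
    # Keep only substrings shared by enough distinct names.
    return {sub: docs for sub, docs in postings.items() if len(docs) >= min_support}


# Hypothetical sample vocabulary.
names = ["methylbenzene", "ethylbenzene", "methanol", "ethanol"]
index = frequent_substring_index(names)

print(sorted(index["benzene"]))  # names containing the sub-term "benzene"
print(sorted(index["ethyl"]))    # also matches "ethyl" inside "methyl"
```

A query for "benzene" now finds both benzene derivatives, which an exact keyword match would miss. The sketch also shows a weakness of the naive approach: "ethyl" is matched inside "methylbenzene", a chemically misleading sub-term, and the index grows quadratically with name length. Problems of this kind are presumably part of what a more selective choice of sub-terms has to address.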

Original language: English
Title of host publication: Proceeding of the 17th International Conference on World Wide Web 2008, WWW'08
Pages: 735-744
Number of pages: 10
DOIs: https://doi.org/10.1145/1367497.1367597
Publication status: Published - 2008
Externally published: Yes
Event: 17th International Conference on World Wide Web 2008, WWW'08 - Beijing
Duration: 21 Apr 2008 - 25 Apr 2008

Other

Other: 17th International Conference on World Wide Web 2008, WWW'08
City: Beijing
Period: 21/4/08 - 25/4/08

Keywords

  • Conditional random fields
  • Entity extraction
  • Hierarchical text segmentation
  • Independent frequent subsequence
  • Index pruning
  • Ranking
  • Similarity search
  • Substring search

ASJC Scopus subject areas

  • Computer Networks and Communications

Cite this

Sun, B., Mitra, P., & Giles, C. L. (2008). Mining, indexing, and searching for textual chemical molecule information on the web. In Proceeding of the 17th International Conference on World Wide Web 2008, WWW'08 (pp. 735-744) https://doi.org/10.1145/1367497.1367597