Mining, indexing, and searching for textual chemical molecule information on the web

Bingjun Sun, Prasenjit Mitra, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

19 Citations (Scopus)

Abstract

Current search engines do not support user searches for chemical entities (chemical names and formulae) beyond simple keyword searches. A chemical molecule can usually be represented in multiple textual ways, and a simple keyword search retrieves only exact matches of the query string, missing the other representations. We show how to build a search engine that enables searches for chemical entities and demonstrate empirically that it improves the relevance of returned documents. Our search engine first extracts chemical entities from text, performs novel indexing suitable for chemical names and formulae, and supports different query models that a scientist may require. We propose a model of hierarchical conditional random fields for chemical formula tagging that considers long-term dependencies at the sentence level. To support substring searches of chemical names, a search engine must index substrings of chemical names. Indexing all possible sub-sequences is not feasible in practice. We propose an algorithm for independent frequent subsequence mining to discover sub-terms of chemical names with their probabilities. We then propose an unsupervised hierarchical text segmentation (HTS) method to represent a sequence with a tree structure based on the discovered independent frequent subsequences, so that the sub-terms to index are those on the HTS tree. Query models with corresponding ranking functions are introduced for chemical name searches. Experiments show that our approaches to chemical entity tagging perform well. Furthermore, we show that index pruning can reduce the index size and query time without changing the returned ranked results significantly. Finally, experiments show that our approaches outperform traditional methods for document search with ambiguous chemical terms.
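The indexing trade-off described in the abstract can be illustrated with a minimal sketch. It assumes plain frequency-thresholded substring mining, which is a simplification and not the paper's independent frequent subsequence algorithm or its HTS method. The function names, the `min_support` and `min_len` parameters, and the toy list of chemical names are all hypothetical.

```python
from collections import Counter, defaultdict


def mine_frequent_substrings(names, min_support=2, min_len=3):
    """Return substrings of length >= min_len found in at least min_support names.

    Simplified stand-in for sub-term discovery (assumption: raw support counts,
    no independence test and no probabilities).
    """
    support = Counter()
    for name in names:
        seen = set()
        for i in range(len(name)):
            for j in range(i + min_len, len(name) + 1):
                seen.add(name[i:j])
        support.update(seen)  # each substring counts once per name
    return {s for s, count in support.items() if count >= min_support}


def build_index(names, frequent):
    """Build an inverted index: sub-term -> set of name ids.

    Full names are always indexed; substrings only if they are frequent.
    """
    index = defaultdict(set)
    for name_id, name in enumerate(names):
        index[name].add(name_id)
        for i in range(len(name)):
            for j in range(i + 1, len(name) + 1):
                sub = name[i:j]
                if sub in frequent:
                    index[sub].add(name_id)
    return index


def search(index, query):
    """Exact lookup of a query sub-term in the pruned substring index."""
    return sorted(index.get(query, set()))


# Hypothetical toy corpus of chemical names.
names = ["methylbenzene", "ethylbenzene", "methanol", "ethanol"]
frequent = mine_frequent_substrings(names, min_support=2, min_len=4)
index = build_index(names, frequent)

print(search(index, "benzene"))  # names containing the shared sub-term
print(search(index, "methylb"))  # rare substring: not indexed, so no hits
```

In this toy corpus the query "ethanol" also matches "methanol", because "ethanol" is a literal substring of it. Raw substring frequency cannot tell such accidental overlaps from meaningful sub-terms, which is the kind of ambiguity the independence criterion and segmentation tree in the abstract are meant to address.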

Original language: English
Title of host publication: Proceeding of the 17th International Conference on World Wide Web 2008, WWW'08
Pages: 735-744
Number of pages: 10
DOIs: https://doi.org/10.1145/1367497.1367597
Publication status: Published - 2008
Externally published: Yes
Event: 17th International Conference on World Wide Web 2008, WWW'08 - Beijing
Duration: 21 Apr 2008 - 25 Apr 2008

Fingerprint

World Wide Web
Molecules
Search engines
Experiments

Keywords

  • Conditional random fields
  • Entity extraction
  • Hierarchical text segmentation
  • Independent frequent subsequence
  • Index pruning
  • Ranking
  • Similarity search
  • Substring search

ASJC Scopus subject areas

  • Computer Networks and Communications

Cite this

Sun, B., Mitra, P., & Giles, C. L. (2008). Mining, indexing, and searching for textual chemical molecule information on the web. In Proceeding of the 17th International Conference on World Wide Web 2008, WWW'08 (pp. 735-744). https://doi.org/10.1145/1367497.1367597

@inproceedings{1c60dcc94e7c43e0899117fa05bef688,
title = "Mining, indexing, and searching for textual chemical molecule information on the web",
keywords = "Conditional random fields, Entity extraction, Hierarchical text segmentation, Independent frequent subsequence, Index pruning, Ranking, Similarity search, Substring search",
author = "Bingjun Sun and Prasenjit Mitra and Giles, {C. Lee}",
year = "2008",
doi = "10.1145/1367497.1367597",
language = "English",
isbn = "9781605580852",
pages = "735--744",
booktitle = "Proceeding of the 17th International Conference on World Wide Web 2008, WWW'08",

}

TY - GEN

T1 - Mining, indexing, and searching for textual chemical molecule information on the web

AU - Sun, Bingjun

AU - Mitra, Prasenjit

AU - Giles, C. Lee

PY - 2008

Y1 - 2008

KW - Conditional random fields

KW - Entity extraction

KW - Hierarchical text segmentation

KW - Independent frequent subsequence

KW - Index pruning

KW - Ranking

KW - Similarity search

KW - Substring search

UR - http://www.scopus.com/inward/record.url?scp=57349100781&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=57349100781&partnerID=8YFLogxK

U2 - 10.1145/1367497.1367597

DO - 10.1145/1367497.1367597

M3 - Conference contribution

AN - SCOPUS:57349100781

SN - 9781605580852

SP - 735

EP - 744

BT - Proceeding of the 17th International Conference on World Wide Web 2008, WWW'08

ER -