Extraction and search of chemical formulae in text documents on the web

Bingjun Sun, Qingzhao Tan, Prasenjit Mitra, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

23 Citations (Scopus)

Abstract

Scientists often search the Web for articles related to a particular chemical. When a scientist searches for a chemical formula using a search engine today, she gets articles where the exact keyword string expressing the chemical formula is found. Exact keyword matching causes two problems in this domain: a) if the scientist searches for CH4 and the article has H4C, the article is not returned, and b) ambiguous searches like "He" return all documents where helium is mentioned as well as documents where the pronoun "he" occurs. To remedy these deficiencies, we propose a chemical formula search engine. Building such an engine requires solving the following problems: 1) extracting chemical formulae from text documents, 2) indexing chemical formulae, and 3) designing ranking functions for the chemical formulae. Furthermore, query models are introduced for formula search, and for each a scoring scheme based on features of partial formulae is proposed to measure the relevance of chemical formulae to queries. We evaluate algorithms for identifying chemical formulae in documents using classification methods based on Support Vector Machines (SVM) and a probabilistic model based on conditional random fields (CRF). Different methods for SVM and CRF to tune the trade-off between recall and precision for imbalanced data are proposed to improve the overall performance. A feature selection method based on frequency and discrimination is used to remove uninformative and redundant features. Experiments show that our approaches to chemical formula extraction work well, especially after trade-off tuning. The results also demonstrate that feature selection can reduce the index size without changing ranked query results much.
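The CH4 versus H4C mismatch described in the abstract can be illustrated with a minimal Python sketch. This is not the indexing or ranking scheme from the paper; it only assumes that a flat formula (no parentheses, charges, or isotopes) can be reduced to element counts and compared through an order-independent key. The function names `parse_formula` and `canonical_key` are hypothetical, and the parser does not check symbols against the periodic table.

```python
import re
from collections import Counter

# One element token: a capital letter, an optional lowercase letter,
# then an optional count (a missing count means 1).
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(formula: str) -> Counter:
    """Return element counts for a flat formula such as 'CH4'.

    Illustrative assumption: no parentheses, charges, or isotopes.
    Raises ValueError if any part of the string is not an element token.
    """
    counts = Counter()
    pos = 0
    for match in _TOKEN.finditer(formula):
        if match.start() != pos:
            # finditer silently skips unmatched text, so check contiguity.
            raise ValueError(f"not a flat chemical formula: {formula!r}")
        symbol, digits = match.groups()
        counts[symbol] += int(digits) if digits else 1
        pos = match.end()
    if pos != len(formula) or not counts:
        raise ValueError(f"not a flat chemical formula: {formula!r}")
    return counts


def canonical_key(formula: str) -> str:
    """Build an order-independent key by sorting elements alphabetically."""
    counts = parse_formula(formula)
    return "".join(f"{element}{n}" for element, n in sorted(counts.items()))


if __name__ == "__main__":
    print(canonical_key("CH4"), canonical_key("H4C"))
```

With such a key, CH4 and H4C map to the same index entry, and a lowercase "he" is rejected by the case-sensitive pattern. A capitalized pronoun "He" at the start of a sentence would still parse as helium, which is the kind of ambiguity the SVM and CRF extraction step in the abstract is meant to resolve from context.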

Original language: English
Title of host publication: 16th International World Wide Web Conference, WWW2007
Pages: 251-260
Number of pages: 10
DOIs: https://doi.org/10.1145/1242572.1242607
Publication status: Published - 22 Oct 2007
Event: 16th International World Wide Web Conference, WWW2007 - Banff, AB, Canada
Duration: 8 May 2007 to 12 May 2007

Keywords

  • Chemical formula
  • Conditional random fields
  • Entity extraction
  • Feature boosting
  • Feature selection
  • Query models
  • Ranking
  • Similarity search
  • Support vector machines

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Software

Cite this

Sun, B., Tan, Q., Mitra, P., & Giles, C. L. (2007). Extraction and search of chemical formulae in text documents on the web. In 16th International World Wide Web Conference, WWW2007 (pp. 251-260). (16th International World Wide Web Conference, WWW2007). https://doi.org/10.1145/1242572.1242607