Extraction and search of chemical formulae in text documents on the web

Bingjun Sun, Qingzhao Tan, Prasenjit Mitra, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

23 Citations (Scopus)

Abstract

Often scientists seek to search for articles on the Web related to a particular chemical. When a scientist searches for a chemical formula using a search engine today, she gets articles where the exact keyword string expressing the chemical formula is found. Searching for the exact occurrence of keywords during searching results in two problems for this domain: a) if the author searches for CH4 and the article has H4C, the article is not returned, and b) ambiguous searches like "He" return all documents where Helium is mentioned as well as documents where the pronoun "he" occurs. To remedy these deficiencies, we propose a chemical formula search engine. To build a chemical formula search engine, we must solve the following problems: 1) extract chemical formulae from text documents, 2) index chemical formulae, and 3) design ranking functions for the chemical formulae. Furthermore, query models are introduced for formula search, and for each a scoring scheme based on features of partial formulae is proposed to measure the relevance of chemical formulae and queries. We evaluate algorithms for identifying chemical formulae in documents using classification methods based on Support Vector Machines (SVM), and a probabilistic model based on conditional random fields (CRF). Different methods for SVM and CRF to tune the trade-off between recall and precision for imbalanced data are proposed to improve the overall performance. A feature selection method based on frequency and discrimination is used to remove uninformative and redundant features. Experiments show that our approaches to chemical formula extraction work well, especially after trade-off tuning. The results also demonstrate that feature selection can reduce the index size without changing ranked query results much.
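The two exact-match problems named in the abstract can be illustrated with a minimal Python sketch. This is not the extraction, indexing, or ranking method of the paper; it only assumes that a formula is a flat sequence of element symbols with optional counts, and the helper names `parse_formula` and `canonical` are hypothetical. It shows why CH4 and H4C should map to the same key and why case matters for "He" versus "he".

```python
import re
from collections import Counter

# Assumed token shape: one uppercase letter, an optional lowercase letter,
# then an optional count. Symbols are NOT checked against the periodic table.
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(text):
    """Return element counts for a flat formula such as 'CH4', else None."""
    counts = Counter()
    pos = 0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            return None  # e.g. lowercase 'he' is rejected here
        symbol, number = match.group(1), match.group(2)
        counts[symbol] += int(number or 1)
        pos = match.end()
    return counts if counts else None


def canonical(text):
    """Order-independent key: symbols sorted alphabetically, counts of 1 omitted."""
    counts = parse_formula(text)
    if counts is None:
        return None
    return "".join(f"{el}{n if n > 1 else ''}" for el, n in sorted(counts.items()))


print(canonical("CH4"), canonical("H4C"), canonical("he"))
```

A capitalized pronoun "He" at the start of a sentence would still parse as helium in this sketch, which is the kind of ambiguity the abstract assigns to the SVM and CRF classifiers rather than to pattern matching.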

Original language: English
Title of host publication: 16th International World Wide Web Conference, WWW2007
Pages: 251-260
Number of pages: 10
DOIs: https://doi.org/10.1145/1242572.1242607
Publication status: Published - 2007
Externally published: Yes
Event: 16th International World Wide Web Conference, WWW2007 - Banff, AB
Duration: 8 May 2007 to 12 May 2007

Fingerprint

Search engines
Support vector machines
Feature extraction
Helium
Tuning
Experiments

Keywords

  • Chemical formula
  • Conditional random fields
  • Entity extraction
  • Feature boosting
  • Feature selection
  • Query models
  • Ranking
  • Similarity search
  • Support vector machines

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Software

Cite this

Sun, B., Tan, Q., Mitra, P., & Giles, C. L. (2007). Extraction and search of chemical formulae in text documents on the web. In 16th International World Wide Web Conference, WWW2007 (pp. 251-260) https://doi.org/10.1145/1242572.1242607

@inproceedings{932c62b42da8402f935b89d003a11e5e,
title = "Extraction and search of chemical formulae in text documents on the web",
abstract = "Often scientists seek to search for articles on the Web related to a particular chemical. When a scientist searches for a chemical formula using a search engine today, she gets articles where the exact keyword string expressing the chemical formula is found. Searching for the exact occurrence of keywords during searching results in two problems for this domain: a) if the author searches for CH4 and the article has H4C, the article is not returned, and b) ambiguous searches like {"}He{"} return all documents where Helium is mentioned as well as documents where the pronoun {"}he{"} occurs. To remedy these deficiencies, we propose a chemical formula search engine. To build a chemical formula search engine, we must solve the following problems: 1) extract chemical formulae from text documents, 2) index chemical formulae, and 3) design ranking functions for the chemical formulae. Furthermore, query models are introduced for formula search, and for each a scoring scheme based on features of partial formulae is proposed to measure the relevance of chemical formulae and queries. We evaluate algorithms for identifying chemical formulae in documents using classification methods based on Support Vector Machines (SVM), and a probabilistic model based on conditional random fields (CRF). Different methods for SVM and CRF to tune the trade-off between recall and precision for imbalanced data are proposed to improve the overall performance. A feature selection method based on frequency and discrimination is used to remove uninformative and redundant features. Experiments show that our approaches to chemical formula extraction work well, especially after trade-off tuning. The results also demonstrate that feature selection can reduce the index size without changing ranked query results much.",
keywords = "Chemical formula, Conditional random fields, Entity extraction, Feature boosting, Feature selection, Query models, Ranking, Similarity search, Support vector machines",
author = "Bingjun Sun and Qingzhao Tan and Prasenjit Mitra and Giles, {C. Lee}",
year = "2007",
doi = "10.1145/1242572.1242607",
language = "English",
isbn = "1595936548",
pages = "251--260",
booktitle = "16th International World Wide Web Conference, WWW2007",

}

TY - GEN

T1 - Extraction and search of chemical formulae in text documents on the web

AU - Sun, Bingjun

AU - Tan, Qingzhao

AU - Mitra, Prasenjit

AU - Giles, C. Lee

PY - 2007

Y1 - 2007

AB - Often scientists seek to search for articles on the Web related to a particular chemical. When a scientist searches for a chemical formula using a search engine today, she gets articles where the exact keyword string expressing the chemical formula is found. Searching for the exact occurrence of keywords during searching results in two problems for this domain: a) if the author searches for CH4 and the article has H4C, the article is not returned, and b) ambiguous searches like "He" return all documents where Helium is mentioned as well as documents where the pronoun "he" occurs. To remedy these deficiencies, we propose a chemical formula search engine. To build a chemical formula search engine, we must solve the following problems: 1) extract chemical formulae from text documents, 2) index chemical formulae, and 3) design ranking functions for the chemical formulae. Furthermore, query models are introduced for formula search, and for each a scoring scheme based on features of partial formulae is proposed to measure the relevance of chemical formulae and queries. We evaluate algorithms for identifying chemical formulae in documents using classification methods based on Support Vector Machines (SVM), and a probabilistic model based on conditional random fields (CRF). Different methods for SVM and CRF to tune the trade-off between recall and precision for imbalanced data are proposed to improve the overall performance. A feature selection method based on frequency and discrimination is used to remove uninformative and redundant features. Experiments show that our approaches to chemical formula extraction work well, especially after trade-off tuning. The results also demonstrate that feature selection can reduce the index size without changing ranked query results much.

KW - Chemical formula

KW - Conditional random fields

KW - Entity extraction

KW - Feature boosting

KW - Feature selection

KW - Query models

KW - Ranking

KW - Similarity search

KW - Support vector machines

UR - http://www.scopus.com/inward/record.url?scp=35348913835&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=35348913835&partnerID=8YFLogxK

U2 - 10.1145/1242572.1242607

DO - 10.1145/1242572.1242607

M3 - Conference contribution

AN - SCOPUS:35348913835

SN - 1595936548

SN - 9781595936547

SP - 251

EP - 260

BT - 16th International World Wide Web Conference, WWW2007

ER -