Searching online book documents and analyzing book citations

Zhaohui Wu, Sujatha Das, Zhenhui Li, Prasenjit Mitra, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

6 Citations (Scopus)

Abstract

Academic search engines and digital libraries provide convenient online search and access facilities for scientific publications. However, most existing systems do not include books in their collections although several books are freely available online. Academic books are different from papers in terms of their length, contents and structure. We argue that accounting for academic books is important in understanding and assessing scientific impact. We introduce an open-book search engine that extracts and indexes metadata, contents, and bibliography from online PDF book documents. To the best of our knowledge, no previous work gives a systematic study of building a search engine for books. We propose a hybrid approach for extracting title and authors from a book that combines results from CiteSeer, a rule-based extractor, and an SVM-based extractor, leveraging web knowledge. For "table of contents" recognition, we propose rules that exploit multiple regularities in numbering and ordering. In addition, we study bibliography extraction and citation parsing for a large dataset of books. Finally, we use the multiple fields available in books to rank books in response to search queries. Our system can effectively extract metadata and contents from large collections of online books and provides efficient book search and retrieval facilities.

Original language: English
Title of host publication: DocEng 2013 - Proceedings of the 2013 ACM Symposium on Document Engineering
Publisher: Association for Computing Machinery
Pages: 81-90
Number of pages: 10
ISBN (Print): 9781450317894
DOI: https://doi.org/10.1145/2494266.2494282
Publication status: Published - 2013
Externally published: Yes
Event: 2013 ACM Symposium on Document Engineering, DocEng 2013 - Florence
Duration: 10 Sep 2013 to 13 Sep 2013

Other

Other: 2013 ACM Symposium on Document Engineering, DocEng 2013
City: Florence
Period: 10/9/13 to 13/9/13

Fingerprint

Online searching
Search engines
Bibliographies
Metadata
Digital libraries

Keywords

  • book citation analysis
  • book search
  • book structure extraction

ASJC Scopus subject areas

  • Software

Cite this

Wu, Z., Das, S., Li, Z., Mitra, P., & Giles, C. L. (2013). Searching online book documents and analyzing book citations. In DocEng 2013 - Proceedings of the 2013 ACM Symposium on Document Engineering (pp. 81-90). Association for Computing Machinery. https://doi.org/10.1145/2494266.2494282

@inproceedings{cc5f97e44b5a414388df548459a3452e,
title = "Searching online book documents and analyzing book citations",
abstract = "Academic search engines and digital libraries provide convenient online search and access facilities for scientific publications. However, most existing systems do not include books in their collections although several books are freely available online. Academic books are different from papers in terms of their length, contents and structure. We argue that accounting for academic books is important in understanding and assessing scientific impact. We introduce an open-book search engine that extracts and indexes metadata, contents, and bibliography from online PDF book documents. To the best of our knowledge, no previous work gives a systematic study of building a search engine for books. We propose a hybrid approach for extracting title and authors from a book that combines results from CiteSeer, a rule-based extractor, and an SVM-based extractor, leveraging web knowledge. For {"}table of contents{"} recognition, we propose rules that exploit multiple regularities in numbering and ordering. In addition, we study bibliography extraction and citation parsing for a large dataset of books. Finally, we use the multiple fields available in books to rank books in response to search queries. Our system can effectively extract metadata and contents from large collections of online books and provides efficient book search and retrieval facilities.",
keywords = "book citation analysis, book search, book structure extraction",
author = "Zhaohui Wu and Sujatha Das and Zhenhui Li and Prasenjit Mitra and Giles, {C. Lee}",
year = "2013",
doi = "10.1145/2494266.2494282",
language = "English",
isbn = "9781450317894",
pages = "81--90",
booktitle = "DocEng 2013 - Proceedings of the 2013 ACM Symposium on Document Engineering",
publisher = "Association for Computing Machinery",
}

TY  - GEN
T1  - Searching online book documents and analyzing book citations
AU  - Wu, Zhaohui
AU  - Das, Sujatha
AU  - Li, Zhenhui
AU  - Mitra, Prasenjit
AU  - Giles, C. Lee
PY  - 2013
Y1  - 2013
AB  - Academic search engines and digital libraries provide convenient online search and access facilities for scientific publications. However, most existing systems do not include books in their collections although several books are freely available online. Academic books are different from papers in terms of their length, contents and structure. We argue that accounting for academic books is important in understanding and assessing scientific impact. We introduce an open-book search engine that extracts and indexes metadata, contents, and bibliography from online PDF book documents. To the best of our knowledge, no previous work gives a systematic study of building a search engine for books. We propose a hybrid approach for extracting title and authors from a book that combines results from CiteSeer, a rule-based extractor, and an SVM-based extractor, leveraging web knowledge. For "table of contents" recognition, we propose rules that exploit multiple regularities in numbering and ordering. In addition, we study bibliography extraction and citation parsing for a large dataset of books. Finally, we use the multiple fields available in books to rank books in response to search queries. Our system can effectively extract metadata and contents from large collections of online books and provides efficient book search and retrieval facilities.
KW  - book citation analysis
KW  - book search
KW  - book structure extraction
UR  - http://www.scopus.com/inward/record.url?scp=84887328647&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=84887328647&partnerID=8YFLogxK
U2  - 10.1145/2494266.2494282
DO  - 10.1145/2494266.2494282
M3  - Conference contribution
SN  - 9781450317894
SP  - 81
EP  - 90
BT  - DocEng 2013 - Proceedings of the 2013 ACM Symposium on Document Engineering
PB  - Association for Computing Machinery
ER  -