Table of contents recognition and extraction for heterogeneous book documents

Zhaohui Wu, Prasenjit Mitra, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

11 Citations (Scopus)

Abstract

Existing work on book table of contents (TOC) recognition has relied almost entirely on small, application-dependent, domain-specific datasets. However, the TOCs of books from different domains differ significantly in visual layout and style, which makes TOC recognition challenging for a large-scale collection of heterogeneous books. We observe that TOCs fall into three basic styles, namely "flat", "ordered", and "divided", and this observation gives insight into how to parse TOCs effectively. We therefore propose a new TOC recognition approach that adaptively selects the most appropriate TOC parsing rules based on which of these three styles a TOC is classified as. An evaluation on more than 25,000 book documents from various domains demonstrates the approach's effectiveness and efficiency.
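The abstract describes a two-stage idea: classify a TOC's style first, then apply parsing rules suited to that style. The sketch below only illustrates that control flow. This record does not give the paper's features, style definitions, or rules, so everything here is an assumption made for illustration: the regular expressions, the 0.5 threshold, the level assignments, and the function names `classify_toc_style`, `split_title_page`, and `parse_toc` are all hypothetical.

```python
import re
from typing import List, Optional, Tuple

# NOTE: every pattern and threshold below is an illustrative assumption,
# not a rule taken from the paper.
ORDERED_RE = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?\s+\S")  # "1", "1.2", "2."
DIVIDER_RE = re.compile(r"^\s*(part|chapter|section|unit)\b", re.IGNORECASE)
PAGE_RE = re.compile(r"^(.*?)[\s.]*(\d+)\s*$")  # title, dot leaders, page

Entry = Tuple[int, str, Optional[int]]  # (level, title, page)


def classify_toc_style(lines: List[str]) -> str:
    """Guess one of the three styles named in the abstract (assumed heuristics)."""
    entries = [ln for ln in lines if ln.strip()]
    if not entries:
        return "flat"
    ordered = sum(1 for ln in entries if ORDERED_RE.match(ln))
    if ordered / len(entries) >= 0.5:  # mostly numbered entries
        return "ordered"
    if any(DIVIDER_RE.match(ln) for ln in entries):  # has divider headings
        return "divided"
    return "flat"


def split_title_page(line: str) -> Tuple[str, Optional[int]]:
    """Split a trailing page number from an entry, if one is present."""
    text = line.strip()
    m = PAGE_RE.match(text)
    if m and m.group(1).strip():
        return m.group(1).strip(), int(m.group(2))
    return text, None


def parse_toc(lines: List[str]) -> List[Entry]:
    """Pick level-assignment rules according to the classified style."""
    style = classify_toc_style(lines)
    parsed: List[Entry] = []
    for ln in lines:
        if not ln.strip():
            continue
        title, page = split_title_page(ln)
        if style == "ordered":
            # Depth follows the numbering: "1" -> 1, "1.1" -> 2.
            m = ORDERED_RE.match(ln)
            level = m.group(1).count(".") + 1 if m else 1
        elif style == "divided":
            # Divider headings are level 1; entries under them are level 2.
            level = 1 if DIVIDER_RE.match(ln) else 2
        else:
            level = 1  # flat: no hierarchy
        parsed.append((level, title, page))
    return parsed
```

Calling `parse_toc(["1 Introduction 1", "1.1 Background 3"])` would take the "ordered" branch and derive levels from the numbering. A real system would presumably use layout features (indentation, font, spacing) that plain text lines do not carry.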

Original language: English
Title of host publication: Proceedings of the International Conference on Document Analysis and Recognition, ICDAR
Pages: 1205-1209
Number of pages: 5
DOIs: https://doi.org/10.1109/ICDAR.2013.244
Publication status: Published - 2013
Externally published: Yes
Event: 12th International Conference on Document Analysis and Recognition, ICDAR 2013 - Washington, DC, United States
Duration: 25 Aug 2013 to 28 Aug 2013

ASJC Scopus subject areas

  • Computer Vision and Pattern Recognition

Cite this

Wu, Z., Mitra, P., & Giles, C. L. (2013). Table of contents recognition and extraction for heterogeneous book documents. In Proceedings of the International Conference on Document Analysis and Recognition, ICDAR (pp. 1205-1209). [6628805] https://doi.org/10.1109/ICDAR.2013.244

@inproceedings{8a13d264749049efaaeff7c5c415be76,
title = "Table of contents recognition and extraction for heterogeneous book documents",
abstract = "Existing work on book table of contents (TOC) recognition has relied almost entirely on small, application-dependent, domain-specific datasets. However, the TOCs of books from different domains differ significantly in visual layout and style, which makes TOC recognition challenging for a large-scale collection of heterogeneous books. We observe that TOCs fall into three basic styles, namely ``flat'', ``ordered'', and ``divided'', and this observation gives insight into how to parse TOCs effectively. We therefore propose a new TOC recognition approach that adaptively selects the most appropriate TOC parsing rules based on which of these three styles a TOC is classified as. An evaluation on more than 25,000 book documents from various domains demonstrates the approach's effectiveness and efficiency.",
author = "Zhaohui Wu and Prasenjit Mitra and Giles, {C. Lee}",
year = "2013",
doi = "10.1109/ICDAR.2013.244",
language = "English",
pages = "1205--1209",
booktitle = "Proceedings of the International Conference on Document Analysis and Recognition, ICDAR",

}

TY - GEN

T1 - Table of contents recognition and extraction for heterogeneous book documents

AU - Wu, Zhaohui

AU - Mitra, Prasenjit

AU - Giles, C. Lee

PY - 2013

Y1 - 2013

N2 - Existing work on book table of contents (TOC) recognition has relied almost entirely on small, application-dependent, domain-specific datasets. However, the TOCs of books from different domains differ significantly in visual layout and style, which makes TOC recognition challenging for a large-scale collection of heterogeneous books. We observe that TOCs fall into three basic styles, namely "flat", "ordered", and "divided", and this observation gives insight into how to parse TOCs effectively. We therefore propose a new TOC recognition approach that adaptively selects the most appropriate TOC parsing rules based on which of these three styles a TOC is classified as. An evaluation on more than 25,000 book documents from various domains demonstrates the approach's effectiveness and efficiency.

AB - Existing work on book table of contents (TOC) recognition has relied almost entirely on small, application-dependent, domain-specific datasets. However, the TOCs of books from different domains differ significantly in visual layout and style, which makes TOC recognition challenging for a large-scale collection of heterogeneous books. We observe that TOCs fall into three basic styles, namely "flat", "ordered", and "divided", and this observation gives insight into how to parse TOCs effectively. We therefore propose a new TOC recognition approach that adaptively selects the most appropriate TOC parsing rules based on which of these three styles a TOC is classified as. An evaluation on more than 25,000 book documents from various domains demonstrates the approach's effectiveness and efficiency.

UR - http://www.scopus.com/inward/record.url?scp=84889577640&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84889577640&partnerID=8YFLogxK

U2 - 10.1109/ICDAR.2013.244

DO - 10.1109/ICDAR.2013.244

M3 - Conference contribution

SP - 1205

EP - 1209

BT - Proceedings of the International Conference on Document Analysis and Recognition, ICDAR

ER -