Table header detection and classification

Jing Fang, Prasenjit Mitra, Zhi Tang, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

20 Citations (Scopus)

Abstract

In digital libraries, a table is both a specific document component and a condensed way to present structured and relational data; it contains rich information and is often the only source of that information. To explore, retrieve, and reuse that data, tables must be identified and their data extracted. Table recognition is an old field of research. However, because table styles are so diverse, the results are still far from satisfactory, and no single algorithm performs well on all types of tables. In this paper, we take random samples from CiteSeer to investigate diverse table styles for automatic table extraction. We find that table headers are one of the main characteristics of complex table styles. We identify a set of features that can be used to separate headers from tabular data and build a classifier to detect table headers. Our empirical evaluation on PDF documents shows that a Random Forest classifier achieves an accuracy of 92%.
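The abstract does not list the features or the training setup, so the following is only a minimal sketch of the general idea: turn each table row into numeric features and measure how often a header/data decision matches the true label. The feature names (`numeric_ratio`, `alpha_ratio`, `empty_ratio`), the threshold rule `looks_like_header`, and the toy rows are all assumptions made for illustration. They are not the paper's feature set, and the simple rule stands in for a trained classifier such as the Random Forest the authors evaluate.

```python
import re

# Matches plain numbers such as "12", "-3.5", "1,200" or "45%".
NUMERIC = re.compile(r"^[-+]?\d[\d,]*(\.\d+)?%?$")

def row_features(cells):
    """Map one table row (a list of cell strings) to illustrative features."""
    filled = [c.strip() for c in cells if c.strip()]
    n = len(filled) or 1  # avoid division by zero on an all-empty row
    return {
        # Share of non-empty cells that are purely numeric.
        "numeric_ratio": sum(bool(NUMERIC.match(c)) for c in filled) / n,
        # Share of non-empty cells containing at least one letter.
        "alpha_ratio": sum(any(ch.isalpha() for ch in c) for c in filled) / n,
        # Share of cells that are empty.
        "empty_ratio": 1 - len(filled) / max(len(cells), 1),
    }

def looks_like_header(cells, max_numeric=0.5):
    """Hypothetical threshold rule standing in for a trained classifier."""
    return row_features(cells)["numeric_ratio"] < max_numeric

def accuracy(predicted, actual):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

# Toy table invented for this example: one header row, two data rows.
rows = [
    ["Region", "Year", "Count"],
    ["North", "2010", "12"],
    ["South", "2011", "7"],
]
labels = [True, False, False]  # True marks a header row

predictions = [looks_like_header(r) for r in rows]
print(accuracy(predictions, labels))
```

In a fuller pipeline, the dictionaries returned by `row_features` would be fed to a learned model in place of the fixed threshold, and accuracy would be computed on held-out rows.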

Original language: English
Title of host publication: Proceedings of the National Conference on Artificial Intelligence
Pages: 599-605
Number of pages: 7
Volume: 1
Publication status: Published - 2012
Externally published: Yes
Event: 26th AAAI Conference on Artificial Intelligence and the 24th Innovative Applications of Artificial Intelligence Conference, AAAI-12 / IAAI-12 - Toronto, ON
Duration: 22 Jul 2012 to 26 Jul 2012


Fingerprint

Classifiers
Digital libraries

ASJC Scopus subject areas

  • Software
  • Artificial Intelligence

Cite this

Fang, J., Mitra, P., Tang, Z., & Giles, C. L. (2012). Table header detection and classification. In Proceedings of the National Conference on Artificial Intelligence (Vol. 1, pp. 599-605)

@inproceedings{f1c12743cdda4017b9d519d3ef1d3973,
title = "Table header detection and classification",
author = "Jing Fang and Prasenjit Mitra and Zhi Tang and Giles, {C. Lee}",
year = "2012",
language = "English",
isbn = "9781577355687",
volume = "1",
pages = "599--605",
booktitle = "Proceedings of the National Conference on Artificial Intelligence",

}

TY - GEN

T1 - Table header detection and classification

AU - Fang, Jing

AU - Mitra, Prasenjit

AU - Tang, Zhi

AU - Giles, C. Lee

PY - 2012

Y1 - 2012

UR - http://www.scopus.com/inward/record.url?scp=84868278666&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84868278666&partnerID=8YFLogxK

M3 - Conference contribution

SN - 9781577355687

VL - 1

SP - 599

EP - 605

BT - Proceedings of the National Conference on Artificial Intelligence

ER -