Identifying table boundaries in digital documents via sparse line detection

Ying Liu, Prasenjit Mitra, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

24 Citations (Scopus)

Abstract

Most prior work on information extraction has focused on extracting information from the text of digital documents. Often, however, the most important information reported in an article is presented in tabular form. If the data reported in tables can be extracted and stored in a database, it can be queried and joined with other data using database management systems. To prepare the data source for table search, accurate table boundary detection is crucial for the later table structure decomposition. Table boundary detection and content extraction are challenging because tabular formats are not standardized across documents. In this paper, we propose a simple but effective preprocessing method that improves table boundary detection performance by considering the sparse-line property of table rows. Our method reduces the table boundary detection problem to a sparse line analysis problem with much less noise. We design eight line label types and apply two machine learning techniques, Conditional Random Fields (CRF) and Support Vector Machines (SVM), to table boundary detection. The experimental results not only compare the performance of the machine learning methods with that of the heuristic-based method, but also demonstrate the effectiveness of sparse line analysis in table boundary detection.
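The sparse-line idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' method: the abstract does not define a sparse line, so the test used here (a line is sparse if it contains a wide internal whitespace gap or is much shorter than the page width) is an assumption, as are the names `is_sparse` and `candidate_regions` and the thresholds `width`, `min_fill`, and `min_rows`. The sketch only shows how filtering to sparse lines narrows the search to a few candidate regions, which a labeling step such as CRF or SVM would then have to confirm or reject.

```python
import re
from typing import List, Tuple

# Assumption: two or more consecutive whitespace characters inside a line
# suggest a column gap. The paper's actual criteria are not given in the abstract.
GAP = re.compile(r"\s{2,}")


def is_sparse(line: str, width: int = 80, min_fill: float = 0.5) -> bool:
    """Hypothetical sparse-line test with illustrative thresholds."""
    text = line.strip()
    if not text:
        return False  # blank lines are treated as non-sparse in this sketch
    has_gap = GAP.search(text) is not None
    is_short = len(text) < min_fill * width
    return has_gap or is_short


def candidate_regions(lines: List[str], min_rows: int = 2) -> List[Tuple[int, int]]:
    """Group consecutive sparse lines into inclusive (start, end) index ranges."""
    regions: List[Tuple[int, int]] = []
    start = None
    for i, line in enumerate(lines):
        if is_sparse(line):
            if start is None:
                start = i  # open a new run of sparse lines
        else:
            if start is not None and i - start >= min_rows:
                regions.append((start, i - 1))
            start = None  # a non-sparse line closes any open run
    # Close a run that extends to the end of the document.
    if start is not None and len(lines) - start >= min_rows:
        regions.append((start, len(lines) - 1))
    return regions
```

Calling `candidate_regions` on the lines of a page returns index ranges that may contain tables. Short non-table lines such as headings and captions also pass this test, which is why the abstract pairs sparse-line filtering with line labeling rather than relying on the filter alone.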

Original language: English
Title of host publication: International Conference on Information and Knowledge Management, Proceedings
Pages: 1311-1320
Number of pages: 10
DOIs: https://doi.org/10.1145/1458082.1458255
Publication status: Published - 2008
Externally published: Yes
Event: 17th ACM Conference on Information and Knowledge Management, CIKM'08 - Napa Valley, CA
Duration: 26 Oct 2008 to 30 Oct 2008

Other

Other: 17th ACM Conference on Information and Knowledge Management, CIKM'08
City: Napa Valley, CA
Period: 26/10/08 to 30/10/08

Fingerprint

Machine learning
Data base
Decomposition
Support vector machine
Database management systems
Conditional random fields
Learning methods
Information extraction
Data sources

Keywords

  • Conditional random field
  • Sparse line property
  • Support vector machine
  • Table boundary detection
  • Table data collection
  • Table labeling

ASJC Scopus subject areas

  • Business, Management and Accounting(all)
  • Decision Sciences(all)

Cite this

Liu, Y., Mitra, P., & Giles, C. L. (2008). Identifying table boundaries in digital documents via sparse line detection. In International Conference on Information and Knowledge Management, Proceedings (pp. 1311-1320) https://doi.org/10.1145/1458082.1458255

@inproceedings{b858ff91b56b46508ce9c6bea1f9f742,
title = "Identifying table boundaries in digital documents via sparse line detection",
keywords = "Conditional random field, Sparse line property, Support vector machine, Table boundary detection, Table data collection, Table labeling",
author = "Liu, Ying and Mitra, Prasenjit and Giles, {C. Lee}",
year = "2008",
doi = "10.1145/1458082.1458255",
language = "English",
isbn = "9781595939913",
pages = "1311--1320",
booktitle = "International Conference on Information and Knowledge Management, Proceedings",

}

TY - GEN

T1 - Identifying table boundaries in digital documents via sparse line detection

AU - Liu, Ying

AU - Mitra, Prasenjit

AU - Giles, C. Lee

PY - 2008

Y1 - 2008

KW - Conditional random field

KW - Sparse line property

KW - Support vector machine

KW - Table boundary detection

KW - Table data collection

KW - Table labeling

UR - http://www.scopus.com/inward/record.url?scp=70349260831&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=70349260831&partnerID=8YFLogxK

U2 - 10.1145/1458082.1458255

DO - 10.1145/1458082.1458255

M3 - Conference contribution

SN - 9781595939913

SP - 1311

EP - 1320

BT - International Conference on Information and Knowledge Management, Proceedings

ER -