Improving the table boundary detection in PDFs by fixing the sequence error of the sparse lines

Ying Liu, Kun Bai, Prasenjit Mitra, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

20 Citations (Scopus)

Abstract

With the rapid growth of PDF documents, recognizing document structure and components is useful for document storage, classification and retrieval. Tables, a ubiquitous document component, are an important information source. Accurately detecting the table boundary plays a crucial role in many applications, e.g., meeting the increasing demand for table data search. Rather than converting PDFs to images or HTML and then processing them with other techniques (e.g., OCR), extracting and analyzing text directly from PDFs is easy and accurate. However, text extraction tools face a common problem: text sequence errors. In this paper, we propose two algorithms to recover the sequence of extracted sparse lines, which improves table content collection. The experimental results compare the performance of both algorithms and demonstrate the effectiveness of text sequence recovery for table boundary detection.

Original language: English
Title of host publication: Proceedings of the International Conference on Document Analysis and Recognition, ICDAR
Pages: 1006-1010
Number of pages: 5
DOI: https://doi.org/10.1109/ICDAR.2009.138
Publication status: Published - 2009
Externally published: Yes
Event: ICDAR2009 - 10th International Conference on Document Analysis and Recognition - Barcelona
Duration: 26 Jul 2009 - 29 Jul 2009

Other

Other: ICDAR2009 - 10th International Conference on Document Analysis and Recognition
City: Barcelona
Period: 26/7/09 - 29/7/09

Fingerprint

Optical character recognition
HTML
Processing

ASJC Scopus subject areas

  • Computer Vision and Pattern Recognition

Cite this

Liu, Y., Bai, K., Mitra, P., & Giles, C. L. (2009). Improving the table boundary detection in PDFs by fixing the sequence error of the sparse lines. In Proceedings of the International Conference on Document Analysis and Recognition, ICDAR (pp. 1006-1010). [5277535] https://doi.org/10.1109/ICDAR.2009.138

Improving the table boundary detection in PDFs by fixing the sequence error of the sparse lines. / Liu, Ying; Bai, Kun; Mitra, Prasenjit; Giles, C. Lee.

Proceedings of the International Conference on Document Analysis and Recognition, ICDAR. 2009. p. 1006-1010 5277535.

Liu, Y, Bai, K, Mitra, P & Giles, CL 2009, Improving the table boundary detection in PDFs by fixing the sequence error of the sparse lines. in Proceedings of the International Conference on Document Analysis and Recognition, ICDAR., 5277535, pp. 1006-1010, ICDAR2009 - 10th International Conference on Document Analysis and Recognition, Barcelona, 26/7/09. https://doi.org/10.1109/ICDAR.2009.138
Liu Y, Bai K, Mitra P, Giles CL. Improving the table boundary detection in PDFs by fixing the sequence error of the sparse lines. In Proceedings of the International Conference on Document Analysis and Recognition, ICDAR. 2009. p. 1006-1010. 5277535 https://doi.org/10.1109/ICDAR.2009.138
Liu, Ying ; Bai, Kun ; Mitra, Prasenjit ; Giles, C. Lee. / Improving the table boundary detection in PDFs by fixing the sequence error of the sparse lines. Proceedings of the International Conference on Document Analysis and Recognition, ICDAR. 2009. pp. 1006-1010
@inproceedings{cb97c04902d04e13b0d5e74f1b5cc48e,
title = "Improving the table boundary detection in PDFs by fixing the sequence error of the sparse lines",
abstract = "With the rapid growth of PDF documents, recognizing document structure and components is useful for document storage, classification and retrieval. Tables, a ubiquitous document component, are an important information source. Accurately detecting the table boundary plays a crucial role in many applications, e.g., meeting the increasing demand for table data search. Rather than converting PDFs to images or HTML and then processing them with other techniques (e.g., OCR), extracting and analyzing text directly from PDFs is easy and accurate. However, text extraction tools face a common problem: text sequence errors. In this paper, we propose two algorithms to recover the sequence of extracted sparse lines, which improves table content collection. The experimental results compare the performance of both algorithms and demonstrate the effectiveness of text sequence recovery for table boundary detection.",
author = "Liu, Ying and Bai, Kun and Mitra, Prasenjit and Giles, {C. Lee}",
year = "2009",
doi = "10.1109/ICDAR.2009.138",
language = "English",
isbn = "9780769537252",
pages = "1006--1010",
booktitle = "Proceedings of the International Conference on Document Analysis and Recognition, ICDAR",

}

TY - GEN

T1 - Improving the table boundary detection in PDFs by fixing the sequence error of the sparse lines

AU - Liu, Ying

AU - Bai, Kun

AU - Mitra, Prasenjit

AU - Giles, C. Lee

PY - 2009

Y1 - 2009

AB - With the rapid growth of PDF documents, recognizing document structure and components is useful for document storage, classification and retrieval. Tables, a ubiquitous document component, are an important information source. Accurately detecting the table boundary plays a crucial role in many applications, e.g., meeting the increasing demand for table data search. Rather than converting PDFs to images or HTML and then processing them with other techniques (e.g., OCR), extracting and analyzing text directly from PDFs is easy and accurate. However, text extraction tools face a common problem: text sequence errors. In this paper, we propose two algorithms to recover the sequence of extracted sparse lines, which improves table content collection. The experimental results compare the performance of both algorithms and demonstrate the effectiveness of text sequence recovery for table boundary detection.

UR - http://www.scopus.com/inward/record.url?scp=71249084337&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=71249084337&partnerID=8YFLogxK

U2 - 10.1109/ICDAR.2009.138

DO - 10.1109/ICDAR.2009.138

M3 - Conference contribution

SN - 9780769537252

SP - 1006

EP - 1010

BT - Proceedings of the International Conference on Document Analysis and Recognition, ICDAR

ER -