Automated analysis of images in documents for intelligent document search

Xiaonan Lu, Saurabh Kataria, William J. Brouwer, James Z. Wang, Prasenjit Mitra, C. Lee Giles

Research output: Contribution to journal › Article

38 Citations (Scopus)

Abstract

Authors use images to present a wide variety of important information in documents. For example, two-dimensional (2-D) plots display important data in scientific publications. Often, end-users seek to extract this data and convert it into a machine-processible form so that the data can be analyzed automatically or compared with other existing data. Existing document data extraction tools are semi-automatic and require users to provide metadata and interactively extract the data. In this paper, we describe a system that extracts data from documents fully automatically, completely eliminating the need for human intervention. The system uses a supervised learning-based algorithm to classify figures in digital documents into five classes: photographs, 2-D plots, 3-D plots, diagrams, and others. Then, an integrated algorithm is used to extract numerical data from data points and lines in the 2-D plot images along with the axes and their labels, the data symbols in the figure's legend and their associated labels. We demonstrate that the proposed system and its component algorithms are effective via an empirical evaluation. Our data extraction system has the potential to be a vital component in high volume digital libraries.

Original language: English
Pages (from-to): 65-81
Number of pages: 17
Journal: International Journal on Document Analysis and Recognition
Volume: 12
Issue number: 2
DOI: 10.1007/s10032-009-0081-0
Publication status: Published - 2009
Externally published: Yes

Fingerprint

Labels
Digital libraries
Supervised learning
Metadata

Keywords

  • 2-D plot
  • Data extraction
  • Document search
  • Figure
  • Image
  • Text block extraction

ASJC Scopus subject areas

  • Software
  • Computer Vision and Pattern Recognition
  • Computer Science Applications

Cite this

Automated analysis of images in documents for intelligent document search. / Lu, Xiaonan; Kataria, Saurabh; Brouwer, William J.; Wang, James Z.; Mitra, Prasenjit; Giles, C. Lee.

In: International Journal on Document Analysis and Recognition, Vol. 12, No. 2, 2009, p. 65-81.

Lu, Xiaonan ; Kataria, Saurabh ; Brouwer, William J. ; Wang, James Z. ; Mitra, Prasenjit ; Giles, C. Lee. / Automated analysis of images in documents for intelligent document search. In: International Journal on Document Analysis and Recognition. 2009 ; Vol. 12, No. 2. pp. 65-81.
@article{7201de7a60c4460fb2901eb427bf13d1,
title = "Automated analysis of images in documents for intelligent document search",
abstract = "Authors use images to present a wide variety of important information in documents. For example, two-dimensional (2-D) plots display important data in scientific publications. Often, end-users seek to extract this data and convert it into a machine-processible form so that the data can be analyzed automatically or compared with other existing data. Existing document data extraction tools are semi-automatic and require users to provide metadata and interactively extract the data. In this paper, we describe a system that extracts data from documents fully automatically, completely eliminating the need for human intervention. The system uses a supervised learning-based algorithm to classify figures in digital documents into five classes: photographs, 2-D plots, 3-D plots, diagrams, and others. Then, an integrated algorithm is used to extract numerical data from data points and lines in the 2-D plot images along with the axes and their labels, the data symbols in the figure's legend and their associated labels. We demonstrate that the proposed system and its component algorithms are effective via an empirical evaluation. Our data extraction system has the potential to be a vital component in high volume digital libraries.",
keywords = "2-D plot, Data extraction, Document search, Figure, Image, Text block extraction",
author = "Lu, Xiaonan and Kataria, Saurabh and Brouwer, {William J.} and Wang, {James Z.} and Mitra, Prasenjit and Giles, {C. Lee}",
year = "2009",
doi = "10.1007/s10032-009-0081-0",
language = "English",
volume = "12",
pages = "65--81",
journal = "International Journal on Document Analysis and Recognition",
issn = "1433-2833",
publisher = "Springer Verlag",
number = "2",
}

TY - JOUR

T1 - Automated analysis of images in documents for intelligent document search

AU - Lu, Xiaonan

AU - Kataria, Saurabh

AU - Brouwer, William J.

AU - Wang, James Z.

AU - Mitra, Prasenjit

AU - Giles, C. Lee

PY - 2009

Y1 - 2009

N2 - Authors use images to present a wide variety of important information in documents. For example, two-dimensional (2-D) plots display important data in scientific publications. Often, end-users seek to extract this data and convert it into a machine-processible form so that the data can be analyzed automatically or compared with other existing data. Existing document data extraction tools are semi-automatic and require users to provide metadata and interactively extract the data. In this paper, we describe a system that extracts data from documents fully automatically, completely eliminating the need for human intervention. The system uses a supervised learning-based algorithm to classify figures in digital documents into five classes: photographs, 2-D plots, 3-D plots, diagrams, and others. Then, an integrated algorithm is used to extract numerical data from data points and lines in the 2-D plot images along with the axes and their labels, the data symbols in the figure's legend and their associated labels. We demonstrate that the proposed system and its component algorithms are effective via an empirical evaluation. Our data extraction system has the potential to be a vital component in high volume digital libraries.

AB - Authors use images to present a wide variety of important information in documents. For example, two-dimensional (2-D) plots display important data in scientific publications. Often, end-users seek to extract this data and convert it into a machine-processible form so that the data can be analyzed automatically or compared with other existing data. Existing document data extraction tools are semi-automatic and require users to provide metadata and interactively extract the data. In this paper, we describe a system that extracts data from documents fully automatically, completely eliminating the need for human intervention. The system uses a supervised learning-based algorithm to classify figures in digital documents into five classes: photographs, 2-D plots, 3-D plots, diagrams, and others. Then, an integrated algorithm is used to extract numerical data from data points and lines in the 2-D plot images along with the axes and their labels, the data symbols in the figure's legend and their associated labels. We demonstrate that the proposed system and its component algorithms are effective via an empirical evaluation. Our data extraction system has the potential to be a vital component in high volume digital libraries.

KW - 2-D plot

KW - Data extraction

KW - Document search

KW - Figure

KW - Image

KW - Text block extraction

UR - http://www.scopus.com/inward/record.url?scp=67650417928&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=67650417928&partnerID=8YFLogxK

U2 - 10.1007/s10032-009-0081-0

DO - 10.1007/s10032-009-0081-0

M3 - Article

VL - 12

SP - 65

EP - 81

JO - International Journal on Document Analysis and Recognition

JF - International Journal on Document Analysis and Recognition

SN - 1433-2833

IS - 2

ER -