Automatic extraction of data points and text blocks from 2-dimensional plots in digital documents

Saurabh Kataria, William Browuer, Prasenjit Mitra, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

15 Citations (Scopus)

Abstract

Two-dimensional (2-D) plots in digital documents on the web are an important source of information that is largely under-utilized. In this paper, we outline how data and text can be extracted automatically from these 2-D plots, thus eliminating a time-consuming manual process. Our information extraction algorithm identifies the axes of the figures, extracts text blocks such as axis labels and legends, and identifies data points in the figure. It also extracts the units appearing in the axis labels and segments the legends to identify the different lines in the legend, the different symbols, and their associated text explanations. Our algorithm also performs the challenging task of separating overlapping text and data points effectively. Our experiments indicate that these techniques are computationally efficient and provide acceptable accuracy.
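The abstract does not say how the axes are located, so the following is only an illustrative sketch of the first step (axis identification), not the authors' method. It assumes the plot has already been binarized into a grid of 0/1 pixels and that each axis is the longest straight run of ink in its direction. The names `Image`, `longest_run`, and `find_axes` are hypothetical and chosen for this example.

```python
from typing import List, Tuple

# Assumption: the plot is a binarized raster, 1 = ink pixel, 0 = background.
Image = List[List[int]]


def longest_run(values: List[int]) -> int:
    """Length of the longest consecutive run of 1s in a sequence."""
    best = current = 0
    for v in values:
        current = current + 1 if v else 0
        best = max(best, current)
    return best


def find_axes(img: Image) -> Tuple[int, int]:
    """Return (x_axis_row, y_axis_col): the row and the column with the longest ink run."""
    rows = range(len(img))
    cols = range(len(img[0]))
    x_axis_row = max(rows, key=lambda r: longest_run(img[r]))
    y_axis_col = max(cols, key=lambda c: longest_run([row[c] for row in img]))
    return x_axis_row, y_axis_col


# Toy 6x7 plot: y-axis in column 1, x-axis in row 4, plus a few stray ink pixels.
plot = [
    [0, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0],
]
axes = find_axes(plot)  # (4, 1)
```

Using the longest run rather than the raw ink count per row keeps dense clusters of data points or text from being mistaken for an axis. Real scanned figures would additionally need tolerance for skew and broken lines, which this sketch omits.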

Original language: English
Title of host publication: Proceedings of the National Conference on Artificial Intelligence
Pages: 1169-1174
Number of pages: 6
Volume: 2
Publication status: Published - 2008
Externally published: Yes
Event: 23rd AAAI Conference on Artificial Intelligence and the 20th Innovative Applications of Artificial Intelligence Conference, AAAI-08/IAAI-08 - Chicago, IL
Duration: 13 Jul 2008 to 17 Jul 2008

Other

Other: 23rd AAAI Conference on Artificial Intelligence and the 20th Innovative Applications of Artificial Intelligence Conference, AAAI-08/IAAI-08
City: Chicago, IL
Period: 13/7/08 to 17/7/08

ASJC Scopus subject areas

  • Software
  • Artificial Intelligence

Cite this

Kataria, S., Browuer, W., Mitra, P., & Giles, C. L. (2008). Automatic extraction of data points and text blocks from 2-dimensional plots in digital documents. In Proceedings of the National Conference on Artificial Intelligence (Vol. 2, pp. 1169-1174)

@inproceedings{bcd271ba432548909c83ec9366ac6f71,
title = "Automatic extraction of data points and text blocks from 2-dimensional plots in digital documents",
abstract = "Two dimensional plots (2-D) in digital documents on the web are an important source of information that is largely under-utilized. In this paper, we outline how data and text can be extracted automatically from these 2-D plots, thus eliminating a time consuming manual process. Our information extraction algorithm identifies the axes of the figures, extracts text blocks like axes-labels and legends and identifies data points in the figure. It also extracts the units appearing in the axes labels and segments the legends to identify the different lines in the legend, the different symbols and their associated text explanations. Our algorithm also performs the challenging task of separating out overlapping text and data points effectively. Our experiments indicate that these techniques are computationally efficient and provide acceptable accuracy.",
author = "Saurabh Kataria and William Browuer and Prasenjit Mitra and Giles, {C. Lee}",
year = "2008",
language = "English",
isbn = "9781577353683",
volume = "2",
pages = "1169--1174",
booktitle = "Proceedings of the National Conference on Artificial Intelligence",

}

TY - GEN

T1 - Automatic extraction of data points and text blocks from 2-dimensional plots in digital documents

AU - Kataria, Saurabh

AU - Browuer, William

AU - Mitra, Prasenjit

AU - Giles, C. Lee

PY - 2008

Y1 - 2008

AB - Two dimensional plots (2-D) in digital documents on the web are an important source of information that is largely under-utilized. In this paper, we outline how data and text can be extracted automatically from these 2-D plots, thus eliminating a time consuming manual process. Our information extraction algorithm identifies the axes of the figures, extracts text blocks like axes-labels and legends and identifies data points in the figure. It also extracts the units appearing in the axes labels and segments the legends to identify the different lines in the legend, the different symbols and their associated text explanations. Our algorithm also performs the challenging task of separating out overlapping text and data points effectively. Our experiments indicate that these techniques are computationally efficient and provide acceptable accuracy.

UR - http://www.scopus.com/inward/record.url?scp=57749191459&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=57749191459&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:57749191459

SN - 9781577353683

VL - 2

SP - 1169

EP - 1174

BT - Proceedings of the National Conference on Artificial Intelligence

ER -