Figure metadata extraction from digital documents

Sagnik Ray Choudhury, Prasenjit Mitra, Andi Kirk, Silvia Szep, Donald Pellegrino, Sue Jones, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

23 Citations (Scopus)

Abstract

Academic papers contain multiple figures (information graphics) representing important findings and experimental results. Automatic data extraction from such figures and classification of information graphics are not straightforward, and both are well-studied problems in document analysis \cite{4275059}. Also, very few digital library search engines index figures and/or associated metadata (figure captions) from PDF documents. We describe the very first step in indexing, classification and data extraction from figures in PDF documents: accurate automatic extraction of figures and associated metadata, a nontrivial task. Document layout, font information, and lexical and linguistic features are considered for figure caption extraction from PDF documents, using both rule-based and machine-learning-based approaches. We also describe a digital library search engine that indexes figure captions and mentions from 150K documents, extracted by our custom-built extractor.
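As a rough illustration of the rule-based, lexical side of caption extraction mentioned in the abstract, the following minimal sketch finds caption lines and in-text figure mentions in text that has already been extracted from a PDF. The regular expression, the function names `find_captions` and `find_mentions`, and the assumption that captions begin with "Figure N:" or "Fig. N." are hypothetical choices made for illustration; they are not the rules or features used in the paper, which also draws on layout and font information.

```python
import re

# Hypothetical lexical rule (an assumption, not the paper's rule set):
# a caption line starts with "Figure" or "Fig." followed by a number
# and a "." or ":" delimiter, then the caption text.
CAPTION_RE = re.compile(
    r"^\s*(?:Figure|Fig\.?)\s*(\d+)\s*[.:]\s*(\S.*)$", re.IGNORECASE
)


def find_captions(lines):
    """Return (figure_number, caption_text) pairs for caption-like lines."""
    captions = []
    for line in lines:
        match = CAPTION_RE.match(line)
        if match:
            captions.append((int(match.group(1)), match.group(2).strip()))
    return captions


def find_mentions(lines, figure_number):
    """Return non-caption lines that refer to the given figure number."""
    mention_re = re.compile(
        rf"\b(?:Figure|Fig\.?)\s*{figure_number}\b", re.IGNORECASE
    )
    return [
        line
        for line in lines
        if mention_re.search(line) and not CAPTION_RE.match(line)
    ]


if __name__ == "__main__":
    sample = [
        "As shown in Figure 1, precision improves.",
        "Figure 1: Precision of the extractor.",
        "Fig. 2. Recall by feature set.",
    ]
    print(find_captions(sample))
    print(find_mentions(sample, 1))
```

A purely lexical rule like this will miss captions with unusual prefixes and cannot tell a caption from body text that happens to start the same way, which is one reason layout, font, and learned features would be combined with it.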

Original language: English
Title of host publication: Proceedings of the International Conference on Document Analysis and Recognition, ICDAR
Pages: 135-139
Number of pages: 5
DOI: https://doi.org/10.1109/ICDAR.2013.34
Publication status: Published - 2013
Externally published: Yes
Event: 12th International Conference on Document Analysis and Recognition, ICDAR 2013 - Washington, DC, United States
Duration: 25 Aug 2013 to 28 Aug 2013

Other

Other: 12th International Conference on Document Analysis and Recognition, ICDAR 2013
Country: United States
City: Washington, DC
Period: 25/8/13 to 28/8/13

Fingerprint

Metadata
Digital libraries
Search engines
Linguistics
Learning systems

Keywords

  • information extraction
  • metadata based figure search

ASJC Scopus subject areas

  • Computer Vision and Pattern Recognition

Cite this

Choudhury, S. R., Mitra, P., Kirk, A., Szep, S., Pellegrino, D., Jones, S., & Giles, C. L. (2013). Figure metadata extraction from digital documents. In Proceedings of the International Conference on Document Analysis and Recognition, ICDAR (pp. 135-139). [6628599] https://doi.org/10.1109/ICDAR.2013.34

@inproceedings{03460e07e5d9469a9a83547386730906,
title = "Figure metadata extraction from digital documents",
abstract = "Academic papers contain multiple figures (information graphics) representing important findings and experimental results. Automatic data extraction from such figures and classification of information graphics are not straightforward, and both are well-studied problems in document analysis \cite{4275059}. Also, very few digital library search engines index figures and/or associated metadata (figure captions) from PDF documents. We describe the very first step in indexing, classification and data extraction from figures in PDF documents: accurate automatic extraction of figures and associated metadata, a nontrivial task. Document layout, font information, and lexical and linguistic features are considered for figure caption extraction from PDF documents, using both rule-based and machine-learning-based approaches. We also describe a digital library search engine that indexes figure captions and mentions from 150K documents, extracted by our custom-built extractor.",
keywords = "information extraction, metadata based figure search",
author = "Choudhury, {Sagnik Ray} and Prasenjit Mitra and Andi Kirk and Silvia Szep and Donald Pellegrino and Sue Jones and Giles, {C. Lee}",
year = "2013",
doi = "10.1109/ICDAR.2013.34",
language = "English",
pages = "135--139",
booktitle = "Proceedings of the International Conference on Document Analysis and Recognition, ICDAR",

}

TY - GEN

T1 - Figure metadata extraction from digital documents

AU - Choudhury, Sagnik Ray

AU - Mitra, Prasenjit

AU - Kirk, Andi

AU - Szep, Silvia

AU - Pellegrino, Donald

AU - Jones, Sue

AU - Giles, C. Lee

PY - 2013

Y1 - 2013

AB - Academic papers contain multiple figures (information graphics) representing important findings and experimental results. Automatic data extraction from such figures and classification of information graphics are not straightforward, and both are well-studied problems in document analysis \cite{4275059}. Also, very few digital library search engines index figures and/or associated metadata (figure captions) from PDF documents. We describe the very first step in indexing, classification and data extraction from figures in PDF documents: accurate automatic extraction of figures and associated metadata, a nontrivial task. Document layout, font information, and lexical and linguistic features are considered for figure caption extraction from PDF documents, using both rule-based and machine-learning-based approaches. We also describe a digital library search engine that indexes figure captions and mentions from 150K documents, extracted by our custom-built extractor.

KW - information extraction

KW - metadata based figure search

UR - http://www.scopus.com/inward/record.url?scp=84889610268&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84889610268&partnerID=8YFLogxK

U2 - 10.1109/ICDAR.2013.34

DO - 10.1109/ICDAR.2013.34

M3 - Conference contribution

SP - 135

EP - 139

BT - Proceedings of the International Conference on Document Analysis and Recognition, ICDAR

ER -