Automatic extraction of figures from scholarly documents

Sagnik Ray Choudhury, Prasenjit Mitra, Clyde Lee Giles

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

6 Citations (Scopus)

Abstract

Scholarly papers (journal and conference papers, technical reports, etc.) usually contain multiple "figures" such as plots, flow charts and other images, which are generated manually to symbolically represent and visually illustrate important concepts, findings and results. These figures can be analyzed for automated data extraction or semantic analysis. Surprisingly, large-scale automated extraction of such figures from PDF documents has received little attention. Here we discuss the challenges of building a heuristic-independent, trainable model for such an extraction task and of extracting figures at scale. Motivated by recent developments in table extraction, we define three new evaluation metrics: figure-precision, figure-recall, and figure-F1-score. Our dataset consists of a sample of 200 PDFs, randomly collected from five million scholarly PDFs and manually tagged for 180 figure locations. Initial results from our work demonstrate an accuracy greater than 80%.
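
The abstract names figure-precision, figure-recall, and figure-F1-score but does not define them. As a rough illustration only, the sketch below assumes they follow the usual precision/recall/F1 pattern computed from counts of correctly extracted figures; that is an assumption, not the paper's definition. The function name `figure_prf` and its parameters are hypothetical.

```python
from typing import Tuple


def figure_prf(correct: int, extracted: int, tagged: int) -> Tuple[float, float, float]:
    """Hypothetical helper: conventional precision, recall and F1 from counts.

    correct   -- extracted figures that match a manually tagged figure
    extracted -- all figures the extractor returned
    tagged    -- all manually tagged (ground-truth) figures
    How a "match" is decided is left open here; the abstract does not say.
    """
    precision = correct / extracted if extracted else 0.0
    recall = correct / tagged if tagged else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        # F1 is the harmonic mean of precision and recall.
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Illustrative counts only; these are not results from the paper.
    p, r, f = figure_prf(correct=8, extracted=10, tagged=16)
    print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Taking plain counts as input keeps the sketch independent of whichever matching criterion is used to decide that an extracted figure is correct.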

Original language: English
Title of host publication: DocEng 2015 - Proceedings of the 2015 ACM Symposium on Document Engineering
Publisher: Association for Computing Machinery, Inc
Pages: 47-50
Number of pages: 4
ISBN (Print): 9781450333078
DOIs: https://doi.org/10.1145/2682571.2797085
Publication status: Published - 8 Sep 2015
Externally published: Yes
Event: ACM Symposium on Document Engineering, DocEng 2015 - Lausanne, Switzerland
Duration: 8 Sep 2015 to 11 Sep 2015

Other

Other: ACM Symposium on Document Engineering, DocEng 2015
Country: Switzerland
City: Lausanne
Period: 8/9/15 to 11/9/15

Keywords

  • Document analysis
  • Figure extraction
  • PDF

ASJC Scopus subject areas

  • Information Systems
  • Software

Cite this

Choudhury, S. R., Mitra, P., & Giles, C. L. (2015). Automatic extraction of figures from scholarly documents. In DocEng 2015 - Proceedings of the 2015 ACM Symposium on Document Engineering (pp. 47-50). Association for Computing Machinery, Inc. https://doi.org/10.1145/2682571.2797085

Automatic extraction of figures from scholarly documents. / Choudhury, Sagnik Ray; Mitra, Prasenjit; Giles, Clyde Lee.

DocEng 2015 - Proceedings of the 2015 ACM Symposium on Document Engineering. Association for Computing Machinery, Inc, 2015. p. 47-50.

Choudhury, SR, Mitra, P & Giles, CL 2015, Automatic extraction of figures from scholarly documents. in DocEng 2015 - Proceedings of the 2015 ACM Symposium on Document Engineering. Association for Computing Machinery, Inc, pp. 47-50, ACM Symposium on Document Engineering, DocEng 2015, Lausanne, Switzerland, 8/9/15. https://doi.org/10.1145/2682571.2797085
Choudhury SR, Mitra P, Giles CL. Automatic extraction of figures from scholarly documents. In DocEng 2015 - Proceedings of the 2015 ACM Symposium on Document Engineering. Association for Computing Machinery, Inc. 2015. p. 47-50 https://doi.org/10.1145/2682571.2797085
Choudhury, Sagnik Ray ; Mitra, Prasenjit ; Giles, Clyde Lee. / Automatic extraction of figures from scholarly documents. DocEng 2015 - Proceedings of the 2015 ACM Symposium on Document Engineering. Association for Computing Machinery, Inc, 2015. pp. 47-50
@inproceedings{e630ecb82247406eb1380e0ce9e8645c,
title = "Automatic extraction of figures from scholarly documents",
abstract = "Scholarly papers (journal and conference papers, technical reports, etc.) usually contain multiple {"}figures{"} such as plots, flow charts and other images which are generated manually to symbolically represent and illustrate visually important concepts, findings and results. These figures can be analyzed for automated data extraction or semantic analysis. Surprisingly, large scale automated extraction of such figures from PDF documents has received little attention. Here we discuss the challenges of how to build a heuristic independent trainable model for such an extraction task and how to extract figures at scale. Motivated by recent developments in table extraction, we define three new evaluation metrics: figure-precision, figure-recall, and figure-F1-score. Our dataset consists of a sample of 200 PDFs, randomly collected from five million scholarly PDFs and manually tagged for 180 figure locations. Initial results from our work demonstrate an accuracy greater than 80{\%}.",
keywords = "Document analysis, Figure extraction, PDF",
author = "Choudhury, {Sagnik Ray} and Prasenjit Mitra and Giles, {Clyde Lee}",
year = "2015",
month = "9",
day = "8",
doi = "10.1145/2682571.2797085",
language = "English",
isbn = "9781450333078",
pages = "47--50",
booktitle = "DocEng 2015 - Proceedings of the 2015 ACM Symposium on Document Engineering",
publisher = "Association for Computing Machinery, Inc",

}

TY - GEN

T1 - Automatic extraction of figures from scholarly documents

AU - Choudhury, Sagnik Ray

AU - Mitra, Prasenjit

AU - Giles, Clyde Lee

PY - 2015/9/8

Y1 - 2015/9/8

AB - Scholarly papers (journal and conference papers, technical reports, etc.) usually contain multiple "figures" such as plots, flow charts and other images which are generated manually to symbolically represent and illustrate visually important concepts, findings and results. These figures can be analyzed for automated data extraction or semantic analysis. Surprisingly, large scale automated extraction of such figures from PDF documents has received little attention. Here we discuss the challenges of how to build a heuristic independent trainable model for such an extraction task and how to extract figures at scale. Motivated by recent developments in table extraction, we define three new evaluation metrics: figure-precision, figure-recall, and figure-F1-score. Our dataset consists of a sample of 200 PDFs, randomly collected from five million scholarly PDFs and manually tagged for 180 figure locations. Initial results from our work demonstrate an accuracy greater than 80%.

KW - Document analysis

KW - Figure extraction

KW - PDF

UR - http://www.scopus.com/inward/record.url?scp=84959235832&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84959235832&partnerID=8YFLogxK

U2 - 10.1145/2682571.2797085

DO - 10.1145/2682571.2797085

M3 - Conference contribution

AN - SCOPUS:84959235832

SN - 9781450333078

SP - 47

EP - 50

BT - DocEng 2015 - Proceedings of the 2015 ACM Symposium on Document Engineering

PB - Association for Computing Machinery, Inc

ER -