Automatic categorization of figures in scientific documents

Xiaonan Lu, Prasenjit Mitra, James Z. Wang, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

22 Citations (Scopus)

Abstract

Figures are important non-textual information in scientific documents. Current digital libraries do not provide users with tools to retrieve documents based on the information available within figures. We propose an architecture for retrieving documents by integrating figures and other information. The initial step in enabling integrated document search is to categorize figures into a set of pre-defined types. We propose several categories of figures based on their functions in scholarly articles. We have developed a machine-learning-based approach for automatic categorization of figures. Both global features, such as texture, and part features, such as lines, are used in the architecture to discriminate among figure categories. The proposed approach has been evaluated on a testbed document set collected from the CiteSeer scientific literature digital library. Experimental evaluation has demonstrated that our algorithms can produce acceptable results for real-world use. Our tools will be integrated into a scientific-document digital library.

Original language: English
Title of host publication: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries
Pages: 129-138
Number of pages: 10
Volume: 2006
DOIs: https://doi.org/10.1145/1141753.1141778
Publication status: Published - 2006
Externally published: Yes
Event: 6th ACM/IEEE-CS Joint Conference on Digital Libraries 2006: Opening Information Horizons, JCDL '06 - Chapel Hill, NC
Duration: 11 Jun 2006 to 15 Jun 2006

Other

Other: 6th ACM/IEEE-CS Joint Conference on Digital Libraries 2006: Opening Information Horizons, JCDL '06
City: Chapel Hill, NC
Period: 11/6/06 to 15/6/06

Fingerprint

Digital libraries
Testbeds
Learning systems
Textures

Keywords

  • Documents
  • Feature extraction
  • Figures
  • Machine learning
  • Scientific literature

ASJC Scopus subject areas

  • Engineering(all)

Cite this

Lu, X., Mitra, P., Wang, J. Z., & Giles, C. L. (2006). Automatic categorization of figures in scientific documents. In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (Vol. 2006, pp. 129-138). https://doi.org/10.1145/1141753.1141778

@inproceedings{712b06ef0610454e8031e2a8180fe652,
title = "Automatic categorization of figures in scientific documents",
abstract = "Figures are very important non-textual information contained in scientific documents. Current digital libraries do not provide users tools to retrieve documents based on the information available within the figures. We propose an architecture for retrieving documents by integrating figures and other information. The initial step in enabling integrated document search is to categorize figures into a set of pre-defined types. We propose several categories of figures based on their functionalities in scholarly articles. We have developed a machine-learning-based approach for automatic categorization of figures. Both global features, such as texture, and part features, such as lines, are utilized in the architecture for discriminating among figure categories. The proposed approach has been evaluated on a testbed document set collected from the CiteSeer scientific literature digital library. Experimental evaluation has demonstrated that our algorithms can produce acceptable results for real-world use. Our tools will be integrated into a scientific-document digital library.",
keywords = "Documents, Feature extraction, Figures, Machine learning, Scientific literature",
author = "Xiaonan Lu and Prasenjit Mitra and Wang, {James Z.} and Giles, {C. Lee}",
year = "2006",
doi = "10.1145/1141753.1141778",
language = "English",
isbn = "1595933549",
volume = "2006",
pages = "129--138",
booktitle = "Proceedings of the ACM/IEEE Joint Conference on Digital Libraries",
}

TY  - GEN
T1  - Automatic categorization of figures in scientific documents
AU  - Lu, Xiaonan
AU  - Mitra, Prasenjit
AU  - Wang, James Z.
AU  - Giles, C. Lee
PY  - 2006
Y1  - 2006
AB  - Figures are very important non-textual information contained in scientific documents. Current digital libraries do not provide users tools to retrieve documents based on the information available within the figures. We propose an architecture for retrieving documents by integrating figures and other information. The initial step in enabling integrated document search is to categorize figures into a set of pre-defined types. We propose several categories of figures based on their functionalities in scholarly articles. We have developed a machine-learning-based approach for automatic categorization of figures. Both global features, such as texture, and part features, such as lines, are utilized in the architecture for discriminating among figure categories. The proposed approach has been evaluated on a testbed document set collected from the CiteSeer scientific literature digital library. Experimental evaluation has demonstrated that our algorithms can produce acceptable results for real-world use. Our tools will be integrated into a scientific-document digital library.
KW  - Documents
KW  - Feature extraction
KW  - Figures
KW  - Machine learning
KW  - Scientific literature
UR  - http://www.scopus.com/inward/record.url?scp=34247258424&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=34247258424&partnerID=8YFLogxK
U2  - 10.1145/1141753.1141778
DO  - 10.1145/1141753.1141778
M3  - Conference contribution
AN  - SCOPUS:34247258424
SN  - 1595933549
SN  - 9781595933546
VL  - 2006
SP  - 129
EP  - 138
BT  - Proceedings of the ACM/IEEE Joint Conference on Digital Libraries
ER  -