Summarizing figures, tables, and algorithms in scientific publications to augment search results

Sumit Bhatia, Prasenjit Mitra

Research output: Contribution to journalArticle

16 Citations (Scopus)

Abstract

Increasingly, special-purpose search engines are being built to enable the retrieval of document-elements like tables, figures, and algorithms [Bhatia et al. 2010; Liu et al. 2007; Hearst et al. 2007]. These search engines present a thumbnail view of document-elements, some document metadata such as the title of the papers and their authors, and the caption of the document-element. While some authors in some disciplines write carefully tailored captions, generally, the author of a document assumes that the caption will be read in the context of the text in the document. When the caption is presented out of context as in a document-elementsearch- engine result, it may not contain enough information to help the end-user understand what the content of the document-element is. Consequently, end-users examining document-element search results would want a short "synopsis" of this information presented along with the document-element. Having access to the synopsis allows the end-user to quickly understand the content of the document-element without having to download and read the entire document as examining the synopsis takes a shorter time than finding information about a document element by downloading, opening and reading the file. Furthermore, it may allow the end-user to examine more results than they would otherwise. In this paper, we present the first set of methods to extract this useful information (synopsis) related to document-elements automatically. We use Naïve Bayes and support vector machine classifiers to identify relevant sentences from the document text based on the similarity and the proximity of the sentences with the caption and the sentences in the document text that refer to the document-element. We compare the two classification methods and study the effects of different features used. We also investigate the problem of choosing the optimum synopsissize that strikes a balance between the information content and the size of the generated synopses. A user study is also performed to measure how the synopses generated by our proposed method compare with other state-of-the-art approaches.

Original languageEnglish
Article number2094075
JournalACM Transactions on Information Systems
Volume30
Issue number1
DOIs
Publication statusPublished - Feb 2012
Externally publishedYes

Fingerprint

Search engines
Metadata
Support vector machines
Classifiers
Engines
End users
Scientific publications
Search engine

Keywords

  • Classification
  • Document-element
  • Summarization
  • Synopses

ASJC Scopus subject areas

  • Information Systems
  • Business, Management and Accounting(all)
  • Computer Science Applications

Cite this

Summarizing figures, tables, and algorithms in scientific publications to augment search results. / Bhatia, Sumit; Mitra, Prasenjit.

In: ACM Transactions on Information Systems, Vol. 30, No. 1, 2094075, 02.2012.

Research output: Contribution to journalArticle

@article{c64f2ce8b3ae4e5f9175315ce44d8e8c,
title = "Summarizing figures, tables, and algorithms in scientific publications to augment search results",
abstract = "Increasingly, special-purpose search engines are being built to enable the retrieval of document-elements like tables, figures, and algorithms [Bhatia et al. 2010; Liu et al. 2007; Hearst et al. 2007]. These search engines present a thumbnail view of document-elements, some document metadata such as the title of the papers and their authors, and the caption of the document-element. While some authors in some disciplines write carefully tailored captions, generally, the author of a document assumes that the caption will be read in the context of the text in the document. When the caption is presented out of context as in a document-elementsearch- engine result, it may not contain enough information to help the end-user understand what the content of the document-element is. Consequently, end-users examining document-element search results would want a short {"}synopsis{"} of this information presented along with the document-element. Having access to the synopsis allows the end-user to quickly understand the content of the document-element without having to download and read the entire document as examining the synopsis takes a shorter time than finding information about a document element by downloading, opening and reading the file. Furthermore, it may allow the end-user to examine more results than they would otherwise. In this paper, we present the first set of methods to extract this useful information (synopsis) related to document-elements automatically. We use Na{\"i}ve Bayes and support vector machine classifiers to identify relevant sentences from the document text based on the similarity and the proximity of the sentences with the caption and the sentences in the document text that refer to the document-element. We compare the two classification methods and study the effects of different features used. We also investigate the problem of choosing the optimum synopsissize that strikes a balance between the information content and the size of the generated synopses. A user study is also performed to measure how the synopses generated by our proposed method compare with other state-of-the-art approaches.",
keywords = "Classification, Document-element, Summarization, Synopses",
author = "Sumit Bhatia and Prasenjit Mitra",
year = "2012",
month = "2",
doi = "10.1145/2094072.2094075",
language = "English",
volume = "30",
journal = "ACM Transactions on Information Systems",
issn = "1046-8188",
publisher = "Association for Computing Machinery (ACM)",
number = "1",

}

TY - JOUR

T1 - Summarizing figures, tables, and algorithms in scientific publications to augment search results

AU - Bhatia, Sumit

AU - Mitra, Prasenjit

PY - 2012/2

Y1 - 2012/2

N2 - Increasingly, special-purpose search engines are being built to enable the retrieval of document-elements like tables, figures, and algorithms [Bhatia et al. 2010; Liu et al. 2007; Hearst et al. 2007]. These search engines present a thumbnail view of document-elements, some document metadata such as the title of the papers and their authors, and the caption of the document-element. While some authors in some disciplines write carefully tailored captions, generally, the author of a document assumes that the caption will be read in the context of the text in the document. When the caption is presented out of context as in a document-elementsearch- engine result, it may not contain enough information to help the end-user understand what the content of the document-element is. Consequently, end-users examining document-element search results would want a short "synopsis" of this information presented along with the document-element. Having access to the synopsis allows the end-user to quickly understand the content of the document-element without having to download and read the entire document as examining the synopsis takes a shorter time than finding information about a document element by downloading, opening and reading the file. Furthermore, it may allow the end-user to examine more results than they would otherwise. In this paper, we present the first set of methods to extract this useful information (synopsis) related to document-elements automatically. We use Naïve Bayes and support vector machine classifiers to identify relevant sentences from the document text based on the similarity and the proximity of the sentences with the caption and the sentences in the document text that refer to the document-element. We compare the two classification methods and study the effects of different features used. We also investigate the problem of choosing the optimum synopsissize that strikes a balance between the information content and the size of the generated synopses. A user study is also performed to measure how the synopses generated by our proposed method compare with other state-of-the-art approaches.

AB - Increasingly, special-purpose search engines are being built to enable the retrieval of document-elements like tables, figures, and algorithms [Bhatia et al. 2010; Liu et al. 2007; Hearst et al. 2007]. These search engines present a thumbnail view of document-elements, some document metadata such as the title of the papers and their authors, and the caption of the document-element. While some authors in some disciplines write carefully tailored captions, generally, the author of a document assumes that the caption will be read in the context of the text in the document. When the caption is presented out of context as in a document-elementsearch- engine result, it may not contain enough information to help the end-user understand what the content of the document-element is. Consequently, end-users examining document-element search results would want a short "synopsis" of this information presented along with the document-element. Having access to the synopsis allows the end-user to quickly understand the content of the document-element without having to download and read the entire document as examining the synopsis takes a shorter time than finding information about a document element by downloading, opening and reading the file. Furthermore, it may allow the end-user to examine more results than they would otherwise. In this paper, we present the first set of methods to extract this useful information (synopsis) related to document-elements automatically. We use Naïve Bayes and support vector machine classifiers to identify relevant sentences from the document text based on the similarity and the proximity of the sentences with the caption and the sentences in the document text that refer to the document-element. We compare the two classification methods and study the effects of different features used. We also investigate the problem of choosing the optimum synopsissize that strikes a balance between the information content and the size of the generated synopses. A user study is also performed to measure how the synopses generated by our proposed method compare with other state-of-the-art approaches.

KW - Classification

KW - Document-element

KW - Summarization

KW - Synopses

UR - http://www.scopus.com/inward/record.url?scp=84859012828&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84859012828&partnerID=8YFLogxK

U2 - 10.1145/2094072.2094075

DO - 10.1145/2094072.2094075

M3 - Article

VL - 30

JO - ACM Transactions on Information Systems

JF - ACM Transactions on Information Systems

SN - 1046-8188

IS - 1

M1 - 2094075

ER -