Book search: Indexing the valuable parts

Walid Magdy, Kareem Darwish

Research output: Conference contribution (Chapter in Book/Report/Conference proceeding)

5 Citations (Scopus)

Abstract

With massive book digitization efforts underway, there is a need for effective book retrieval strategies. This paper explores the relative contribution of different parts of digitized and OCR'ed books towards effective retrieval. The examined parts include the entire content of books, book headings, book titles, and table-of-contents entries. Results show that indexing the headers and titles of books is nearly as effective as indexing the entire contents of books. These results indicate that certain portions of books, specifically titles and headers, are more valuable than other parts. This is akin to web search, where hypertext and page titles are more valuable to index than the rest of the webpage. In addition, a combination-of-evidence approach further improves retrieval effectiveness compared with using any single portion of the book in isolation.
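
The idea of indexing book parts separately and then combining the evidence can be sketched in a few lines. This is only an illustrative sketch: the abstract does not state the retrieval model, the field weights, or the combination method, so the TF-IDF scoring, the weighted sum, the `WEIGHTS` values, the toy `BOOKS` collection, and the function names `field_score` and `combined_score` are all assumptions made for this example.

```python
import math
from collections import Counter

# Hypothetical toy collection: each book is split into the parts named in the abstract.
BOOKS = {
    "b1": {
        "title": "introduction to information retrieval",
        "headings": "indexing ranking evaluation",
        "toc": "boolean retrieval index construction scoring",
        "content": "this book covers indexing and ranking of text documents in detail",
    },
    "b2": {
        "title": "a history of printing",
        "headings": "early presses movable type",
        "toc": "gutenberg presses paper ink",
        "content": "printing changed how text was copied and the index of a book became common",
    },
}

# Assumed field weights (not taken from the paper).
WEIGHTS = {"title": 2.0, "headings": 2.0, "toc": 1.0, "content": 1.0}


def field_score(query, field, books):
    """Score every book against the query using one field as its own index."""
    n = len(books)
    tokens = {bid: Counter(parts[field].split()) for bid, parts in books.items()}
    scores = {}
    for bid, tf in tokens.items():
        total = 0.0
        for term in query.split():
            # Document frequency of the term within this field only.
            df = sum(1 for counts in tokens.values() if term in counts)
            if df:
                total += tf[term] * math.log(1 + n / df)
        scores[bid] = total
    return scores


def combined_score(query, books, weights):
    """Combine per-field scores with a weighted sum and rank books best-first."""
    combined = {bid: 0.0 for bid in books}
    for field, weight in weights.items():
        for bid, score in field_score(query, field, books).items():
            combined[bid] += weight * score
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for book_id, score in combined_score("indexing ranking", BOOKS, WEIGHTS):
        print(book_id, round(score, 3))
```

In this toy data the query terms appear in the headings and content of one book but in neither title, which is the kind of case where a single field in isolation can miss a relevant book while the combined score still ranks it first.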

Original language: English
Title of host publication: International Conference on Information and Knowledge Management, Proceedings
Pages: 53-56
Number of pages: 4
DOI: https://doi.org/10.1145/1458412.1458429
Publication status: Published - 1 Dec 2008
Externally published: Yes
Event: 2008 ACM Workshop on Research Advances in Large Digital Book Repositories, BooksOnline'08, Co-located with the 17th ACM Conference on Information and Knowledge Management, CIKM'08 - Napa Valley, CA, United States
Duration: 26 Oct 2008 to 30 Oct 2008

Fingerprint

Indexing
Isolation
Web search
Hypertext

Keywords

  • Book search
  • OCR retrieval

ASJC Scopus subject areas

  • Business, Management and Accounting (all)
  • Decision Sciences (all)

Cite this

Magdy, W., & Darwish, K. (2008). Book search: Indexing the valuable parts. In International Conference on Information and Knowledge Management, Proceedings (pp. 53-56) https://doi.org/10.1145/1458412.1458429

@inproceedings{e79ccf604a8c4567acbe9dce6a88ab6f,
title = "Book search: Indexing the valuable parts",
keywords = "Book search, OCR retrieval",
author = "Walid Magdy and Kareem Darwish",
year = "2008",
month = "12",
day = "1",
doi = "10.1145/1458412.1458429",
language = "English",
isbn = "9781605582498",
pages = "53--56",
booktitle = "International Conference on Information and Knowledge Management, Proceedings",

}

TY - GEN

T1 - Book search

T2 - Indexing the valuable parts

AU - Magdy, Walid

AU - Darwish, Kareem

PY - 2008/12/1

Y1 - 2008/12/1

KW - Book search

KW - OCR retrieval

UR - http://www.scopus.com/inward/record.url?scp=70349240824&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=70349240824&partnerID=8YFLogxK

U2 - 10.1145/1458412.1458429

DO - 10.1145/1458412.1458429

M3 - Conference contribution

AN - SCOPUS:70349240824

SN - 9781605582498

SP - 53

EP - 56

BT - International Conference on Information and Knowledge Management, Proceedings

ER -