Term selection for searching printed Arabic

Kareem Darwish, Douglas W. Oard

Research output: Contribution to journal, Conference article

41 Citations (Scopus)

Abstract

Since many Arabic documents are available only in print, automating retrieval from collections of scanned Arabic document images using Optical Character Recognition (OCR) is an interesting problem. Arabic combines rich morphology with a writing system that presents unique challenges to OCR systems. These factors must be considered when selecting terms for automatic indexing. In this paper, alternative choices of indexing terms are explored using both an existing electronic text collection and a newly developed collection built from images of actual printed Arabic documents. Character n-grams or lightly stemmed words were found to typically yield near-optimal retrieval effectiveness, and combining both types of terms resulted in robust performance across a broad range of conditions.

Original language: English
Pages (from-to): 261-268
Number of pages: 8
Journal: SIGIR Forum (ACM Special Interest Group on Information Retrieval)
Publication status: Published - 1 Dec 2002
Event: Proceedings of the Twenty-Fifth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval - Tampere, Finland
Duration: 11 Aug 2002 to 15 Aug 2002

Keywords

  • Arabic
  • Information retrieval
  • OCR
  • Term selection

ASJC Scopus subject areas

  • Management Information Systems
  • Hardware and Architecture
